作業系統 Operating System 筆記

由於目前筆記已到Hackmd最大長度限制，如有需要可以參考最新版本

Ch1 Introduction

作業系統(Operating system, OS)是管理電腦硬體與軟體資源的電腦程式，同時也是電腦系統的核心與基石。OS 主要有以下兩個功能：

資源分配者
監控使用者程式的執行，以防止不正常的運作造成對系統的危害

一個標準PC的作業系統應該提供以下的功能：

行程管理(Processing management)
記憶體管理(Memory management)
檔案系統(File system)
網路通訊(Networking)
安全機制(Security)
使用者介面(User interface)
驅動程式(Device drivers)

作業系統類別

1. Mainframe Systems

最早的計算機之一，更好的可靠性與安全性，應用在醫院銀行等，由Batch演進到Time-sharing。

Batch
OS 只是將控制權從一個程式轉移到下一個程式，缺點是一次只有一個程式且無法與使用者交互，CPU使用率低。
Multi-programming
系統中存在多組行程同時(concurrent)執行，透過Spooling(Simultaneous Peripheral Operation On-Line)的機制讓CPU跟I/O時間重疊，提升 CPU 利用度(注意，不是並行執行(parallel))。
Multi-programming OS任務有Memory management、CPU scheduling and I/O system。
Time-sharing
CPU在執行的時候頻繁的在跟device做交換，OS在程式間快速切換，每個程式執行幾個ms，time slot就換下一個程式。
Time-sharing OS任務有Virtual memory、File system、disk management、Process synchronization and deadlock。

2. Computer-system architecture

Desktop Systems
Personal computers (PC) – 計算機系統專用於單個用戶有GUI與許多I/O devices方便操作，像是Windows、MacOS、Unix、Linux，但是缺乏對用戶的文件和作業系統保護。
Parallel Systems
也稱為 multiprocessor or tightly coupled system，多個CPU/核心緊密通訊，通常通過共享內存進行通信，優點是Throughput、Economical、Reliability。通常又可分為SMP與AMP。
- Symmetric multiprocessor system (SMP)
  - 每個處理器運行相同的作業系統
  - 最流行的多處理器架構
  - 需要廣泛的同步來保護數據的完整性
- Asymmetric multiprocessor system (AMP)
  - 每個處理器被分配一個特定的任務
  - 一個Master CPU & 多個Slave CPU
  - 常見於超大型系統中
從Processor可分為以下兩種：
- Multi-Core Processor：在同一個die上具有多個core的CPU，在chip上通訊更快，一個多核比多個單核功耗更低。
- Many-Core Processor：一個divice上有幾百個計算核心，像是Nvidia GPU，Intel Xeon Phi，TILE64等。
從Memory可分為以下兩種：
- Uniform Memory Access (UMA)：目前最常見的代表是對稱多處理器(SMP)，相同的處理器，內存訪問次數相等，最常見的架構。
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
- Non-Uniform Memory Access (NUMA)：通過物理連接多個SMP來實現，一個SMP可以直接訪問另一個SMP的內存，跨鏈接的內存訪問速度較慢。
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
- Distributed Systems
  也稱為loosely coupled system，每個處理器都有本地的內存，透過I/O bus或網路相互通信，容易擴展大量節點像是Internet。
  - Client-Server Distributed System：容易管理與分配資源，但是server會成為瓶頸與單一故障點。
  - Peer-to-Peer Distributed System：每台機器在分佈式系統中的角色都是相同的(去中心化)，像是bitTorrent，Internet。
    Image Not Showing Possible Reasons
    The image file may be corrupted
    The server hosting the image is unavailable
    The image path is incorrect
    The image format is not supported
    Learn More →
  - Clustered Systems：集群計算機共享存儲並通過局域網 (LAN) 或更快的互連（例如 InfiniBand（高達 300Gb/s））緊密相連。

3. Special-purpose Systems

Real-Time Systems
- "Real-time"並不意味著速度，而是要keeping deadlines，保證響應和反應時間，Real-time又可以分為Soft real-time與Hard real-time。
  Soft real-time requirements：不期望錯過deadline，但錯過也不嚴重，一個關鍵的real-time task獲得優先於其他任務，並保持該優先級直到它完成，例如多媒體串流。
  Hard real-time requirements：錯過deadline會導致失敗，例如：核電站控制器，Hard real-time通常沒有Secondary storage，資料都存在short term memory或read-only memory (ROM)。
Multimedia Systems
廣泛的應用程序，包括音頻和視頻文件，例如online TV
Embedded Systems
移動裝置，通常內存較小，電池容量有限，處理器速度慢，顯示器較小。

Computer system

用戶(User)：人、機器、其他computer。
應用程序(Application)：定義系統資源用於解決計算問題的方式。
作業系統(Operating System)：控制和協調硬件/資源的使用。
硬件(Hardware)：提供基本的計算資源(CPU、內存、I/O設備)。
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Operating System的定義

資源分配器：管理和分配資源以確保效率和公平。
控製程序：控制用戶程序的執行和I/O設備的操作，防止計算機出錯和不當使用。
內核Kernel：一直運行的一個程序(其他都是系統/應用程序)。

Operating System的目標：

方便(使computer system易於使用和計算，適用於小型PC)。
效率(有效率使用電腦硬體，適用於大型、共享、多用戶系統)。

Operating System的重要性：

系統 API 是用戶應用程序和硬件之間的唯一接口，API是為通用目的而設計。
OS code不允許任何錯誤，任何中斷（例如無效訪問）都會導致重啟。
OS technology的擁有者控制著軟硬件產業(Wintel)。
Operating systems和computer architecture相互影響。

Modern Operating Systems

x86 platform：Linux，Windows
Smartphone Mobile OS：iOS，Android
Embedded OS：Embedded Linux，RTOS，Raspberry Pi，Xbox

Computer-System Organization(計算機組織)

一個或多個CPU、設備控制器通過公共總線連接，提供對共享內存的訪問，並發執行CPU和競爭內存週期的設備。
每個device controller都有一個local buffers，I/O是從device到device controller的local buffers，CPU將數據從內存移動到device controller的local buffers。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

I/O的操作有兩種

Busy/wait output
CPU不斷監控等待I/O，但是CPU在監控device時不能做其他工作，很難同時進行其他I/O。
Interrupt I/O
Interrupt允許device改變CPU的控制流，透過改變CPU的控制流在I/O與user process間切換。
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Interrupt

現代OS都是中斷驅動的，事件的發生由來自hardware或software的中斷發出信號。

Hardware(signal)：可以隨時通過向 CPU 發送信號signal來觸發中斷。
Software(trap)：可能會因錯誤（被零除或無效內存訪問）或用戶對作業系統服務的請求（系統調用）而觸發中斷，稱為trap。
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
中斷的常用功能
中斷一般通過中斷向量將控制轉移到中斷服務例程，中斷向量包含所有服務(即中斷處理程序)例程的地址(函數指針)。
中斷架構必須保存被中斷指令的地址。
在處理另一個中斷時禁止傳入中斷以防止丟失中斷。

Storage-Device Hierarchy

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

存儲系統的結構(速度、成本、揮發性)，其中又可大致分為

Main memory：CPU可以直接訪問的大型存儲介質，RAM(Random Access Memory)。
Secondary storage：主存儲器的擴展，提供非揮發性的大存儲容量，例如Hard disk，tapes。

RAM: Random-Access Memory

DRAM (Dynamic RAM)：主記憶體(Access Speed:>= 30ns)。
SRAM (Static RAM)：CPU cache(Access Speed:10ns~30ns)。

Disk Mechanism：旋轉磁盤，並透過磁頭去去讀取(寫入)磁盤資料

Caching：使用中的資料暫時從較慢的存儲，複製到較快的存儲

Coherency and Consistency Issue：更改register中的副本，使其與其他副本不一致

Single task accessing：沒有問題，總是使用Highest level copy。

Multi-task accessing：需要獲取最近的值。

Distributed system：Difficult b.c. copies are on different computers。

Hardware Protection

1.Dual-Mode Operation

共享系統資源需要OS保證一個不正確的程序不會導致其他程序的錯誤執行，提供硬件支持以區分至少兩種操作模式
1. User mode：代表User執行。
2. Monitor mode：也叫kernel mode或system mode，代表作業系統執行。
hardware中加入Mode bit以指示當前模式的：kernel(0) or user(1)，當interrupt/trap或故障發生時，hardware切換到Monitor mode。
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Privileged instructions

Executed only in monitor mode

Requested by users (system calls)

2. I/O Protection

所有I/O instructions都是Privileged instructions，任何I/O設備在user之間共享，必須確保user program永遠無法在monitor mode下獲得對computer的控制權。

3. Memory Protection

Memory Protection，保護中斷向量和中斷服務程序，還有數據訪問與被其他程序over-write。
HW support：兩個合法可使用地址的寄存器，超出定義範圍的內存受到保護
1. Base register：保存最小的合法物理內存地址。
2. Limit register：包含範圍的大小。

4. CPU Protection

-防止user program不返回CPU控制權，造成陷入死循環或無法調用系統服務。
HW support：定時器Timer，在指定時間後interrupts computer，Timer在每個clock中遞減，當Timer達到值0時，發生interrupts。

Timer常用於實現time sharing。
Load-timer is a privileged instruction。

Ch2 OS Structure

OS Services

OS服務可大致分為以下這幾種

User interface(用戶界面)
Program Execution(程式執行)
I/O operations(輸入輸出操作)
File-system manipulation(文件系統操作)
Communication(通訊)
Error detection(錯誤檢測)
Resource allocation(資源分配)
Accounting(計數)
Protection and security(保護和安全)

User interface

CLI (Command Line Interface)
從用戶獲取命令並執行它，Shell(命令行解釋器，如CSHELL、BASH)。
GUI (Graphic User Interface)
通常是滑鼠、鍵盤和螢幕，圖標代表文件、程序、動作等，界面中對像上的各種鼠標按鈕會導致各種操作。

Communication

可以使用消息傳遞或共享內存進行通信。

OS-Application Interface

System calls

運行程序的作業系統接口，請求作業系統服務，通過軟件中斷向內核發出的明確請求，通常是組合語言。

進程控制——中止、創建、終止進程分配/釋放內存
文件管理——創建、刪除、打開、關閉文件
設備管理——讀、寫、重定位設備
信息維護——獲取時間或日期
通訊—發送接收信息

API(Application Program Interface)

用戶主要針對API進行編程而不是system call，通常由language libraries實現，例如 C Library，API 調用可能涉及零個或多個系統調用。

Both malloc() and free() use system call brk()
Math API functions, such as abs(), don’t need to involve system call

三個最常見的API:

Win32 API for Windows
POSIX API for POSIX-based systems (including virtually all versions of UNIX, Linux, and Mac OS X)
Java API for the Java virtual machine (JVM)

Why use API?

簡單(API是為應用程序而設計的)。
可移植性(API是統一定義的接口)。
效率(並非所有功能都需要OS services或涉及kernel)。

System Calls傳遞參數

System Calls傳遞參數通常使用以下三種方法，在運行程式和作業系統之間傳遞參數。

在寄存器中傳遞參數。
將參數存儲在memory中的一個table中，table的address作為參數傳遞到一個寄存器中。
程式將參數push(store)入stack ，作業系統將參數pop出stack。

OS Structure

從用戶角度，作業系統應該易於使用和學習，以及可靠、安全和快速。
從系統角度，作業系統應該易於設計、實施和維護，並且可靠、無錯誤且高效。

1. Simple OS Architecture

只有一層或兩層代碼，缺點:不安全，難以增強改善。

2. Layered OS Architecture

下層獨立於上層，第N層只能訪問第0~(N-1)層提供的服務。優點式更容易調試與維護，缺點是效率較低，難以定義層。

3. Microkernel OS

盡可能多地從kernel移動到userspace像是一個program，通過消息傳遞提供通信，容易擴展和移植。缺點是效能更差，因為所有的動作都要再透過kernel。

4. Modular OS Structure

大多數現代作業系統都實現了kernel modules，與Microkernel不同之處在於Modular OS全部都在kernel space。

使用物件導向的方法(object-oriented approach)。
每個核心組件都是獨立的。
每個模組都通過已知接口與其他模組通訊。
每一個模組都可以根據需要在kernel中掛載。

kernel module實作
http://www.linuxchix.org/content/courses/kernel_hacking/lesson8 http://en.wikibooks.org/wiki/The_Linux_Kernel/Modules
https://www.thc.org/papers/LKM_HACKING.html

5. Virtual Machine

虛擬機採用分層方法得出其邏輯結論，它將hardware和operating system kernel視為硬體。虛擬機提供與底層bare hardware相同的接口，為每個process提供底層計算機的(virtual) copy。
由於VM是執行在user space，當遇到無效的instruction會透過interrupt去call底層的kernel space執行，而目前有hardware support的硬體是透過加一個bit去判斷kernel mode,user mode,VM mode。

虛擬機的困難點在於"critical instruction"難以處理。

VM的用處:

提供對系統資源的完整保護。
解決系統兼容性問題的手段。
作業系統研究和開發的完美工具。
一種提高雲計算資源利用率的方法。
資安作滲透測試。

Vmware (Full Virtualization) 以user mode運行，作為作業系統之上的應用程序，虛擬機認為它們是在bare hardware上運行，但實際上是在user-level應用程序中運行。

Para-virtualization: Xen
向guest提供與guest系統相似但不相同的系統(必須修改guest)，只有一半被虛擬化。在container (zone)processes中認為它們是系統上唯一的進程，Hardware被虛擬化(只安裝一個kernel)。

6. Java Virtual Machine

已編譯的 Java 程式是由Java Virtual Machine(JVM)執行與平台無關的bytecodes，Just-In-Time(JIT)編譯器可提高性能。

JVM consists of

class loader

class verifier

runtime interpreter

Ch3 Processes Concept

Process Concept

Process與Program最主要的差異，在於是否處於執行狀態

Program (Passive entity) 只是儲存於"disk"中等待被執行的程式碼
Process (Active entity) 則是位於記憶體中的正在被執行的"Program"

一個"Process"由以下內容組成： Code segment：也稱text section，將 program (已編譯為instruction) 讀到記憶體中，等待被 CPU fetch 並執行
Data section：global variables
Stack：temporary local variables and functions(方便使用後直接pop掉)
Heap：dynamic allocated variables or classes
Current Activity：管理 Process 的 Meta data，如Program counter, Register Contents
A set of associated resources：相關資源的 set，如 Open file handlers (檔案的開啟、或是跟 OS 要某些硬體的控制權)

1. Threads

Threads 又稱為 Lightweight process，是使用 CPU 的最小單位，同 Process 的 Threads 有共享的記憶體空間。在 Parent Process 創造 Threads 時就會 allocate ，因此省去在空間管理及處理上的執行動作。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

同Process中，Threads共用以下內容
- code section
- data section
- OS resources (e.g. open files and signals)
但以下部份各自獨立
- stack
- register set
- thread ID
- program counter

2. Process State

Process 在其生命週期中，共有 5 個狀態

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

New：當 process 被創建時，將 code 讀到記憶體中，並將前述記憶體狀態初始化
Ready： Process 競爭的資源是 CPU 的核心，管理的方式為佇列 (Queue)，在等待被執行的階段，會被放在佇列當中，此狀態稱為"Ready"
Running：被 CPU 排程選到，執行 instructions。OS 為了確保主控權，一段時間就會將 CPU switch 給 scheduler 而"打斷(interrupt)"程序
Waiting：有些指令不會直接使用 CPU (如 I/O)，因為不需要 CPU ，又需等待某些指令完成，所以進入 Waiting 狀態，待完成後重新放回 Ready
Terminated：Process 完成執行，釋放資源給 OS。

3. Process Control Block (PCB)

User 創建 Process 後，OS 會自動建立 PCB 來做管理。我們說將 Process 放到佇列當中，指得是將 Process 的 pointer 放到佇列中，而非記憶體空間，使用的資料結構為"Linked list"。PCB 包含以下內容：

Process state
Program counter
CPU registers 此外還有：
CPU scheduling information(e.g. priority)
memory management information(e.g. base/limit register):只有 process 在執行時才會把這兩個值從 memory load 進 CPU 的 register。
I/O status/information
accounting information
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

4. Context Switch

CPU 一次只能執行一個 Process，要切換給另一個 Process 時，必須將舊 Process 的資訊 (e.g. PCB) 儲存起來，並載入新 Process 的資訊，這個動作稱為「Context Switch」。
CPU 完整切換的流程如下:

執行 P0
P0 被打斷讓出 CPU ，此時處於 idle 狀態
進行 Context Switch
將 P0 的狀態存於 PCB
並讀取 P1 的 PCB。
完成 Context Switch
執行 P1
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Context switch 的過程中， CPU 沒有執行任何 instrunction，因此 Context Switch 其實是 overhead，浪費 CPU 的資源。但為了 CPU sharing 及 time sharing，Context switch 無法避免地經常發生，所以我們只能盡可能縮短時間：

memory speed
number of registers：越少則存取的次數降低，但現在的系統中為了更快速的計算，register 數量是變多的
special instructions：合併 load & save PCB 的 instruction，讓 CPU 一次做完
hardware support：multiple sets of registers，一次記住多個 Process 的狀態，可以快速的在 registers 間進行切換，無須透過 memory access

Process Scheduling

Multiprogramming：多工處理， CPU 可以執行多個 Process，以最大化利用 CPU
Time sharing：分時系統，為一種資源的共享方式，系統可以頻繁地進行 Context Switch，讓多個 User 可以同時使用 Program
由於 Process 共享 CPU，使用的先後順序就是由 scheduler 來決定。

1. Process Scheduling Queues

Job queue (New State)：系統中所有的 Process
Ready queue (Ready State)：在主記憶體中正在及等待執行的 Process
Device queue (Wait State)：等待 I/O 處理的 Process
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

會有一個 ready queue for CPU，很多個 I/O 的 waiting queue，看是哪個 device 或 disk，會有自己獨立的 queue，利用 PCB 中的 pointer 串起來

queue 之間的 tail 和 head 都有串起來，比較容易做 reordering，不同的 Scheduling Algorithm，可能會 dynamically 決定 ordering (run time做調整)，需要其他 pointer 串起來，變成雙向或把前、後記起來

ready queue 可能會有 level 1、level 2、level 3，取決於 Scheduling Algorithm 自身管理方式

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Ready queue 中的 process 會等待被 load 進 CPU
Waiting queue 發生的原因有很多種，OS 會有專門的 queue 去處理這些 event 的發生：
- I/O request：若自己發出 I/O request，這時候會被放到 I/O queue，等到做完 I/O，會 throw 一個 interrupt 回來，OS就會把它放回 ready queue，e.g. call printf()，會被放到 monitor queue
- time slice expired：timer的 alarm 被 fire，會從 CPU(running state) 直接回到 ready queue，這是為了確保主控權可以回到OS的手上
- fork a child：process 可以自己 create process，在Linux中，create process的動作叫做"fork"有兩種實做方式
  1. 可能自己繼續執行讓children等Parent執行完在執行
  2. 先讓children先執行，原來的Parent process會被放在waiting queue，等到Children執行完，才可以再回到ready queue執行
- wait for an interrupt：e.g. call sleep() 的 system call，call sleep() 的 process 會被放在 waiting queue 裡，等著 interrupt 的發生，才可以回到 ready queue 裡面

2. Schedulers

Scheduling 分為 CPU 及 Job Scheduling

CPU Scheduling : 因為 time sharing 的原因，發生頻率非常高，又稱為 Short-term scheduler。CPU Scheduler 會選擇該被執行的 Process 並 Load 到 CPU 當中，從"Ready State" 轉換成 "Run State"
Job Scheduling : 可能幾秒或幾分鐘才需要scheduling，又稱為 Long-term scheduler，通常是使用者人為觸發，新的程式被執行，產生程序時才需要被排程，從 "New State" 轉換成 "Ready State"
Medium-term : 跟虛擬記憶體結合產生的一種 scheduler，因為 memory 空間是有限的，有時會將 disk 的空間拿來使用，當 memory 不夠時，就會將部份 Process 從 memory 踢回 disk，這個動作即由 Medium-term scheduler 來處理，由 "Ready state" 轉換為 "Wait state"
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

3. Long-Term Scheduler

控制在記憶體中 Process 的數量 (degree of multi-programming)
當 degree 太低時， CPU 有很多的時間在 idle
當 degree 太高時，會發生 Thrashing，有太多的 Process 在爭搶有限的 memory，導致 Process 一直在 memory 跟 disk 之間 swap，不斷的做 I/O
因我們希望 CPU 跟 I/O 執行的時間差不多，負載盡可能平衡而最大化系統效能，所以 Process 要能夠在 CPU-Bound & I/O-Bound 良好的混合
在 UNIX/NT 系統中，並沒有 Long-Term Scheduler 直接將 Process 放到 short-term scheduler 中 (現今電腦的記憶體空間通常是足夠的) Multipropramming degree 受硬體限制 (e.g. # of terminals) 或著由使用者自己調整

4. Short-Term Scheduler

執行頻率很高，大約 100ms 執行一次
由演算法來縮短每個 Process 的等待時間
因為效率要高，演算法也不能太過複雜，避免 CPU overhead 過高 overhead = (time of pick) / (time of execution + time of pick)
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

5. Medium-Term Scheduler

因現代 memory 空間的增長以及虛擬記憶體的觀念引入，過往由 Long-Term Schduler 處理的動作多改由 Medium-Term Schduler 執行

swap out : 將 Process 由 memory 搬到 disk 中
swap in : 將 Process 由 disk 搬到 memory 中
propose : 改善 Process mix / 減少記憶體中的 Process 數量，降低 degree釋放記憶體空間
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Operations on Processes

回到 programmer 及 user 角度，我們到底如何跟 Process 溝通

1. Tree of Processes

無論什麼作業系統， Processes 一定可以畫成如下的樹狀結構
每一個 Process 由 unique processor identifier (pid) 識別
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

2. Process Creation

資源之間的關聯性可能如下
- Process 的 Parent 及 Child 共享資源
- Child Process 共享部份 Parent 的資源
- Parent 及 Child 完全不分享資訊
兩種可能的執行方式
- Parent 及 Children 同時執行
- Parent 等到 Children 結束後才執行
兩種可能的記憶體空間
- Child 為 Parent 的複製 (執行碼、counter 等與 Parent 完全一樣)，透過 sharing variable 溝通 (address 不同)
- 將特定的 program 丟進 Child 的 code section，重設 counter，讓 Child 變成完全不同狀態的 Process，透過 message passing 溝通

3. UNIX/Linux Process Creation

Reference

fork system call (Create Process)
- 創建新的 (child) process
- 該 process 複製其 parent 的記憶體空間
- Child 與 Parent 在 fork 後同時執行 (execution concurrently)
- Child 在 fork 的回傳值為 0
- Parent 在 fork 的回傳值為 Child process 的 PID
execlp system call (Reset Process)
- 先創建 Process 及 OS 要記憶體空間，並註冊新的 PID
- 將新的 binary file 讀入記憶體，並將內容如變數 (heap、stack)、counter 等全部清空 (reset) ，等於讓新的 Child Process 執行不同的程式
wait system call
- 因 Child & Parent 在 Unix 中是 concurrent 執行，若要控制執行的順序，必須用 wait system call，強制 parent 直到其中一個 Child 的 process 完成後才執行
Memory space of fork()
- 舊的實作 Child 為 Parent 的完整複製
- 新的實作使用 copy-on-wirte 技術，在 Run time 過程中，儲存 Child 的與 Parent 的不同處 (A’s Child 在執行過程中會慢慢增長)
- 若 call execlp，則會產生完全獨立的 Child Process
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →

以下為 UNIX/Linux Process Creation 的範例程式碼

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

以下範例簡單示意 fork 的特性， Child Process 會保存 fork() 呼叫當下 Parent Process 的狀態。注意如果過程中呼叫 execlp ，則該分支會斷掉，樹不會繼續往下生長

4. Process Termination

當 Process 執行到最後一個指令或是呼叫 exit()，都會中止程序

該 Process 所有相關資源，如記憶體空間 (physical, virtual)、I/O buffer、open files 等都會清空並將資源還給 OS
Parent 可利用 PID(abort) 來中止特定 Child Process
- 當 Child 配置太多記憶體
- Child Process 執行的任務不再需要
Cascading termination：因為是樹狀結構，所以當 Parent 被中止，則 Child Process 也會被中止，這也是為什麼 "ctrl+C" 可以強制停止 Process，因為其他程序都是由 console 程序創建，更進階的是呼叫 "kill"，但需要管理者權限

Interprocess Communication

定義：在多個 Process (或 Threads) 間溝通的機制
目的：
- 資訊的共享
- 加快計算速度 (not always true)
- 便利性 (performs several tasks at one time)
- modularity (e.g. micro-kernel)

Communication Methods
要做到 IPC 用 memory 分別有幾種方式

Shared memory：共用記憶體空間
- 需要小心處理 synchronization
- 利用 memory address 實作，好處就是快
- 透過 memory 的 address 存取資料
Message passing：透過 memory 的 copy，網路或跨電腦的程式通常都是以這種作法溝通
- 沒有 conflict 的問題，但若資料小，用這個方法的話可以省去 share memory 的 lock 之類的額外資源消耗，反而會略快（且不用在意 sync 的問題）
- 使用 send/recv message
- 以 system call 實作，通常較慢
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
Sockets
- 透過 IP & Port 來 identify 使用者，port number 指得就是 Process
- 藉由未結構化的位元資料來溝通 (unstructed stream of bytes)
- 會有 client & server
- server 要先打開 client 才能連
- client 進行 connect 之後，就可以盡情 read / write
- 補充：server 在 accept 之後，會動態開 thread，如此才能一次處理多個 request
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
Remote Procedure Calls (RPC)
- 可以呼叫其他的 Process 甚至其他電腦的 Process
- 參數以及回傳值以 message 傳遞
- 與 socket 不同，傳輸的資料是有 data type 的
- Library 通常會有 stubs，在 client 端和 server 端會各有一隻小程式，（server 端的叫 skeleton）會負責 implement 細節
- 其實通常下面也可能用 socket，只是幫 programmer 包起來而已
- parameter mashaling: 把 params 包進 message 且嚴謹的管理。要把兩台電腦串在一起其實有許多細節要處理，RPC 就要負責幫使用者把這些繁瑣的事情解決掉
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
  In distributed computing, a remote procedure call (RPC) is when a computer program causes a procedure (subroutine) to execute in a different address space (commonly on another computer on a shared network), which is coded as if it were a normal (local) procedure call, without the programmer explicitly coding the details for the remote interaction.

1. Shared Memory

Process 要能夠 …

創建 shared memory 的區塊
- 共享的記憶體區塊通常會在創建 shared memory 的 Process 的附近
- 透過這個 shared memory 溝通的 Processes，要告知 OS 說我要共用這塊記憶體位置 (attach)
資料格式為 bytes，由 process 自行定義， OS 不處理
Processes 不能同時對同一塊記憶體做讀寫的動作 Consumer & Producer Problem
兩個 Process 共用一塊 Buffer 記憶體
Buffer 是大小為 B 的 circular array
因為 Buffer 有空間限制，必須有以下資訊來控制
- next free: in
- first available: out
- empty: in = out (out 拿光)
- full: (in + 1) % B = out
- 這邊為了 sync，避免無法判斷 in = out 時是 empty 還是 full，必須使用 in+1 留一個空格。之後會提到以 locking 的方式使用所有的 Buffer 空間
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →

2. Message Passing

讓 Process 之間可以溝通且同步
IPC 一般來說 OS 會提供兩種操作：Send/Receive 的方式達到同步化，並在背後做各種 memory copy 的處理
- Send(message)
- Receive(message)
Message system : Process 不使用 shared variable 達到溝通的目的
為了 Process 間的溝通
- 必須建立 communication link
- 藉由 Send/Receive 交換資訊
實作 communication link 的方式
- physical (e.g. shared memory, HW bus or network)
- logical (e.g. logical properties, programmer 在意的事情)
  - communication 是否有方向性
  - 角色是否相對
  - Blocking / non-blocking：傳輸行為執行完成才會回傳，或不管對方是否有收到立刻回傳
  - send by copy or by reference
  - 固定大小 / 可變大小的 message

Direct communication (如打電話)

Processes 必須要有對方的名稱
- Send(P, message) : 傳輸 message 給 Process P
- Receive(Q, message) : 從 Process Q 接收 message
communication link 的特性
- Links 通常為自動建立
- links & processes 為一對一的關係
- link 可以是單向，但通常是雙向
Direct communicatioｎ　最大的缺點是關係唯一對一，且連線的對象無法變更，除非重新建立，所以比較難以模組化 (modularity)

Indirect communication (如寄電子郵件)

Messsages 由 mailboxes(或稱為 port) 導向及接受
- 每一個 mailbox 有自己的 ID
- Processes 若共享 mailbox 則可以互相溝通
- Send(A, message)：沒有指定接收者，單純將 message 放到 mailbox A
- Receive(A, message)：沒有指定傳送者，單純將 message 由 mailbox A 接收
communication link 的特性
- Link 必須由 programmer 建立，透過 Process 間共享 mailbox 來溝通
- 因為沒有指定接收者/傳送者，所以 Link 與 Process 間為 Many-to-Many
- Link 可以為單向或雙向
- Mailbox 可以由 Process 或 OS 擁有，通常處理 message 的是另一隻程式
Mailbox Sharing
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
- 因為可能同時有很多人去讀取 message，但理論上一個 message 應該只能有一個人拿到，有以下幾種解決方法：
  - Link 只允許兩個 Process (一個 send 及一個 receive，等同 direct，最糟的解法)
  - 利用 lock 一次只允許一個 Process 拿取 message
  - 讓 mailbox 自行決定誰來接收。receiver 會先提出要求， mailbox 決定接收者後， sender 會被告知 receiver 是誰，再告知 receiver 來取得 message

Synchronization

Message passing 可以為 blocking (synchronous）或是 non-blocking (asynchronous)
- Blocking send: sender 被 blocked 直到 message 被 receiver 或 mailbox 接收才 return
- Nonblocking send: sender 送出 message 之後繼續執行
- Blocking receive: receiver 被 blocked 直到皆收到 message 才 return
- Nonblocking receive:若 sender 先 call，則 receiver 正常接收 message ; 若 sender 前 call，則 receiver 只會收到之前的值。所以 function call 通常還會 return 一個 token ，來確定是否真的收到 message
Buffer implementation : non-blocking 通常透過 buffer 實作，只要 buffer 空間沒有滿，就可以一直 send ; 對 receiver 來說就是一直接收，不管 buffer 有沒有東西
- Zero capacity ：通常是 blocking send/receive
- Bounded capacity ：若 buffer 滿了， sender 會被 blocked，或是 return 一個 error
- Unbounded capacity ：一直送直到系統滿了為止

3. Socket

socket只負責建立channel，passing由programmer負責。

server端先開一個socket，
bind到一個port，OS才知道哪個人要用這個port做溝通
server要先listen，等人進來
client 開socket，然後connect過去server
server等到有人就accept，建立兩邊通訊
建立起來就可以read，write
結束後兩端可以選擇close，server close後就沒人可以連進去
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

大部分web server或是socket programming，都是設計可以讓許多人連線，如果server是single thread，就只能handle一個user，所以實際上accept後，server side會去creat一個thread(動態建立)。

4. Remote Procedure Calls

socket只能傳bytes，RDC能夠傳struct，類似透過網路remote的functon call。 RDC有stubs在client與server運行，幫忙處理parmeter，把東西package跟unpackage，底層再透過socket傳輸。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

RDC可以跨電腦(eg.big/small endian的問題，32跟64位元int定義長度不同)，marshaling需要處理不同平台的問題；如果要pass pointer大部分的RDC是不支援的，因為pointer指向可能會用到的memory都要copy。

Ch8 Memory Management

Background

主記憶體和暫存器為唯二 CPU 可以直接存取資料的地方
Processes 在硬碟中等待被讀入記憶體中及被執行
多個 program 被讀入記憶體中可以改進資源使用率以及反應時間
Process 在執行期間有可能在硬碟及記憶體之間移動
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

How to refer memory in a program? – Address Binding

Compile Time
程式中的資料、參數等必須要有對應的記憶體位置，早期的作法是在編譯期間，就決定要放在哪裡，送進 CPU 時 CPU 就知道該去哪裡存取資料
- Compiler 會將組合語言轉換成實際的記憶體位置(absolute code)
- 若起始地址變更(可能該地址被其他 process 占住)，則需要重新編譯 (recompile)，等於要關閉並重新執行
- ex: MS-DOS .COM format
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
Load Time
- Compiler 不會決定實際的記憶體地址，而是會留一個變數(Base Register)，給出相對位置。程式讀取後(loader，loader 一般假設程式是重新執行，因此會執行非常多動作，如 process 的 control block、記憶體初始化、data structure 的創建等，也就是 program 變成 process 的過程)，才透過該變數決定真正的地址(relocatable code)
- 若 run time 起始地址變更，則需要重新讀取 (reload)
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
Execution Time
- Compiler 將組合語言轉換成 logical-address(i.e. virtual-address)
- 雖然看似 compile time 決定實際地址，但這其實是個虛擬的地址。
- CPU 以為記憶體地址就是 compiler 轉換的位置，但送要求去記憶體存取資料時，中間還有一個硬體單元 MMU(i.e. Memory-Management-Unit, MMU)，會做記憶體地址的轉換，導向正確的實際地址
  - MMU 將虛擬地址 map 為實際地址
  - 由 logical-address 加上 relocation register 產生實際地址
    Image Not Showing Possible Reasons
    The image file may be corrupted
    The server hosting the image is unavailable
    The image path is incorrect
    The image format is not supported
    Learn More →
  - MMU 可以看成 OS 的一部分，但因為使用頻率非常高，所以一般用硬體來實作
  - 多數 general-purpose 的作業系統使用這個方法
    Image Not Showing Possible Reasons
    The image file may be corrupted
    The server hosting the image is unavailable
    The image path is incorrect
    The image format is not supported
    Learn More →
Logical vs. Physical Address
- Logical address 為 CPU 送出的地址，也就是 virtual address
- Physical address 為記憶體看到的真實地址
- Compile time & Load time address binding 的方式，logical address = physical address
- Execution-time address binding 的方式，logical address != physical
- User program(Programming, Compiling…etc) 只管 "logical" address，永遠不會看到 "physical" address

How to load a program into memory? – static/dynamic loading and linking

Dynamica Loading

不會將整個程式 load 到記憶體，子程式在 function call 的時候才 load
這樣做有更好的記憶體空間使用
- 未用到 routine 不會被載入記憶體
- 在大量程式碼使用率較低時(如 error handling code)特別有用
OS 提供功能，但不會決定 routine 是 dynamic 或 static loading，而是由使用者決定(library, API)
在 C 語言中使用 Dynamic Loading 的範例，當呼叫 printf("%f\n", (*cosine)(2.0))時，cosine才會讀到記憶體中。現在 programmer 使用 Dynamic Loading 常常是希望在 run time 決定該使用哪個函式。
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Static Linking

函式庫透過 loader 被結合進 program 的 image 當中，compile 的時候，將函式庫加入程式碼
每一個用到函式庫的 program 都有一份，浪費記憶體資源，但執行比較快
即便改用 Static Linking 加上 Dynamic Loading，因為函式庫仍然需要，因此無法解決程式碼重複複製的問題
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Dynamic Linking

Dynamic Linking 在 run time 才去找連結
因此只需要一份 library 在記憶體當中
OS 在 program 中加入 stub，call stub 的時候會向 OS 確認 library 是否在記憶體中
stub call –> 找 referred lib 是否在 memory 中，沒有的話則 Load 進來 –> 將其位址替換掉，之後就不用經過 stub(relocation)
因為在 run time 時期才找函式庫，執行速度較慢在 windows 中為了系統效能，是編譯成 DLL(Dynamic link library)
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

How to move a program between mem. & disk? – Swap

Process 可以從記憶體中 swap 到 backing store，稍後再 swap 回記憶體中執行
- virtual memory 及 mid-term scheduler 使用
- backing store 是與檔案系統(file system)分離的一塊硬碟空間，透過 OS 的 MMU 直接管理
為何需要 Swap
- 釋放 memory space
- 讓重要性較低的 process roll out，roll in 重要性較高的 process
swap back memory location
- 若 binding address 是在編譯或 load 時期決定，則 swap 回去的記憶體位置要相同
- 若 binding address 是在 run time 動態決定，則 swap 回去的記憶體位置可以不同
程序被 swap 前， OS 會確認是否是閒置的(idle)，除 CPU 執行外，也不能執行 I/O
- 若想 swap 在 I/O pending 的 process，有以下解法：
  - 乾脆不要 swap 的 I/O pending 的 process
  - 做完的 I/O 先copy 資料到 OS buffers(memory space 不屬於任何 user process)，就能把 process swap 出去，swap 回來時，只要 OS buffer 中的資料複製到 process 的 buffer 即可
swap 主要時間花在資料的傳輸，transfer time 與資料傳輸的大小成正比，該如何決定被 swap 的 process 非常重要，詳見 CH9 - virtual memory

Contiguous Memory Allocation

相關資料 CS:APP Malloc Lab 解題筆記

Memory Allocation

Fixed-partition allocation:
- 每個 process 讀進一個 fixed-size 的 partition
- Multi-programming 的程度由 partition 的數量決定
Variable-size partition (Multiple Partition)
- Hole 定義為一塊連續的記憶體位置
- 不同大小的 Holes 散布在記憶體當中
- 當有 process 抵達，會被 allocate 在一個夠大的 hole 裡頭
- OS 會理 in-use 及 free 的 hole 的資訊
- 一個被釋放的 hole 能夠跟其他 hole 合併成一個更大的 hole
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →

Dynamic Storage Allocation Problem

問題：如何從一串 free holes 中找出符合大小 n 要求的 hole
解法：
- First fit - 1st 符合的 hole 就直接配置
- Best fit - 配置大小最剛好的 hole
  - 可能要搜尋整個 hole list
- Worst fit - 配置最大的 hole
  - 可能要搜尋整個 hole list
- First fit and Best fit 無論在速度或空間使用上都比 Worst fit 好

Fragmentation

External fragmentation (frag space 在 process 外)
- 全部的 free memory space 是足夠滿足需求的，但卻不是連續的
- 發生在 variable-size allocation
- 解法： Compaction
  - 將記憶體空間做 shuffle，在 execution time 讓 free memory 結合成一整塊
  - 只有 binding address 在 run time 決定才能使用
    Image Not Showing Possible Reasons
    The image file may be corrupted
    The server hosting the image is unavailable
    The image path is incorrect
    The image format is not supported
    Learn More →

Non-Contiguous Memory Allocation — Paging

相關資料 MIT6.s081 Lab: page tables

Paging Concept

Method:
- 將 physical memory 切成 fixed-sized blocks 稱為 frames
- 將 logical address space 切成一樣大小的 blocks 稱為 pages
- 為了要 run 一個有 n pages 的 program，需要找 n 個 free frames 並載入 program
- OS 會持續追蹤 free frames，以 page table 建立 logical to physical address 的轉換關係
Benefit:
- 允許 process 的 physical-address space 可以是不連續的
- 避免 external fragmentation
- 盡可能降低 internal fragmentation
- 提供 shared memory/pages

Page Table

每一個 entry 對應到在 physical memory 中的 page 的 base address
每一個 process 有一個 page table，由 OS 直接進行管理
- Page table 只包含 process 擁有的 pages
- process 無法存取不是其 space 的記憶體地址
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →

Address Translation Scheme

Logical address 分成兩部份
- Page number ( p )
  - 為 page table 的 key 值，對應 value 為 page 在 physical memory 的 base address
  - N bits 代表一個 process 最多可配置 2^N 個 pages 的記憶體，限制住 programm 可 allocate 的大小 (例如 32-bits 的電腦，一個 program 最多 allocate 4GB 左右的空間)
- Page offset ( d )
  - 與 base address 結合定義 physical memory address
  - N bits 代表 page size 為 2^N
- 由 page number 找出放在第幾個 page，再加上 page offset 得到實際的記憶體位置
physical address = page base address + page offset
如果 page size 為 1kb(2^10) 且 page 2 map 到 frame 5
若 logical address 為 13 bits (p=2, d=20)，則 physical address 為？
- 5 * 1kB + 20 = 1,010,000,000,000 + 0,000,010,100 = 1,010,000,010,100
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
pages 的總數量不一定和 frames 的總數量相同
- pages 的總數量決定 process 的 logical memory size
- frames 的總數量則與 physical memory 大小有關
例如：給定一個 32-bits 的 logical address，36 bits 的 physical address 以及 4KB 的 page size，這代表？
- page table size = 2^32 / 2^12 = 2^20 個 entries
- Max program memory: 2^32 = 4GB
- Total physical memory size = 2^36 = 64GB
- Number of bits for page number = 2^20 pages -> 20 bits
- Number of bits for frame number = 2^24 frames -> 24 bits
- Number of bits for page offset = 4KB page size = 2^12 bytes -> 12 bits

Free Frames

OS 用 free frame list 記下所有 free 的 frame

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Page/Frame Size

page & frame size 由 hardware 定義
- 通常是 2 的冪次
- 通常從 512 bytes 到 16 MB/page
- 最常見的是 4KB/8KB(32bit -> 4KB；64bit -> 8KB)
Internal fragmentation?
- Larger page size -> 更多空間的浪費
但因為以下原因， page sizes 有變大的趨勢
- 記憶體、process、data sets 變得更大， process 變大代表要 access 的 page 也越多，若 page size 太小一來找尋的次數增加，二來因為 page 太多很可能部份的 page 放在 disk 中，降低執行及讀取速度
- 更好的 I/O performance(during page fault)
- page table 可以比較小

Page Summary

使用者看到的是連續性的記憶體，但實際上是分散的
OS 為每個 process 維持一個 page table
OS 維持 frame table 管理 physical memory
- 每個 frame 有自己的 entry
- 表示 frame 是 free 還是 allocated
- entry 記 " process id " 和 " page number "

Implementation of Page Table

Page table 保存在記憶體中
因為做 translate 的為 MMU，而 MMU 是硬體單元，所以 Page table 在記憶體中的位置會存在 Page-table base register (PTBR)
- 記憶 Page table 在記憶體的位置
- PTBR value 存在 Process Control Block 裡頭(PCB)
- 在 Context switch 時更新 PTBR value (除了 CPU 的 register，MMU 的 register 也要重新 load)
每一次讀取記憶體實際上動作有兩步
- 第一步先找到 Page Table，mapping 後才到真實的記憶體地址存取資料
兩步讀取的動作可以由 Translation Look-aside Buffers (TLB) 解決，簡單來說是一個 cache，若 cache hit，則可以縮短為 1 步，以 Asscociative Memory 實作

Asscociative Memory

Associative Memory 為硬體，所有的 memory entires 可以同時被讀取 (查詢為 O(1))
- 每一個 entry 對應一個 associative register
但 entries 的數量是有限制的，一般來說是 64~1024 左右
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Translation Look-aside Buffer (TLB)

由 Associative Memory 組成，是硬體單元
page table 的快取被所有的 process 共享
Context switch 後，因為 process 不同，page table 也不同，一般會 flush 整個 TLB; 另一個解法是在 TLB entry 增加 PID 的欄位，同時符合 page number 及 PID 才叫做 hit (address-space identifiers(ASIDs))
因為 context switch 後 TLB 被清空，等於剛開始記憶體地址都要兩步才能取得，是 context switch 拖慢效能的主要原因
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Effective Memory-Access Time

20ns for TLB search
100 ns for memory access
Effective Memory-Access Time (EMAT)
- 假設 70% TLB hit-ratio: EMAT = 0.7x(20+100) + (1-0.7)x(20+100+100) = 150 ns
- 假設 98% TLB hit-ratio: EMAT = 0.98x120 + 0.02x220 = 122 ns

Memory Protection

page table 中，每一個 page 都有 protection bit ，表示 “唯讀” 或 “可讀寫”
valid-invalid bit
- Valid: page/frame 在 process 的 logical address space，process 可以讀取
- Invalid: page/frame 不在 process 的 logical address space ，禁止 process 存取
- 早期會先創建固定大小的 page table，因為 program 會長，有可能目前 page 的數量少於 page table 的大小，若此時有 pointer 不小心指到該處，就會報 segmentation fault
  - 缺點是可能很多 entry 沒有用到 -> 使用 page table length register (PTLR)，動態變更 page table 的大小。
  - program 配置記憶體以 byte 為單位，而 valid-invalid bit 以 page 為單位，因此需要 memory limit register 紀錄 program 真正使用的記憶體空間，限制無效的存取
    Image Not Showing Possible Reasons
    The image file may be corrupted
    The server hosting the image is unavailable
    The image path is incorrect
    The image format is not supported
    Learn More →

Shared Pages

允許 page 在不同 processes 間共享相同的 Reentrant code(read only)
Reentrant code (Pure code)
- 在執行期間不會改變
- 如 text editors, compilers, web servers, etc
在 physical memory 中只需要保存一份 shared code
Process 保有自己的 private data 以及 code
數個 virtual address map 到相同的 physical address
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Page Table Memory Structure

Page table 有可能很大而且難以存取
- 4GB (2^32) logical address space with 4KB (2^12) page
  - 要有 1 million(2^20) page table entry
  - 若一個 entry 4 bytes，則 page table 要 4MB，又因為 page table　是給 MMU 讀取的，MMU 沒有 address translation，所以這 4MB 的空間必須連續，但 physical address 通常很難找到 4MB 的連續空間
- 需要將 page table 切分成數個較小的 page tables，最好在一個 page size 大小(4KB) 的範圍內
- 或著將 page table 的總大小限縮
解法：
- Hierarchical Paging
- Hash Page Tables
- Inverted Page Table

Hierarchical Paging

將 page table 拆成數個較小的 table，將 logical address space 化成多層 page tables (n-level page table)，但因為多層的結構，實際上總 entry 是增加的。ex: A[1000] -> A[10][100] -> A[10][10][10]

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

64-bit Address
- 如果切成 42(p1) + 10(p2) + 12(offset)
  - outer table 需要 2^42 X 4B = 16TB 的連續記憶體!!
- 如果切成 12(p1) + 10(p2) + 10(p3) + 10(p4) + 10(p5) + 12(offset)
  - outer table 需要 2^12 X 4B = 16KB 的連續記憶體
  - 但記憶體存取次數變成 6 倍

Hashed Page Table

通常用在 address > 32 bits
Virtual page number 會被 hash 進 hash table
Hash table 的大小影響 bucket 的 chain 的長度
每一個 hash table 的 entry 包含以下資訊
- Virtual Page Number, Frame Number, Next Pointer
不過 pointer 的結構會浪費額外記憶體。另外，linked-list 搜尋時間為 O(N)
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
一個改善的實作是將 linked-list 的 element 設為 array，降低 linked-list traverse 的次數
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Inverted Page Table

不使用 page table，而是改用 frame table， entry 為 PID + Page Number
若使用 page table，當 process 數量越來越多， page table 的數量跟著增加
- frame table 只需要一個，可以事先配置，且使用率接近 100%，減少儲存 table 的記憶體需求
- 但存取時要搜尋整個 table，增加存取的時間
- 最麻煩的是難以實作 page 的 sharing，所以這種作法比較少見
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →

Non-Contiguous Memory Allocation — Segmentation

Segmentation

以使用者的角度來決定如何切割記憶體
一個 program 可以看成 segments 的集合，segment 是一個 logic unit，可能包含：
- main program
- function, object
- local/global variables
- stack, symbol table
- arrays, etc…
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →

Segmentation Table

Logical address:(seg#, offset)
- Offset 可以描述的長度不受 program 長度限制，只受限於系統定義 program 最多可配置多少記憶體有關
Segmentation table - map 二維的 physical address，每一個 entry 有
- Base(4 bytes): physical address 的起始位置
- Limit(4 bytes): segment 的長度
Segment-table base register (STBR)：儲存 table 的 physical addr.
Segment-table length register (STLR): 儲存 segment 的數量

Segmentation Hardware

limit register 檢查 offset 有沒有超過範圍
MMU 指派一個合適的 base address 給 segment 以配置 memory
- segment 之間的 physical address 不能重疊
s 為 table 的 index， d 是 offset，若 offset 超過 limit，就發生 segmentation fault; 若符合條件，與 page table 不同， physical address 為 base + offset 直接相加沒有單位(page) 的概念
每一個 segment 的 control 是獨立的，如 heap, stack 允許的空間可能不同
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Address Translation Comparison

Segment
- Table entry: segment base address, limit
- Segment base address 可以為任意值
- Offset 的最大值等同於 physical memory size
Page
- Table entry: frame base address
- Frame base address = frame number * page size
- Offset 的最大值等同於 page size

Example of Segmentation

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Segments 的 protection bits
- Read only segment (code)
- Read-write segments (data, heap, stack)
Code sharing 發生在 segment level
- Shared memory communication
- Shared library
不同的 segment tables 有相同的 base address 就共享 segment

Segmentation with Paging

基本概念

在 logical address space 使用 segmentation，映射到 pages 時再切，compiler 不用管 physical address 怎麼對應
在 physical address space 使用 paging，藉由 MMU 做地址的映射，提高記憶體使用的效率
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Address Translation

CPU 產生 logical address
- 交給 segmentation unit 後產生 linear address
- Linear address 交給 page unit
- 產生 physical memory address
Segmentation & pageing units 都由 MMU 處理
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Example: Intel Pentium

Logical-address space 分成兩部份
- 1st: 8K(2^13) segments(private), local descriptor table
- 2st: 8K(2^13) segments (shared), global descriptor table
- private & share 會是兩個分開的 segment
Logical address
- 一個 process 其 segments 的數量為 2^14 = 16K
- segment 的大小 <= 2^32 = 4GB
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →

Segmentation

Segment descriptor
- Segment base address & length
- 存取權利及優先級
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →

Paging (2 Level)

Page size 可以是 4KB 或 4MB
將 32 bits 切成 page directory(outer page table), page tables(inner page table) & offset
- 每一個 page directory entry 有 indication 用的 flag，代表使用 4KB 或是 4MB 的 page，系統會動態決定，若是 4MB 則 32 bits 只切兩份，後面的 bits 都給 offset
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →

範例題目

physical mem size = 512B 代表 seg offset 為 9 bits
page size = 32 B 代表 page offest 為 5 bits
8 個 segments linear address 前 3 bits 是 segment table 的 index 現有 12 bits 的 logical address = 448 -> 010|001001000

segment index = 010 = 2
對應到 001110110
加上 segment offset 001110110 + 001001000 = 0101|11110
因為 page offset 為 5 bits，所以 page table index = 0101 = 5
對應到 physical frame number = 2
所以最終結果為 0010|11110
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Ch9 Virtual Memory Management

Background

為何我們不希望執行 program 時，將所有內容放進 memory 中
- 有許多程式碼負責不常使用的 erros 或 conditions
- 部份 program routine 不常使用
- 很多 program 使用相同的 libary
- 已經配置但沒有被使用的 Arrays, lists & tables
我們希望更好的 memory 使用率，真的需要使用的才放入 memory 中
Virtual memory: 將 user logical memory 從 physical memory 分離
- 因為 logical -> physical 只是 mapping 的問題，所以 logical memory 可以比 physical memory 來的大，可以執行非常大的 process
- 提高 CPU/resources 的使用率，越多 process 在 virtual memory 中，degree 越高，就越容易找到需要某個資源的 process 去使用
- 簡化 compiling 的工作，不需要管 memory 的限制
- 提昇 programs 的執行速度，因為不用將整個 program 讀進 memory，有較低的 I/O load & swap 的負載
Virtual memory 可以透過以下兩種方式實作
- Demand paging，因為是 fix-size，對硬體使用來說資源分配比較容易管理
- Demand segmentation -> 因 segmentation 大小不固定，要找到空間的複雜度較高，效率較低，但對 user 來說比較容易理解及使用 (heap, stack…etc)
Virtual Memory vs. Physical Memory
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Demand Paging

當 Page 被需要時才讀進 memory 中，好處是
- 較少的 I/O 需求 –> 更快的響應
- 較少的 memory 需求 –> 可以有更多 programs 執行
當 Page 有 reference 指向它時，代表 Page 被需要
- 若 reference invalid –> abort (沒有 allocate)
- Not-in-memory –> 利用 paging 讀入記憶體 (page fault)
pure demand paging
- process 啟動時完全沒有 page
- 只有當 page 需要時才讀進 memory
swapper (midterm scheduler) 管理整個 process ，而 pager 則是管理 process 的一個 page 為單位
硬體支援
- Page Table: valid-invalid bit 來表示 page 是否在記憶體中
  - 1 -> page in memory
  - 0 -> page not in memory
  - 一開始該 bits 都設成 0
- Secondary memory (swap space, backing store): page 從 memory 被 swap 出去的儲存空間，通常是 high-speed disk

Page Fault

第一次的 reference 會引發 trap to OS
- 稱為 page fault trap

OS 會查看在 PCB 裡頭的 internal table ，確認是 Invalid 或只是 not in memory，決定後續動作
找到一個 empty frame (可以置放 page 的空間)
將 page 從 disk (swap space) 移到 frame
設定 Page-table，將 valid-invalid bit 設為 1
重啟 instruction
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Page Replacement

若 page fault 時沒有 free frame 的話
- Swap 一個 frame 回 backing store
- Swap 目標 page 進 frame

Demand Paging Performance

Effective Access Time (EAT): (1 – p) * ma + p * pft
- p 為 page fault rate, ma 為 mem. access time, pft 為 page fault time
- mem. access time 包含找到位置及將資料從 memory chip 搬到 CPU register 的時間
- page fault time 則多了資料在記憶體 - disk 之間 swap 的時間
- pft 遠大於 ma，EAT 基本上由 pft 及 page fault rate 決定
- 假設 ma = 200 ns, pft = 8 ms，若 1000 次產生 1 次 page fault，則速度就會慢上 40 倍 ; 換個角度如果我們希望 page fault 對讀取時間的影響小於 10%，則 page fault rate 必須小於 0.0000025
Programs 的 refernce 通常是 locality 的，代表 program 通常會存取在鄰近位置的 memory address
- 一個 page fault 可以讀取 4KB 的資料進來
- 大幅降低 page fault 發生的機率
Page fault time 的時間組成
1. page-fault 中斷
2. 將 page 從 disk 中讀進 memory (最花時間)
3. 重啟 process

Process Creation

Process & Virtual Memory

Demand Paging: 當有需要時才 swap page 進 memory
Copy-on-Write: parent 跟 child processes 一開始共享 frames，當有內容發生不同時，才會 copy frame
Memory-Mapped File: 將 file 的 disk content 當作 memory content，用 virtual address space 做 mapping，避開 file system，增加 I/O 效能。通常適用於大檔案，因為 paging 需要 4KB 的空間，小檔案使用會有記憶體空間碎片化的問題

Copy-on-Write

允許 parent & child 共享 frames
若其中一者變更 frame，則僅對該 frame 進行 copy 的動作
COW 使得 process creation 有效率(e.g. fork())
通常 free frame 時會完全清空，避免前一個 process 使用的一些 content 被其他 process 看見，造成潛在的安全性問題
When a child process is forked
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
After a page is modified
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Memory-Mapped Files

方法：
- MMF 透過 disk block to memory frame 的 mapping，讓 I/O 像一般的記憶體存取，略過 file system calls (eg. read(), write())
- file 一開始使用 demand paging 讀取，之後的讀寫動作視作一般的 memory access
好處：
- 略過 file system calls (如 read(), write())
- 允許多個 processes 共享 file 的內容
壞處：
- 安全性、資料遺失、更多的 programming efforts
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
範例：
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Page Replacement

Page Replacement 的演算法會嚴重影響系統的速度，假設演算法經常將需要用到的 pages swap 去 disk，則 pages 在 memory 跟 disk 一直來回的時間會大幅拖慢系統的效能

Page Replacement Concept

當沒有 free frame 時發生 page fault
- 將整個 process swap 掉，釋放所有 process 佔有的 free frames，但這樣 process 就完全不能使用，並不是一個好方法
- 採用 page replacement ，找到一個目前沒有被使用的 frame 將其 free 掉
  - 使用 dirty bit 表示 page 的內容是否被修改過，若 page 沒有被修改過，就直接 free 掉 frame 的內容; 有修改過才將資料寫回 disk ，來減少 page transfers 的 overhead，
replacement 必須要處理兩個問題
- free-allocation algorithm: 演算法必須決定多少 frames 應該 allocate 給 process
- page-replacement algorithm: 選擇哪些 frames 被取代

Page Replacement Step

在 disk 中找到需要的 page
若有 free frame，則直接使用; 若沒有 free frame 則執行 page replacement 演算法來選擇被取代的 frame
將需求的 page 讀進 free frame 中，並更新 page & frame tables
重啟 process 的指令

Page Replacement Algorithms

目標：最低的 page-fault rate
評估：一連串的 page id (reference string, memory refernces) 為輸入，計算 page fault 發生的次數

FIFO algorithm

在佇列中最舊的 page 會被取代
範例：
- 假設 available memory frames = 3
- 產生 9 次 page fault
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
Belady’s Anomaly: 即便增加 frames 的數量，page fault 的數量不一定會減少，還有可能會增加
- 假設 available memory frames = 4
- 產生 10 次 page fault
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
因為 Belady’s Anomaly，使得我們很難對系統進行調整，例如插越多條記憶體，反而執行速度有機會變慢

Optimal (Belady) Algorithm

取代未來最長時間不會被使用的 page
但實際上我們不會有未來的資訊
因此這個演算法比較類似提供我們一個基準
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

LRU Algorithm (Least Recently Used)

一種 optimal algorithm 的近似
- 因為無法預測未來，所以我們往回看
取代最長時間沒有被使用的 page
因為 locality 及時間複雜度低(O(1))的緣故，LRU 很常被使用
Counter implementation
- 紀錄每個 page 最近被使用的 time-stamp
- 取代 time-stamp 最舊的 page
- 時間複雜度 O(N)
Stack implementation (實際使用)
- 當 page 被使用時，將 page 移到 double-linked list 最前方
- 可以用 hash map 紀錄 double-linked list 每個 element 的指標，達成 O(1) 的實作
- 取代 linked list 中位在 tail 的 page
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →

Stack Algorithm

Optimal & LRU 都是 stack algorithm，有以下這些特性

n frames 的 pages 一定是 n+1 frames pages 的 subset，也就是說因為 stack 的特性，先被放入的元素一定會先被看到，若增加 frames，也只是在原 stack 後面增加一些 pages 而已
Stack algorithm 不會有 Belady’s anomaly 問題

LRU 近似的演算法

additional-reference-bits algorithm
second-chance algorithm：2 個 LRU stack
enhanced second-chance algorithm

Counting Algorithms

LFU Algorithm (least frequently used)
- 每一個 page 有 counter
- 經常被使用的 page 應該要有大的 count value
MFU Algorithm (most frequently used)
- 不過常用不代表正在用，有可能在某些時刻集中使用，將 count 拉高，但多數時間沒有使用
- 最小的 count 有可能是剛帶進來的，還沒被使用，所以應該留著
兩者都不常用
- 要一直執行加法成本很高
- 有 overflow 的問題
- 表現不一定較佳

Allocation of frames

每個 process 最少都需要一定數量的 frames

Frame Allocation

Fixed allocation: 根據 process 在一開始就決定要 allocate 多少 frame
- Equal allocation: 每個 process 配置到的 frames 數目相同
- Proportional allocation: 根據 process 的大小來配置
Priority allocation
- 在 run time 的時候，根據 process 的優先度決定配置
- 如果 process 發生 page fault
  - 選擇 process 本身的一個 frame 來替代
  - 或著選擇低優先度的 process 的 frame 來替代
Local allocation: 只能替換自己的 frames，通常跟 fixed allocation 綁在一起
Global allocation: 從所有的 frames 中選擇一個來替代
- 可以拿其他 process 的 frame
- 例如讓高優先度的 process 搶低優先度的 process
- 有好的系統表現，所以較常使用
- 會有一個問題是如果一直搶低優先度 process 的 frame，有可能該 process 甚至無法維持最低的 frames 數量，所以要有機制來維持最低 frames，避免 process 被 thrashing

Thrashing

如果一個 process 沒有足夠的 frames，代表 process 在記憶體中的 frames 數量不足以支撐 process 執行，導致 page fault 一直在發生 (且 page fault 是 instruction level，無法 overlap)，等於 CPU 要一直等待 I/O，造成 CPU 使用率反而下降的情況

Thrashing 的定義為：當一個 process 花在 paging 的時間比執行時間更高時，稱為 thrashing
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
Thrashing 造成的 performance 問題
- processes 因 page fault 在等待 I/O
- 因此 CPU 使用率下降
- OS 發現 CPU 使用率下降，增加 degree of multiprogramming
- 新產生的 processes 從舊的 processes 搶 frames，導致更多的 page fault
- 結果 CPU 的使用率繼續下降
為了避免 Thrashing，必須要控制 degree of multiprogramming

Working-Set Model

從 Locality 的角度思考，對一個 process 而言，我們可以計算在這段時間內使用多少 pages，這個數量就是要 assign 給 process 的 frames 數量
因為 process 的需求是動態變化的，所以 locality 會隨時間改變，OS 必須要持續追蹤 processes 的 locality
- 例如 program structure (如 subroutine, loop, stack)
- 或 data structure (array, table…etc)
Working-set Model
- Working-set window: 觀察的一段時間
- Working set: 這一段時間內，被 process reference 到的 page 數
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
- 每一個 process 的 working set size
  $W S S_{i}$
- frames 的總需求數
  $D = \sum W S S_{i}$
- 如果 D > m (available frames) ，則有 thrashing 發生
- OS 會監控每個 process 的
  $W S S_{i}$ ，並配置足夠的 frames 給每個 process
  - 如果 D << m，增加 degree of multiprogramming
  - 如果 D > m，把整個 process suspend (mid-term scheduler) 掉，free 該 process 所有的 frames
優點

在最大化 multiprogramming 的情況下，同時防止 trashing 的發生
可以最大化 CPU 使用率

缺點
1. tracking 的成本很高

Page Fault Frequency Scheme

監控每個 process 發生 page fault 的機率，控制 page fault rate 在一定範圍內來防止 thrashing 的現象
- 依據 page fault rate 為每一個 process 建立 upper 跟 lower bounds
- 如果 page fault rate 超過 upper limit，則配置更多的 frames
- 如果 page fault rate 低於 lower limit，則移除部份的 frames
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →

Working Sets and Page Fault Rates

重點是 memory 的 locality 會變化，所以需要追蹤 process 的使用情況，動態地調整 processes 及 frames 的配置情況

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Ch4 Multithreaded Programming

Threads

同 Process 中，Threads 共用以下內容
- code section
- data section (global variable, heap)
- OS resources (e.g. open files and signals)
但以下部份各自獨立
- stack
- register set –>可能 instruction 的執行位置不同，甚至執行不同的 function call
- thread ID
- program counter
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →

Motivation

web browser
- 如一個 thread 負責顯示文字，其他 thread 接收資料等
web server
- One request / process: poor performance
- One request / thread: 因為 code 及 data sharing，效率較高
- 會有一個 main thread 一直在等待，當接收到新 request 時，會產生一個 thread 專門處理，讓 main thread 可以繼續等待連接
RPC server
- One RPC request / thread
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →

Benefits of Multithreading

Responsiveness : 即使 program 的一部分被 block 住，或是執行冗長的操作，仍可繼續運行，加速響應時間
Resource sharing : 不同的 threads 可利用相同的記憶體位置來共享資源
Utilization of MP arch : 因為現在都是多核心，因此 thread 可以在不同 processors 上平行執行
Economy : thread 無須記憶體空間管理及處理，建立 Thread的 cost 比 process 來得小。而且因為 thread 共用 process 資源，thread 之間的 context switch 也比 process 來得快。
Multithreaded programming 提供一種更有效率使用多核心的機制，改善共時性 (concurrency)
多核心系統使 system designer 以及 application programmer 面對更多挑戰
- 對 OS 設計者 : 如何設計 scheduling algorithms 讓多核心可以平行執行
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →

Challenges in Multicore Programming

Dividing activities : 將 program 拆成多個可以同時處理的任務
Data splitting : 除了計算任務要拆，資料也要能夠拆開給不同任務處理。有時候 data 很複雜，不一定是平分給所有任務處理
Data dependency : shared data 的同步，是平行程式中最複雜且重要的部份
Balance : 各個任務的執行時間要差不多，否則會受限於執行最久的任務
Testing and debugging 非常困難

User vs. Kernel Threads

User threads : 由 user 透過 threads library 進行的 thread 控制。雖然由 user 創建，但實際 thread 要執行任務時， OS 會將該 thread binding 到某一個 kernel thread 上，每次可對應到不同的 kernel thread。
- POSIX Pthreads
- Win32 threads
- Java threads
- 所謂 user thread 就是 thread library，更重要的是 user level 的 thread，不由 kernel 直接控制，通常在創建及管理上會比較快
- 但如果 kernel thread 只有一個的話，即便有多個 user threads，因為有 binding 的關係，只要某個 user thread 執行時呼叫 I/O 或 sleep()，則其他 threads 就沒辦法再使用該 kernel thread
Kernel threads : 由 kernel(OS) 直接控制
- Windows 2000(NT)
- Solaris
- Linux
- Tru64 UNIX
- 由 kernel 進行創建及管理，速度較慢
- 若某個 thread 被 block，仍可由 kernel 安排其他 threads 執行工作

Multithreading Models

Kernel mapping threads 的方式通常有以下幾種：

Many-to-One
- 許多 user thread mapping 到一個 kernel thread
- 在不支援多 trheads 的 kernel 使用
- Thread 管理在 user space 完成，效率高
- 但整個 process 可能被 block
- 一次只有一個 thread 可以 access kernel，多執行緒在多處理器的環境下也無法平行化
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
One-to-One
- Windows XP/NT/2000
- Linux
- Solaris 9 and later
- 每個 user thread mapping 到一個 kernel thread
  - 因為 kernel thread 數量通常受到限制，所以 user thread
- More concurrency
- 因為創建一個 user thread 必須同時創建一個對應的 kernel thread，所以有較高的 overhead
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
Many-to-Many

多個 user threads 對應到相同或較少數量的 kernel threads，而且是在 runtime 期間做 mapping
user threads 數量不受限制
kernel threads 可以在多核心系統中平行執行
當某個 thread 被 block 時，可安排其他 kernel thread 來執行
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Shared-Memory Programming

定義：Processes 透過 shared memory space 進行溝通及協作
- 比 messgae passing 更快也更有效率
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
有許多議題需要處理：
- Synchronization
- deadlock
- Cache coherence
Programming 的技巧
- Parallelizing compiler
- Unix Processes
- Threads (Pthreads, Java)

What is Pthread

POSIX (Portable Operating System Interface，只定義行為不定義實作，也就是 API 層面必須一模一樣) 標準針對 Unix 類的不同系統 (MacOs, Linux…等)，程式碼只需要重新編譯而不用修改即可執行
- Message passing 的 library MPI 也有相同概念
Pthread 即是定義在 POSIX 標準下的 thread 庫

Pthread Creation

pthread_create(thread, attr, routine, arg)
- 有時我們會指定某個 thread 榜在某個指定的 core 上，因為不同 core 的 L1 cache 不共享，這個行為可以透過 attr 參數來控制
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →

Pthread Joining & Detaching

pthread_join(threadId, status)
- Block 直到特定 threadId 的 thread 結束
- 達成 threads 之間同步化的一種方法
- 範例：產生 pthread barrier
```
for (int i = 0; i < n; i++) pthread_join(thread[i], NULL)
```
- 要注意的是，若今天 thread 執行完之後，我們不需要把資料讀回來，則不用 call pthread_join，但多數的 thread 庫會要求 programmer call pthread_detach，告訴 library thread 結束後就直接 free 掉。若不去 call detach， OS 無法知道 thread 是否之後會被 join，導致 thread 的回傳值一直在等待，程式無法結束
pthread_detach(threadId)
- 一但 thread 被 detached，則永遠無法被 join
- Detach 一個 thread 可以 free 掉 thread 的資源
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →

Linux Threads

Linux 在 kernel 不 support multithreading(因為 Linux 只有 process 或 task)
User-level 可以使用 Pthreads 來實作多執行緒
fork system call: 產生新的 process 並完全複製 parent 的 data 及程式執行狀態
clone system call: 產生新的 process 並控制哪些 segment 要與 parent share 哪些不需要。因為 Linux 沒有 thread ，所以用 clone system call 來達到 thread 的概念
- 有一系列的 flag 來控制資料共享的程度，共享的資料以 pointer 的方式指向
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
Semantics of fork() and exec()
- 若 process 有兩個 thread，其中一個 thread call fork() 可能有以下兩種情況
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
- 部份 UNIX 系統支援兩種 fork() 的方式
- execlp() 會取代整個 Process，而不是單一 thread
Thread Cancellation
- 一個 thread 若執行結束後， main thread 必須有一系列 cancel 的動作
  - Asynchronous cancellation : main thread 會馬上中止 target thread
  - Deferred cancellation (default option) : target thread 會預設一些中斷點，main thread 會週期性地確認是否執行到中斷點，等 target thread 執行到中斷點才中斷。
Signal Handling
- Signals(synchronous or asynchronous) 在 UNIX 系統中用來告知 process 某個事件發生
  - synchronous : 如不合法的記憶體存取
  - asynchronous : control-C
- Signal handler 處理 signals 的順序
  1. Signal 被特定事件產生
  2. Signal 被送到 process
  3. Signal 被處理
- Signal 處理的 Options
  - 傳送 signal 給 deliver signal 的 thread
  - 傳送 signal 給所有 thread
  - 傳送 signal 給特定的 thread
  - 直接由 main thread 來處理 signal，如 file handler 的操作

Thread Pools

很多應用是直接產生一定數量的 threads，避免動態 thread 的產生、刪除時造成的 overhead，程式執行比較有效率，如 web server
優點
- 直接接收 request 通常比創建一個新的 thread 再接收 request 來得快
- 可以控制 threads 數量，進而控制系統的資源用量
threads pool size 的決定: 利用 # of CPUs, 預期的 # of request 以及記憶體大小來決定

Ch6 CPU Scheduling

相關資料

Basic Concepts

Multiprogramming
- keep 多個 process 在 memory 中
- 一次只有一個 process 在 CPU 執行，其他的 process 處於 waiting 狀態
- 當執行狀態的行程必須等待資源才能往下執行（I/O request），作業系統拿回CPU的控制權交給等待的行程
Process Scheduling 的目的是讓 CPU 在每個時刻都有工作可以做，增加使用效率
Process 基本上不是在執行 instruction，就是在執行 I/O，執行一連串的 instrunction 又稱為 burst(CPU burst & I/O burst)
- 一般來說，多數的 CPU burst 時間很短，少部份的 CPU burst 時間很長
- 一個 I/O Bound(I/O 密集) 的 Program 通常有很多短的 CPU burst (若沒有很長的 CPU burst，則通常是 I/O Bound)
- 一個 CPU Bound(計算密集) 的 Program 可能有一些很長的 CPU burst (如很長的 for loop)
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →

CPU Scheduler

從 Ready Queue 選擇誰可以被執行 (也就是將 Process 讀到 CPU 當中)

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Preemptive vs. Non-preemptive

CPU scheduling 的決定時機主要有以下幾個
1. 從 running 換到 waiting state
2. 從 running 換到 ready state (CPU burst 被打斷，time sharing)
3. 從 waiting 到 ready state
4. 終止狀態
Non-preemptive scheduling: 若一個 process 可以繼續執行(在 CPU burst)，則不會被打斷
- 因此只會在上面 1. 跟 4. 的情況下作 re-scheduling
- 例如 Window 3.X
Preemptive scheduling
- 在所有情況都有可能 re-sheduling
- 如 Windows 95 及之後的版本, Mac OS X

Preemptive Issues

Preemptive 最大的缺點是 synchronization 的問題
- 需要同步化
- 導致存取 shared data 的問題與對應的 cost
對 OS kernel 設計的影響
- 在 kernel 中做 lock & unlock 的設計
- Unix 的解決方法是: 當在 run OS kernel 的 instruction 時，會把 timer disable，從 preemptive 變成 non-preemptive

Dispatcher

由 scheduler 決定誰被執行，dispatcher 執行換人的動作，其負責
- switching context
- 從 kernel mode 轉換到 user mode
- process 換了之後，page table 的 pointer、program counter(load 到新的 state)
Dispatch latency: dispatcher 執行換人動作，停止一個 process，啟動另一個 process 所需的時間
- Scheduling time
- Interrupt re-enabling time
- Context switch time

Scheduling Algorithms

Scheduling Criteria

CPU utilization
- 理論上是 0% ~ 100%
- 實際上是 40% ~ 90%
Throughput(從系統的角度)
- 單位時間完成的 processes 量
Turnaround time(從 single job 的角度)
- submission ~ completion（從ready ~ terminate的時間）
Waiting time
- Process 執行期間在 ready queue 中等待的時間
- 所以 I/O Burst 不算在 waiting time
Response time
- submission ~ 第一個 response 產生(第一個 CPU burst 開始執行)

Algorithms

FCFS Scheduling
- 依 (Burst time) 抵達順序執行(先來的先執行)
  - P1(24), P2(3), P3(3)
    Image Not Showing Possible Reasons
    The image file may be corrupted
    The server hosting the image is unavailable
    The image path is incorrect
    The image format is not supported
    Learn More →
    Image Not Showing Possible Reasons
    The image file may be corrupted
    The server hosting the image is unavailable
    The image path is incorrect
    The image format is not supported
    Learn More →
  - Waiting time: P1=0, P2=24, P3=27
  - Average Waiting Time (AWT): (0 + 24 + 27)/3 = 17
- Non-preemptive
- Convoy effect: 很多 process 在等待一個很長 CPU Time 的 process，造成平均等待時間大幅增加的不良現象
Shortest-Job-First (SJF) Scheduling
- 時間最短的 process 先處理
- 運用行程下一個 CPU burst 的長度，而非 CPU burst total 長度
- SJF 的 average waiting time 一定是最短的
- Schemes
  - Non-Preemptive SJF: 當一個行程拿到 CPU，不會被搶佔直到他完成
    Image Not Showing Possible Reasons
    The image file may be corrupted
    The server hosting the image is unavailable
    The image path is incorrect
    The image format is not supported
    Learn More →
    Image Not Showing Possible Reasons
    The image file may be corrupted
    The server hosting the image is unavailable
    The image path is incorrect
    The image format is not supported
    Learn More →
  - Preemptive SJF: 當有新的行程且他的 CPU burst 的長度比較小，搶佔發生
    Image Not Showing Possible Reasons
    The image file may be corrupted
    The server hosting the image is unavailable
    The image path is incorrect
    The image format is not supported
    Learn More →
    Image Not Showing Possible Reasons
    The image file may be corrupted
    The server hosting the image is unavailable
    The image path is incorrect
    The image format is not supported
    Learn More →
Approximate Shortest-Job-First (SJF)
- SJF 的難處是在執行下一個 CPU burst 之前，無法得知實際的執行長度
- Approximate SJF: 下一個 burst 可預測為過去 CPU burst 長度的 exponential average
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
Priority Scheduling
- 每一個 process 有一個 priority number
- CPU 被分配給高優先度的行程
  - Preemptive
  - Non-preemptive
- SJF 可視為優先度的排程，其優先度是下一個預測的行程 CPU burst time
- 問題：Starvation – 優先度低的 process 可能不會被執行
- 解法：aging – 隨時間增加，若 process 還沒被執行到，則增加優先度
  - ex: 每 15 分鐘增加優先度 1
Round-Robin(RR) Scheduling
- process 執行時，可在 CPU 執行的時間(time quantum)有限制，通常是 10~100 ms
- 達規定的執行時間後，程式被 preempted 並加到 ready queue 的尾端
- Performance
  - TQ Large –> 類似 FIFO
  - TQ small –> (context swtich) overhead 增加
    Image Not Showing Possible Reasons
    The image file may be corrupted
    The server hosting the image is unavailable
    The image path is incorrect
    The image format is not supported
    Learn More →
Multilevel Queue Scheduling
- Ready queue 切成分開的 queue
- 同一個 queue 通常放類似功能的 process
- 每一個 queue 有自己的排程演算法
- 因為還是只有一個 queue 的程式可以執行，所以 queue 之間也有排程，常見的作法是用權重的方式隨機挑選一個出來
- Fixed priority scheduling：有 starvation 問題，若最上層的 queue 先做，下層的 process 可能一直做不到
- Time slice：每個 queue 分配一定的 CPU time
Multilevel Feedback queue Scheduling
- process 執行的狀況在 run time 才能得知，如過去 CPU burst 多少、呼叫了哪些 system call 等等，系統會在 run time 接收這些資訊並分類放到合適的 queue 中
- process 可以在不同的 queue 中移動，因為也是一種 priority queue，也會有 starvation 的問題，通常搭配 aging 實作
- 一定先執行上層的 queue
- 依照 process CPU burst 的特性分類
  - I/O-bound 及 interactive 的 process 在較高層的 queue –> short CPU burst
  - CPU-bound 的 process 在較低層的 queue –> long CPU burst
    Image Not Showing Possible Reasons
    The image file may be corrupted
    The server hosting the image is unavailable
    The image path is incorrect
    The image format is not supported
    Learn More →
- 當新的 job 抵達 Q0，以 FCFS (First Come First Serve) 演算法排程，若沒辦法在 8ms 內完成，則將 job 移到 Q1
- 同樣地 Q1 也是 FCFS ，若 job 在 16ms 內還是沒執行完，則被 preempted 並丟到 Q2
- 只有當 Q0 ~ Qi-1 是空的時候， Qi 中的 job 才會被執行
- 下一次回來 ready queue 時，系統會依照 feedback 判斷放在哪個 queue 中 (feedback 資訊會放在 PCB 中)
- 通常 scheduling 演算法由以下參數決定
  - queues 數量
  - 每個 queue 的排程演算法
  - 提昇/降級一個行程的方法

Evaluation Methods

Deterministic modeling
- 以預定義的 workload 及演算法表現的好壞，缺點是難以通用化
Queueing model
- 數學分析
Simulation
- 以隨機數字產生器或 trace tapes 產生 workload
Implementation
- 最準確的方式就是直接實作並觀察

Multi-Processor Scheduling Multi-Core Processor Scheduling Real-Time Scheduling

Multi-Processor Scheduling

Asymmetric multiprocessing
- 所有系統的執行會由一個 master processor 掌控
- 其他 processors 只執行 user code (由 master 配置)
- 遠比 SMP 簡單
Symmetric multiprocessing (SMP)
- 每一個 processor 都是 self-scheduling
- 所有 processors 共用 ready queue，或是每一個 processor 有自己的 ready queue
- 需要同步機制

Processor affinity

Processor affinity: process 與執行它的 processor 之間有 affinity 關係
- process 會將最近常使用的資料放在執行它的 processor 的快取中
- 快取的無效及資料重新放入是高成本的行為
Solution
- soft affinity: 允許 process 在不同 processor 執行
- hard affinity: 只能在同一個 processor 執行

NUMA and CPU Scheduling

NUMA (non-uniform memory access)
- 發生在結合 CPU 與 memory boards 的系統中
- CPU scheduler 及 memory-placement 會一起工作
- process 被配置在有 affinity 關係的 CPU 的 memory board
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
Load-balancing
- 讓不同 processors 間的 workload 盡量平均
- 只在 processor 有 private queue 的系統下才需要
- 兩種策略
  - Push migration: 將 processes 移動到閒置或 less-busy 的 processor 中
  - Pull migration: 閒置的 processor 將等待中的 task 從其他忙碌的 processor 中拉過來
  - 通常是平行實作
- Load balancing 經常抵銷 processor affinity 帶來的效益

Multi-core Processor Scheduling

Multi-core Processor:
- 更快且更少的 power 消耗
- memory stall: 當存取記憶體時，花費許多時間在等待資料 available (e.g. cache miss)
Multi-threaded multi-core systems
- 兩個(或更多)的硬體 thread 被 assign 給每一個 core (i.e. Intel Hyper-threading)
- 當一個 thread 在存取記憶體時，其他 thread 就可以執行 CPU 指令
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
Two ways to multithread a processor
- coarse-grained: 當 memory stall 發生時切換到另一個 thread，因為 instruction pipeline 必須被 flush 所以成本很高
- fine-grained(interleaved): 把 pipeline 的狀態保留並切換到另一個 thread 執行，需要更多的 register 來保存資料
Scheduling for Multi-threaded multi-core systems
- 1st level: 選擇某個 software thread 執行在每個 hardware thread(logical processor)
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
- 2nd level: 每個 core 如何決定執行哪一個 hardware thread

Real-Time Scheduling

Real-time 不代表越快越好，重點在 deadline 前要完成
Soft real-time
- 雖然不希望超過 deadline，但並不會馬上出問題
- 例如影音串流
Hard real-time
- 若超過 deadline 則導致 fundamental failure
- 例如核電廠的控制
Hard real-time

Real-Time Scheduling Algorithms

FCFS scheduling algorithm – Non-RTS
- T1 = (0, 4, 10) == (Ready, Execution, Period)
- T2 = (1, 2, 4)
Rate-Monotonic (RM) algorithm
- 依據頻率的大小來做排程
- 更短的週期(週期是固定的數值，在 run time 期間不會變動) –> 更高的優先度
  - 同一個 task 的所有 job 都有一樣的 priority
  - task 的優先度是固定的
- Fixed-priority RTS scheduling algorithm
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
Earliest-Deadline-First (EDF) algorithm
- 更早 deadline –> 更高的優先度
- 動態優先的演算法
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →

Operating System Examples

Solaris Scheduler

Priority-based multilevel feedback queue scheduling
有六類 scheduling，一個 process 只會屬於某一類:
1. real-time
2. system
3. time sharing
4. interactive
5. fair share
6. fixed priority
每一個類有自己的 priority & scheduling 演算法
Scheduler 會將類的優先度轉換為 global 的優先度
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
Solaris Example(time sharing, interactive)
- 優先度與 time slices 成反向關係，time slice 越小的優先度越高
  - Time quantum expired: 當 thread 在沒有 blocking 的狀況下使用整個 TQ，則重新排優先度
  - Return from sleep: 若 thread 從 sleeping(I/O wait)回來 ready queue 時，會重新排優先度
  - 期望 CPU-bound 的 process 慢慢往下沉，I/O-bound 的 process 慢慢上浮
    Image Not Showing Possible Reasons
    The image file may be corrupted
    The server hosting the image is unavailable
    The image path is incorrect
    The image format is not supported
    Learn More →

Windows XP Scheduler

Multilevel feedback queue
- 優先度分為 0~31，由優先度最高的 queue 開始排程
- 優先度最高的 thread 永遠在執行
每一個 priority queue 使用 Round-robin
除了 Real-time 的 task 之外，優先度在 run time 時期動態變更
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Linux Scheduler

Preemptive priority based scheduling
- 只有 user mode processes 可以被 preempted
- 兩個不一樣的 process priority ranges
- 值越低優先度越高
- TQ 較高者有更高的優先度
Real-time tasks: (priority range 0~99)
- static priorities
Other tasks: (priority range 100~14-)
- 依 task 執行狀況動態決定優先度
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
Scheduling algorithm
- 若一個 task 還有剩餘的 TQ，則 task 可以被執行，只要還沒耗盡 TQ，都會保留在 active array，希望每一個 task 都能用完它的 TQ
- 當 task 耗盡 TQ，則為 expired 且不應被執行，移到 expired array
- task expire 後，系統會決定新的優先度以及 TQ
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →

Ch5 Process Synchronization

當共享的資料同時被不同 Process / threads 存取時，因為執行順序的不確定性，很容易發生 data inconsistency 的狀況，所以需要額外的機制來確保程式的正確性，也就是 Synchronization 以下為常見的 Consumer & Producer Problem

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

除了 Buffer 之外， Counter 也是 share variable，要注意的是 counter++ 及 counter-- 在 instruction level 其實有三道指令，以 counter++ 為例，要先將 counter 從記憶體中移到 register，加上 1 之後，再存回記憶體。由於 context switch, Preemptive scheduling 等因素，這些指令可能不是一次執行完成，導致執行結果不符合預期

move ax, counter
add ax, 1
move counter, ax

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Race Condition

當多個 Processes 同時存取及操作 shared data 時，最終的值取決於最後一個完成執行的 Process，這個現象稱為 Race Condition

在 single-processor machine 中，我們可以 disable interrupt 或著使用 Non-preemptive scheduling 來達成同步，但在 User Program 中不可能使用，會影響整個系統的運作
通常將可能產生 Race Condition 的區域稱作 Critical Section

Critical-Section Problem

目的：建立 Processes 之間合作的 protocal
Problem description
- N 個 Processes 競爭使用 shared data
- 每一個 Process 存取 shared data 的 code segment 我們稱為 critical section
- 若我們確保當一個 Process 執行 critical section 時，其他 Processes 不得執行 critical section，稱之為 mutually exclusive，是比較暴力的解法
- 一般常見 critical section 的架構如下
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →

Critical-Section Requirements

Mutual Exlusion：
- 當一個 Process 執行 critical section 時，其他 Processes 不得執行 critical section
Progress：
- 若沒有 Process 在執行 critical section，且有某些 Processes 希望進入它的 critical section，則 Process 不得在未定義的情況下被推遲，一定要進的去
Bounded Waiting：
- Process 等待進入 critical section 的期限必須受到限制，不可一直排隊等待

Critical-Section Solution & Synchronization

Algorithm for Two Processes

我們先以兩個 Processes 的簡單例子做講解
- 只有兩個 Process P0 & P1
- 共享變數
  - int turn
  - turn = i –> Pi 可進入 critical section
    Image Not Showing Possible Reasons
    The image file may be corrupted
    The server hosting the image is unavailable
    The image path is incorrect
    The image format is not supported
    Learn More →
    以上的程式碼其實並不完美，這個程式要能順利運作的前提是，兩個 Process 必須輪流執行，但 while loop，並不保證執行的順序，很有可能 Process 0 執行完一次，又想進入 critical section，但 Process 1 還沒執行到 turn = 0 這一行，P0 就會卡在 while loop ，不符合 progress

Peterson’s Solution for Two Processes

為了解決上述 Progress 的問題，我們引入另一個變數 flag ，用來表示 Process 是否想要進入 critical section 中，當 flag[i] = True 的時候，代表 Pi 已經準備好要進入 critical section，turn = j 代表把 key 交給別人。所以 while 條件式 flag[j] && turn==j 的意義為檢查其他 Process 是否可進入 critical section，確認沒有其他 Process 要進入 critical section 後，自己再進入 critical section。要注意如果 turn = j 不是把 token 交給別人而是搶過來用，當某個 Process 先進入 critical section，輪到另一個 Process 時也可以 break while 迴圈，但無法進入 critical section 導致 deadlock

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

簡單的證明 Peterson’s Solution

Mutual exclusion : turn 不可能同時為 0 或 1，因此不可能同時進入 critical section
Progress :
- 若 P1 尚未準備好 –> flag[1] = False –> P0 可以進入
- 若兩者都準備好 –> flag[0] = flag[1] = True
  - 若 turn = 0 則 P0 進入，否則 P1 進入
- 以上兩種狀況，Process 都可以順利進入 critical section
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
Bounded waiting :
- 若 P1 離開 critical section –> flag[1] = False –> P0 可以進入
- 若 P1 離開 critical section 後，繼續執行 while loop –> flag[1] = True –> turn = 0 (覆寫 P0 原先的 turn = 1) –> P0 可以進入
- 以上兩種狀況，P1 進入後接著進去的一定是 P0，符合 Bounded Waiting 的要求
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →

Producer/Consumer Problem

情況一：我們將所有程式碼都放入 critical section，如果 consumer 先 call，因為 buffer 是空的，所以 consumer 會卡在 while 迴圈 ; 這時再 call producer ，因為 consumer 已經在 critical section 裡頭， producer 無法進入導致 deadlock
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
情況二：我們只將可能產生 race condition 的程式碼放進 critical section，雖然執行結果正確，但如果某一個 Process 進入 critical section 其他 Process 進不去，可能導致無謂的 context switch 降低效率
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Bakery Algorithm (n processes)

在進入 critical section 之前，每一個 Process 會抽號碼牌
號碼牌數字最小的先進入 critical section (注意可能有相同的號碼牌，如下方程式碼以 max 實作，而 max 指令其實有好幾行)
號碼牌數字的產生一定是 non-decreasing order; i.e. 1,2,3,3,4,5,5,5…
若兩個 Process Pi & Pj 有相同的號碼牌，則 PID 小的先進入
choosing[i] 為是否正在抽號碼牌的 flag，在與其他 Process j 比較之前，我們必須等待 Process j 抽完號碼牌。考慮一個狀況， Process j 比 i 晚抽完號碼牌，但號碼與 i 一樣大，又 j 的 PID 比 i 小，所以 Process j 應該要先執行 ; 如果在 j 還沒抽號碼牌之前就進行比較，此時 num[j] = 0，程式會以為 j 不要執行，反而先執行 i，當 j 抽完號碼牌之後發現數字與 i 相同，因此也可以進入 critical section，但這時 i 已經在裡頭，導致 deadlock。
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Pthread Lock/Mutex Routines

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Condition Variables (CV)

Condition Variables 代表一個條件，符合該條件時可以觸發 thread 執行某些動作
三種 CV 的操作
- wait() : 直到其他 thread 呼叫 signal()/broadcast()，thread 處於等待狀態
- signal() : 喚醒一個等待中的 thread
- broadcast() : 喚醒所有等待中的 thread
在 Pthread 中，CV type 為 pthread_cond_t
- 使用 pthread_cond_init() 來初始化
- pthread_cond_wait (&theCV, &somelock)
- pthread_cond_signal (&theCV)
- pthread_cond_broadcast (&theCV)
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
- 上圖執行順序如下：
  1. action() 進入 CS，呼叫 thread_cond_wait 並釋放 lock
  2. counter() 取得 lock 進入 CS
  3. counter() 呼叫 singal() ，action() 被喚醒，注意這時候 action() 是沒有 lock 的
  4. counter() 釋放 lock
  5. action() 取得 lock 並執行函式

ThreadPool Implementation

一次創建一定數量的 threads，讓系統不用一直增刪 threads，並且讓系統可以有效控制 threads 的數量，避免過多的 context switch 反而降低效率，達到效能最佳化。

創建 threads 後，為了讓 threads 執行不同的 function，會建立一個 threadpool_task_t 的結構體，存放函式及對應參數的指標，因此 threads 在經過 while loop 被喚醒之後，執行目前看到 threadpool_task_t 的函式，等於處理了一個 request，處理完成後進入 wait 狀態，等待下次被喚醒。而 threads 要如何知道被喚醒，其中一個方式是透過 Condition Variable 判斷以下程式碼為一個簡單的範例：

創建 threads pool 以及 task 的 queue
thread 執行 while loop 處於 wait 狀態
若 thread 被喚醒則離開 while loop，並執行 queue 中 pool->head 的函式
執行後 pool->head 加 1，再判斷該值是否等於 queue_size，若相等表示已經執行到最後一個函式，將 pool->head 設為 0
離開 critical section 執行任務
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Hardware Support

Critical Section Problem 的發生是因為 shared variable 的修正可能被打斷，若某些指令我們可以一行完成，就可以避免 race condition 的問題。前面都是在說軟體上的支援，若硬體可以將指令變成 atomic instructions ，就沒有同步的問題

atomic : 無法中斷執行的最小單元
例如 : TestAndSet(var), Swap(a,b)

Atomic Test And Set()

注意以下程式碼不符合 Bounded waiting

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Atomic Swap()

先 call Swap 的取得進入 critical section 的權力，執行完成後將 token 還給 lock。注意以下程式碼不符合 Bounded waiting

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Semaphores

Samaphores 是一個通用的同步處理工具，其紀錄某個資源有多少 units 可以去使用
- 若 # of record = 1 –> binary semaphore, 但其實用 mutex lock 就好
- 若 # of record > 1 –> counting semaphore
利用兩個 atomic operations wait & signal 來存取 (注意與 critical section 的 wait & signal 不同)
簡單的 Spinlock 以 semaphore 實作
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
Semaphore 為 POSIX 標準庫之一，不屬於 Pthread，所以使用上不限於 thread ，也可以用於 Process

Non-busy waiting Implementation

Busy waiting (SpinLock) 因為 while loop，執行效率沒有被最佳化，所以相對 busy waiting 就有 non-busy waiting 的實作方式，但 system call 要考慮 context switch、re-scheduing 的影響，成本比 while 來得高。一般來說，等待時間很短的，我們會用 busy waiting，反之則使用 non-busy waiting

Semaphore 為一個有 queue 的結構體，包含 Semaphore 的值以及有哪些 Processes 等待被執行
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
wait() & signal()
- 使用 system calls: sleep() & wakeup()
- 一樣必須為 atomic 的操作
- 必須先進行 value–(++) 操作，避免進入 if 區域後被 sleep 無法執行
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
由於 value--(++) 還有 queue 的 insert, delete 等，必須確保 wait() & signal() 為 atomic 操作
- single-processor : 透過 disable interrupts
- multi-processor :
  - HW Support (e.g. Test-And-Set, Swap)
  - SW Support (Peterson’s solution, Bakery algorithm)
Semaphore with Critical Section
- 將 value--(++) 以及 queue 的操作包進 critical section
- 因為 sleep() 及 wakeup() 為 system calls，無須擔心同步的問題， OS 會妥善處理，所以不用包進 critical section 中(也不該包進去)
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →

Cooperation Synchronization

有些情況不同 Process 間的執行順序很重要，考慮兩個 thread P1 & P2 分別執行 S1 & S2，S2 必須在 S1 完成後才可執行
- 實作方式 : 利用 shared variable semaphore sync ，以下程式碼確保 S1 一定先執行完成
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
更複雜的情況概念相同，由上述兩個 Processes 的例子做延伸
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Classical Synchronization Problems

常見的問題如下：

Bounded-Buffer (Producer-Consumer) Problem
Reader-Writers Problem
Dining-Philosopher Problem

Reader-Writers Problem

標準在處理 I/O 會遇到的問題
A set of shared data objects
A group of processes
- reader processes (read shared objects)
- writer processes (update shared objects)
- writer process 對 shared object 有獨佔權
不同的種類
- first RW problem : 只要確定 writer 有 access 的獨佔權即可，不關心 reader 要等多久，或要做什麼事情。reader 取得 token 時，可以傳給其他 reader，可能造成 writer 一直被其他 reader 插隊，永遠拿不到 lock
- second RW problem : 不只要確定 writer 的 access 權限，另外當 writer ready 且 object 被釋放時，writer 的優先性必須高於 reader ，不可有新的 reader 進來插隊

First Reader-Writer Algorithm

writer 為 binary semophore
Reader 共享一個 block
readcount 計算有幾個人要讀
- 當 readcount > 1 時，代表已經有 reader 拿了 lock，所以這個 reader 也可以讀
- 當 readcount = 1 時，因為是第一個 reader，可能要等 writer 結束才可讀取
- 當 readcount= 0 岱表示最後一個讀的，讀完後要 signal writer
- readercount 為 shared variable，必須受到保護
Writer可能會有 starvation problem ，因為只要有 reader 在讀，其他 reader 都可以進入 critical section，造成 writer 一直在等待狀態

// mutual exclusion for write 
semaphore wrt=1
// mutual exclusion for readcount
semaphore mutex=1
int readcount=0;

// writer
Writer(){
    while(TRUE){
        wait(wrt);
        // Writer Code
        signal(wrt);
    }
}

// reader
Reader(){
    while(TRUE){
        wait(mutex);
        readcount++;
        if(readcount==1)
            wait(wrt);
        signal(mutex);
        
        // Reader Code

        wait(mutex);
        readcount--;
        if(readcount==0)
             signal(wrt);
        signal(mutex);
    }
}

Dining-Philosopher Problem

5 個人坐在 5 張椅子上，且有 5 支筷子
每個人只會思考或是吃飯
- 思考時與剩下 4 個人沒有互動
- 每個人要用餐需要兩支筷子
- 一個人 1 次只能拿 1 支筷子
- 吃完飯後，將兩支筷子都放下
若每個人都拿其中一邊的筷子，就 deadlock 了
starvation problem
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Monitor

雖然 semaphores 提供非常便利及有效的同步機制，但正確性主要是依賴 programmer

所有存取 shared data object 的 processes 都需確保 wait() & signal() 的執行順序及位置正確
但是人總有可能犯錯
可以將 Moniter 想成一個特殊 class (high level language construct)，每一個 Moniter 都有可能發生 race condition 的 local variable (對應下圖的 shared data) 需要保護，而要操作這些 var 必須透過 Moniter 的 method。
Monitor 與 OO 最重要的差異為，Monitor 一次只有一個 thread 可以執行其 method。(但可以很多個 threads 同時 call 這個 method，只是會處於 inactive 狀態)
monitor 利用 queue 做排程，確保一次只有一個 process 是 active 的，保護 shared data
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Monitor Condition Variables

為了允許 process 等待 monitor，需要宣告 condition variable，例如 condition x, y;
condition variable 只能與 wait() 及 signal() 一起使用
- x.wait()
  - 代表調用這個 operation 的 process 被閒置，直到其他 process 完成為止
- x.signal()
  - 啟動一個等待中的 process，若沒有 process 處於等待狀態，則 signal() 操作沒有任何影響。與其相反，semophore 的 signal() 一定會改變 semaphore 的狀態 (wait 把 counter 減 1; signal 把 counter 加 1)
- 在 monitor 當中，condition variable 不是 counter 而是一個佇列，所以 x.wait() 類似把 process 放到 waiting queue (enqueue)，而 signal 則是在 dequeue，所以如果 queue 是空的，signal 就沒有任何作用
- Monitor 可以有很多個 threads 存在，但一次只有一個 active
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →

Dining Philosophers Example

monitor type

建立三個狀態 thinking, hungry & eating
建立 shared variable self[5]，存放每個人的狀態
- P.S. 可以用 shared variable 都還算好解決，難解的是無法得知他人狀態的問題
定義 4 個 method

monitor dp {
    enum {thinking, hungry, eating} state[5]; //current state
    condition self[5]; //delay eating if can’t obtain chopsticks
    void pickup(int i) // pickup chopsticks
    void putdown(int i) // putdown chopsticks
    void test(int i) // try to eat
    void init() {
        for (int i = 0; i < 5; i++)
            state[i] = thinking;
    }
}

void pickup(int i)
- 將狀態設為 hungry，代表要吃東西了
- 測試拿筷子後是否可以吃
- 若狀態沒有改變，表示測試失敗，把自己放到 wating queue

void pickup(int i) {                
state[i] = hungry;                  
test(i); // try to eat               
if (state[i] != eating)                  
    // wait to eat                   
    self[i].wait();             
}

voild putdown(int i)
- 吃完了將狀態改為 thinking
- 告知左右兩邊進行測試，看有沒有辦法吃

void putdown(int i) {
state[i] = thinking;
// check if neighbors are waiting to eat
test((i+4) % 5); 
test((i+1) % 5);
}

void test(int i)
- 若左右兩邊的人都沒在吃，代表左右兩邊手上都沒筷子，還要確認自己是否想吃東西
- 都符合的話，將狀態改為 eating 並喚醒 thread/process

// t## Ch7 Deaklock

### Deadlock Problem
- 一系列被 blocked 的 processes 持有部份資源且都在等待獲得其他 processes 擁有的資源
- Ex1: 2 processes and 2 tape drivers
  - 每一個 process 擁有一個 tape drive
  - 每一個 process 要求其他的 tape drive
- Ex2: 2 processes and semaphores A &
  - P1 (holb B, wait A): wait(A), signal(B)
  - P2 (holb A, wait B): wait(B), signal(A)
#### 必要條件
- Mutual exlusion
  - 一次只能有一個 process 可以使用資源
- Hold & Wait (同時 Hold & Wait)
  - 一個 process 持有某些資源且等待其他 processes 擁有的資源
- No preemption (有些情況很難保存 state 或保存的 cost 很高)
  - 資源只能被持有的 process 自己放棄
- Circular wait
  - 存在一系列等在中的 processes {P0, P1, …, Pn} 形成環狀等待
  - P0 -> P1 -> P2 -> … -> Pn -> P0
- 當以上*四個條件*都發生時，才會有可能的 deadlock 的情況發生
#### System Model
- 資源類型 R1, R2,…,Rm
  - 如 CPU, memory pages, I/O devices
- 每一個資源類型 Ri 有 Wi 個 instance
  - 如一個電腦有兩個 CPUs
- 每一個 process 使用資源方式如下
  - Request -> use -> release
#### Resource-Allocation Graph
- 3 個 processes, P1 ~ P3
- 4 個 resources, R1 ~ R4 (資源可能有多個 instances)
  - R1 & R3 各有一個 instance
  - R2 有兩個 instance
  - R4 有三個 instance
- Request edges
  - P1 -> R1: P1 要求 R1
- Assignment edges
  - R2 -> P1: 一個 R2 的 instance 被 assign 給 P1
- 若圖上有 *circular* 存在，則*可能*有 deadlock 存在
- 左圖有 deadlock，右圖沒有 deadlock  
![](img/Ch7.Deadlock.jpg)![](img/Ch7.Cycle_No_Deadlock.jpg)  
#### Deadlock Detection
- 若圖沒有 cycle --> 沒有 deadlock
  - 因為 Circular wait 不會形成
- 若圖有 cycle
  - 若每一個資源類型都只有一個 instance --> deadlock
  - 若每一個資源類型都有多個 instance --> 可能會 deadlock
#### Handling Deadlocks
- 方法1.，確保系統永遠不會進入 deadlock 的狀態
  - deadlock prevention: 確保四個條件中至少一個條件沒有滿足
  - deadlock avoidance: 分配資源前，Run-time 動態地檢查目前資源分配的狀況
- 方法2.，允許 deadlock，發生再想辦法修復
  - deadlock detection
  - deaklock recovery
- 方法3.，忽略問題且假設系統中永遠不會發生 deadlock
  - 大多數作業系統使用此方法，包含 Unix 系統
### Deadlock Prevention
- Mutual exclusion (ME): 將不需要 ME 的資源取消 ME
  - read-only 的資料不需要 mutual exclusion
  - 但有些資源不能共享，如 critical section、printer 等
- Hold & Wait:
  - 當一個 process 請求資源時，process 不能持有任何資源
  - 必須讓所有的資源一次性地都有對應的 process 處理，才進行下一步動作
  - 缺點是資源使用率非常低，可能很多 process 一直在 acquire 跟 release，沒辦法真正的執行
- No preemption:
  - 當 process 在等待資源時，該 process 持有的所有資源必須 preempted 掉
    - e.g. P1 請求目前被 P2 使用的 R1，而 P2 在等待 R2
    - P2 紀錄目前執行的狀態後，R1 可以被 preempted 掉，並重新分配給 P1 使用
    - 適合用在容易紀錄及恢復狀態的資源，如 CPU 的 register (context switch) 或 memory
- Circular Wait:
  - 對所有資源類型加上 total ordering，讓資源的取得有方向性
  - process 依遞增的順序來取得資源
    - 若$R=R_0,R_1,...,R_N$為一系列資源
    - 當 process 要求 $R_k$時，必須將所有$R_i,i>=k$ 釋放掉才可拿取資源
  - 例如現在有三個資源 F(tape drive) = 1, F(disk drive) = 5, F(printer) = 12，process 在拿 printer 的資源時，一定要將 printer 釋放掉，才能拿 tape & disk drive
  - 證明：要形成 Circular Wait，一定有以下情況
    - $P_0(R_0)->R_1,P_1(R_1)->R_2,...,P_N(R_N)->R_0$
    - 但這樣代表 $R_0$ 必須大於 $R_N$，產生矛盾

### Deadlock Avoidance
若無法做 Preventon，則必須在 run time 期間進行動態檢查
#### Avoidance Algorithms
- Single instance of a resource type
  - resource-allocation graph (RAG) algorithm based on circle detection
- Multiple instances of a resource type
  - banker’s algorithm based on safe sequence detection
#### Resource-Allocation Graph (RAG) Algorithm
- Request edge:
  - $P_j->R_j$，$P_j$在等待$R_j$的資源
- Assignment edge:
  - $R_j->P_j$，$R_j$被$P_j$所持有
- Claim edge:
  - $P_j->R_j$，$P_j$在未來可能要求$R_j$
  - 當 process 要求資源時，claim edge 轉換成 request edge
- 當資源被 process 釋放時，Assignment edge 轉換成 Claim edge

- Avoidance 基本上使用有向圖的方式建立 Claim edge，確認當任務 Assign 時是否有 Loop 產生，若產生迴圈則將拒絕或延遲該請求，如下圖 $P_1−>R_2$的請求在該狀況下不會被同意  
![](img/Ch7.Resource-Allocation_Graph_Algorithm.jpg)  
- Cycle detection 的演算法為 $O(n^2)$
#### Multiple instances
- Safe state: 若存在 sequence allocation 滿足所有 process 的請求，則系統處於 safe state
  - 此 sequence of allocation 稱為 safe sequence
- safe state -> no deadlock
- unsafe state -> 可能有deadlock
- deadlock avoidance: 確保系統永遠不會進入 unsafe state  

**Safe State with Safe Sequence**
- 有 12 個 tape drives
- t0 的狀態如下  
![](img/Ch7.Safe_State_with_Safe_Sequence.jpg)  
- $P_0$最多可能會一次要 10 個資源，才能執行下一個動作
- 目前已使用 9 個資源，剩下 3 個資源，因為$P_1$最多只需要 4 個，所以我們可以先把剩下的 3 個資源全部先給$P_1$，$P_1$執行完之後可用資源變成 5 個。這時候將 5 個資源全部給$P_0$，最後再給$P_2$就可以順利執行所有的 process，$<P_1,P_0,P_2>$稱為 safe sequence

**Un-Safe State w/o Safe Sequence**
- 仍然可能找不到 safe sequence，如下圖剩下 2 個資源，$P_1$執行後無論給$P_0$或$P_2$都不夠  
![](img/Ch7.Un-Safe_State.jpg)  
- 但仍未進入 deadlock 的狀態
- 這時候$P_0\&P_0$的請求就會被拒絕或延遲

**Banker’s algorithm**
- 適用於對每個資源是 multiple instances 的情況
- 使用通用的 *safety algorithm* 預先決定是否有 safe-sequence 存在
- 只在 safe-sequence 存在時才會去執行動作
- Algorithm：
  1. 假設 processes 需要 *Maximum resources*
  2. 找到一個 free resources 可以滿足的 process，將所有 free resources 給它
  3. 將該 process 的所有資源釋放掉
  4. 重複上述動作直到所有的 processes 都可滿足執行條件
- 範例：
  - Total instances: A=10, B=5, C=7
  - 初始 available 的 instances: A=3, B=3, C=2
  - Safe sequence 為 P1 -> P3 -> P4 -> P2 -> P0  
![](img/Ch7.Banker’s_Algorithm.jpg)  

### Deadlock Detection
偵測目前系統狀態有沒有 deadlock
- Single instance
  - 將 request/assignment edges 轉換為 wait-for graph
  - 若 wait-for 存在 circle 則有 deadlock
  ![](img/Ch7.Resource_allocation_graph_Corresponding_wait-for_graph.jpg)  
- Multiple instances: 不用管 Maximum Needs，只要確認 request 會不會造成 deadlock
  - Total instances: A:7, B:2, C:6
  - Available instances: A:0, B:0, C:0
  - safe state -> <P0, P2, P3, P1, P4> -> no deadlock
  - If P2 request = <0, 0, 1> -> no safe sequence -> deadlock  
![](img/Ch7.Multiple-Instance_for_Each_Resource_Type.jpg)  

### Deadlock Recovery
- Process termination
  - kill 全部的 deadlock processes
  - kill 一個或多個在 deadlock circle 的 process
    - 要決定哪些 process 應該被砍掉，即便砍掉 priority 最低的 process 也不一定能解決 deadlock 的問題，這是一個需獨立出來討論的議題
- Resource preemption
  - 選擇一個 victim 去 preempt
    - 哪一個應該被 preempt ?
  - rollback
    - 部份或全部 rollback ?
  - starvation
    - 若一直 preempt 相同的 process，則該 process 可能要等待很久的時間才能被執行

## CH11 File System Interface
### File Concept
- File: 為 OS 所創造的邏輯儲存單元 (logical storage unit)
  - 實體儲存單元 (physical storage unit) 則是 disk (sector, track)
- File 的屬性
  - Identifier: non-human-readable name, Name, Type, Location, Size, Protection, Last-access time, Last update time
- File 的操作
  - 創建、讀寫、repositioning、刪除、截斷 (truncate)、串接 (concatenate)
- Process 會儲存一個 open-file table
- Os 會儲存一個 system-wide table
#### Open-File Tables
- Open-file table 儲存 process 開啟的檔案資訊
  - 紀錄所有被這個 process 開啟的檔案
  - 每一個開啟檔案的指標
  - 檔案的存取權以及 Accounting information (CPU數量，真實使用的時間, 時間限制, 帳戶數量, 行程數量)
- System-wide table 儲存 OS 系統所有開啟的檔案資訊
  - open-file tables 所有檔案指向 system-wide table
  - 還會儲存獨立於 process 的資訊，如 disk location, access dates 還有 file size
  - Open count
    - 紀錄各個檔案被指向的次數，代表目前有多少 user 在存取該檔案
    - 若不同 process 存取同一個檔案，只會有一份檔案，確保資料的一致性  
    ![](img/Ch11.Open-File_Tables.jpg)  
#### Open-File Attributes
- Open-file attributes (metadata)
  - File pointer (per-process)
  - File open count (system table)
  - Dist location (system table)
  - Access rights (per-process)
- File types
  - 副檔名，如 .exe, .com, .obj …etc
  - 副檔名沒有實質意義，只是讓 OS 先嘗試用副檔名對應的應用程式開啟

### Access Method
- Sequential access: 檔案可以視為 Array of bytes，Sequential access 在讀檔時是連續讀取的，因此需要的資訊就是目前的位置以及要讀取多少長度
  - Read/write next (block)
  - Reset: 將 file pointer 重設到檔案的開頭
  - Skip/rewind n record
  - 因為硬碟磁寫頭的關係，連續的檔案讀寫效率較高
  - 邏輯上的連續，檔案儲存的實體硬碟位置不一定要連續  
  ![](img/Ch11.Sequential-access_file.jpg)  
- Direct (relative) access: 直接告訴系統要讀取檔案什麼位置的資料
  - 讀取 sequence 資料的任意元素
  - File operations 需要 block number 作為參數
  - 通常是為了 random access 的執行
  - 檔案由多個*大小固定*的區塊所組成，可直接將檔案指標移動到某個特定的區塊，並一次讀取整個區塊  
  ![](img/Ch11.Simulation_of_sequential_access_on_a_direct-access_file.jpg)  
- Index Access Methods
  - Index: 檔案各區塊的指標
  - 為了找到檔案中的 record:
  - 尋找 index file --> 找到指標
  - 利用指標去直接存取 record
  - 若檔案太大， index 有可能太大  
  ![](img/Ch11.Example_of_index_and_relative_files.jpg)  
- Physical 的硬體設備，執行 random access 類型的操作效率較低，如果要支援 random access，通常要有 memory cache 才會夠快

### Directory Structure
#### Partition, Volume & Directory
- A partition 硬碟分割 (formatted or raw)
  - raw partition (無檔案系統)： UNIX swap space, database
  - Volume 意指在硬碟格式化有了檔案系統之後的磁碟區域
  - 一個 partition 可以為硬碟的部份區域或是由多個硬碟組成 (如分散式系統)
  - 有一些儲存設備無法被分割，如 floppy disk (ex: 3.5 磁碟片)
- Dirctories 為檔案系統在 partition 儲存檔案資訊的區塊  
![](img/Ch11.A_typical_file-system_organization.jpg)  
#### Directory vs. File
- Directory: 為一群包含所有檔案資訊的節點
  - Directory structure 跟 files 都位於硬碟中  
  ![](img/Ch11.Directory_structure.jpg)  
- Directory 相關的操作
  - 搜尋檔案、創建檔案、刪除檔案、列出目錄清單(對現代檔案系統常常是效能瓶頸處，因為可能資料很多、檔案的樹狀結構很深)、重新命名檔案、遍歷檔案樹狀結構
#### Single-Level Directory
- 所有檔案都在一個 Directory 底下
- 檔案名稱必須唯一
- 當檔案越來越多，找尋檔案的效率非常差 (一定要遍歷)  
![](img/Ch11.Single-level_directory.jpg)  
#### Two-Level Directory
- 對每一個使用者有不同的 directory
- 路徑 = user name + file name
- 但是對每一個 user 來說，搜尋的效率仍然很差  
![](img/Ch11.Two-level_directory_structure.jpg)  
#### Tree-Structured Directory
- 絕對路徑： 由根節點開始
- 相對路徑： 由 directory (子節點)開始  
![](img/Ch11.Tree-structured_directory_structure.jpg)  
#### Acylic-Graph Directory
- 使用 links 來共享檔案或 directories
  - UNIX-like: symbolic link (ln -s/spell/count/dict/count)(<新連結位置> <原被連結位置>)
- 檔案可以有多個絕對路徑 (同個檔案給不同節點使用)
- 那何時檔案才真正被刪除？
  - 刪除檔案但留下 link --> dangling pointer
  - 真正保險的作法是刪除 link 但不刪除檔案，當檔案的 counter 歸零時才把檔案刪除  
  ![](img/Ch11.Acyclic-graph_directory_structure.jpg)  
#### Gerneral-Grpah Directory  
![](img/Ch11.General_graph_directory.jpg)  
- 與 Acylic-Graph 最大的差異是可能有迴圈
  - counter 在這邊無法確保 dangling pointer
  - 可能有 self-referencing file
- 所以問題在於要如何處理 dangling pointer
  - 定期做 Garbage collection
    - 首先遍歷整棵樹，標記樹可達到的檔案及 directories
    - 將沒有被標記的檔案 free 掉
    - 但是當檔案很多時效率很差，所以很多檔案系統不支援
  - 另一種解法是在檔案被建立時確認是否會產生迴圈  
  ![](img/Ch11.General-Graph_Directory.jpg)  

### File-System Mounting & File Sharing
#### File Mounting
- 要使用檔案系統時，必須先掛載到系統中，系統才能存取
- Mounting point: 意指 File System 被掛載的根路徑
- Mounting time:
  - boot time (如系統槽，作業系統也是存在檔案系統中)
  - automatically at run-time，如 usb 自動偵測
  - manually at run-time  
  ![](img/Ch11.File_System_Mounting_Example.jpg)  
#### File Sharing on Multiple Users
- 對每一個使用者會有 (userID, groupID)
  - ID is associated with every ops/process/thread the user issues
  - 一個 user 可以屬於多個 groupID
- 每一個 file 有三種屬性等級
  - owner (擁有者權限)
  - group (所屬 group 的權限)
  - others (非擁有者也非 group 成員的權限)
- 擁有者 attributes 描述擁有者對檔案操作的權限
  - group/others 也有對應的 attributes
  - group/others 的權限由 owner 或 root 決定  
  ![](img/Ch11.File_Sharing_on_Multiple_Users.jpg)  
#### Access-Control List
- 為了控制不同層級使用者的檔案權限，對每個 user 建立 Access-control list
  - 當使用者發出對檔案操作的請求時，透過 ACL 的資訊來確認是否合法
  - 但當使用者數量很多時，透過 ACL 確認太耗費資源及時間
- 現行作法是將使用者分成 3 類 --> 用 3 個 bit (RWX) 描述不同層級 user 對檔案的執行權限
  - owner (e.g. 7 = RWX = 111)
  - group (e.g. 6 = RWX = 110)
  - public (others) (e.g. 4 = RWX = 100)
  - 如下圖第一行中的 -rw-rw-r–，分別代表 owner-group-others 的權限  
  ![](img/Ch11.Access-Control_List.jpg)  
#### File Protection
- 檔案的擁有者/創建者必須能控制
  - 能執行哪些操作
  - 誰能執行這些操作
  - 由 ACL 定義
- 檔案應該要防止以下問題
  - physical damage (reliability): i.e. RAID
  - 不適當的存取: i.e. 密碼
- RAID:
  - 以下出自 Wiki
  > 容錯式磁碟陣列（RAID, Redundant Array of Independent Disks），舊稱容錯式廉價磁碟陣列（Redundant Array of Inexpensive Disks），簡稱磁碟陣列。利用虛擬化儲存技術把多個硬碟組合起來，成為一個或多個硬碟陣列組，目的為提升效能或資料冗餘，或是兩者同時提升。
  > 在運作中，取決於 RAID 層級不同，資料會以多種模式分散於各個硬碟，RAID 層級的命名會以 RAID 開頭並帶數字，例如：RAID 0、RAID 1、RAID 5、RAID 6、RAID 7、RAID 01、RAID 10、RAID 50、RAID 60。每種等級都有其理論上的優缺點，不同的等級在兩個目標間取得平衡，分別是增加資料可靠性以及增加記憶體（群）讀寫效能。  

  - 再參考 [筆記](https://hackmd.io/@Pl-eQT9CQaS0jhExKqL8_w/BkhOSR4jW/https%3A%2F%2Fhackmd.io%2Fs%2FByk4uIjGG)  
  > RAID在實體上是多顆硬碟，在系統中被當作一顆硬碟使用，而在作業系統底下，也還是可以將它分割為單一或多個分割區。因此建立好的RAID，使用起來跟單一硬碟是完全相同的，只是依組成方式的不同。RAID可以提供更大的容量、更高的讀寫效能，或是額外的「安全性」。（這裡所說的「安全性」，是指硬碟損毀之後資料重建、回復的能力，與加密防駭等功能無關）  

## Ch12 File System Implementation
### File-System Structure
#### File-System Structure
- 記憶體跟硬碟之間資料傳輸 I/O 的單位為 block
  - 一個 block 可能有多個 sector
  - 一個 sector 通常為 512 bytes
- 一個作業系統可以使用多個檔案系統
  - 如 NTFS, FAT32
- 檔案系統設計的兩個主要的問題
  - 與使用者程序的介面 (上層)
  - 與硬碟的介面 (下層)
#### Layered File System
- File System 實作上通常由以下三個部份組成
  - manages metadata: 找到 file 並提供 file handler
  - logical <–> physical mapping: 將程序要讀取內容的位置轉換成 block 的位置
  - read d1, c73…: 轉換後 physical 的 hard-drive 及 block 位置  
  ![](img/Ch12.Layered_File_System.jpg)  
- 對使用者而言，只有一個檔案系統的 Interface (FS manager) 存在，而 FS manager 會去呼叫檔案對應的檔案系統來執行  
![](img/Ch12.Layered_File_System_2.jpg)  

### File-System Implementation
#### On-Disk Structure(Structrue 要永久存在，所以一定要在 disk 上)
- Boot control block (per partition): 存放啟動作業系統相關資訊的 partition
  - 通常為 partition 的第一個 block (若是空的代表沒有 OS)
  - UFS (Unix File Sys.): boot block, NTFS: partition boot sector
- Partition control block (per partition): partition details
  - details: blocks 的數量, block 的大小, free-block-list, free FB pointers 等
  - UFS: superblock, NTFS: Master File Table
- File control block (per file): 跟檔案相關的 details
  - details: 權限, 大小, data blocks 的位置 (physical 的位置)
  - UFS: inode, NTFS: stored in MFT (relational database)
- Directory structure (per file system): organize files  
![](img/Ch12.On-Disk_Structure.jpg)  
#### In-Memory Structure
- in-memory partition table: 每個被掛載的 partition 的資訊
- in-memory directory structure: 最近存取的 directories 的資訊
- system-wide open-file table: 保存每一個被開啟檔案的 File Control Block
- per-process open-file table: 記錄僅跟 process 有關的資訊，比如讀檔位置、開檔權限等，另外會有 pointer (file handler/descriptor) 指到對應的 system-wide table entry
####  File-Open & File-Read
- 開檔時，會先去 directory structure 看檔案是否在記憶體中，若不存在則去硬碟把檔案的 directory structure 資訊讀進來，找到我們要的節點並指向 file control block，這樣就有了檔案的所有資訊
- 讀檔時，File handler 為 per-process open-file table 的某一個 entry，拿到 handler 確保檔案的確 open 之後，必須從 system-wide table 取得檔案的實際位置，才能從硬碟存取檔案內容  
![](img/Ch12.File-Open&File-Read.jpg)  
#### File Creation Procedure
1. 作業系統配置一個新的 FCB
2. 更新 directory structure
   - OS 將對應的 directory structure 讀進記憶體中，檢查有無同名稱的檔案
   - 將新的檔案名稱以及 FCB 更新進 directory structure
   - (在檔案關閉後) OS 將 directory structure 寫回硬碟中
3. 檔案出現在 user 的 directory command
#### Virtual File System
- VFS 提供物件導向的方法實作 file system
- 對於不同檔案系統，VFS 使用同一個 system call interface
- VFS 依 partition info 呼叫適當的 file system routines
- 在 Linux VFS 中有 4 個主要的物件類型
  - inode --> 單獨的檔案 (on disk)
  - file object --> 開啟的檔案 (檔案開啟後的操作物件)
  - superblock object --> 整個 file system (對應到 partition)
  - dentry object --> 單獨的 directory entry
- VFS 定義一系列需要實作的操作
  - int open(…) --> 開啟檔案
  - ssize_t read() --> 讀取檔案
#### Directory Implementation
- Linear lists (Linked list)
  - 檔案名稱的串列，且有相應的指標指向 data blocks
  - 容易實作但 performance 差
    - insertion, deletion, searching
- Hash table --> linear list w/ hash data structure
  - O(1) 搜尋時間
  - 每一個 hash key 對應到一個 linked list，避免 hash collision
  - hash table 通常有固定數量的 entry 數量  
  ![](img/Ch12.Directory_Implementation.jpg)  

### Allocation Method
#### Contiguous allocation
- 檔案在硬碟中的 block 擺放位置要連續
  - 不需要 traverse 整個硬碟，只要簡單的計算就知道檔案在哪一個 block --> *磁頭轉動的數量最小化*
  - Directory entry 只要存 start 跟 length，非常簡單
- 無論是 sequential 或 random 的存取都可以有效率地實作
- 很容易有 External fragment
  - 解法 --> compaction
- 檔案不能隨意地增長
  - 解法 --> extend-based FS (跟 linked list 結合)
#### Extent-Based File System
- 許多新的 file system 使用 modified contiguous allocation scheme
- extent 是一個硬碟連續的 blocks
  - 一個檔案可以有一個或多個 extents
  - 一個 extent 若用完了，系統會給檔案另一個 extent，利用 linked list 串起來
  - extents block 必須有一個 pointer 紀錄下一個 extent 的位置
- 缺點
  - Random access 的成本增加 (例如檔案在第三個 extent，必須要讀前兩個 extent 才知道第三個 extent 的位置，增加硬碟讀取的次數)
  - 仍然有 internal & external 的 fragment  
  ![](img/Ch12.Extent-Based_File_System.jpg)  
#### Linked Allocation
- 檔案由一個 linked list of blocks 所組成
  - 每一個 block 有 pointer 指向下一個 block
  - 因為需要空間放 pointer，所以 block 中實際存放的大小為 block size - pointer size
- Directory 只需要存起始跟結束的 block id
- 建立新檔案時並不需同時就宣告檔案大小
- 缺點
  - 只適合 sequential-access
    - random access 需要遍歷 linked list
    - 因為 link pointer 存放在 block 中，每次存取 block 硬碟都要讀取一次
  - 需要 pointer 的空間 (4/512 = 0.78%)
    - 解法：unit = cluster of blocks (把很多 blocks 組成一個大的 blocks，減少遍歷的長度)
    - 會有 internal fragment 的問題
  - Reliability
    - 只要遺失一個 link 等於整份檔案遺失  
    ![](img/Ch12.Layered_File_System_2.jpg)  
#### FAT (File Allocation Table) file system
- FAT32
  - 在 MS/DOS & OS/2 中使用
  - 將 links 存成一個 table (所有檔案的 links)
  - 每一個 table entry 大小為 32 bits
  - 放在 partition 的起始點  
![](img/Ch12.FAT32.jpg)  
- FAT (table) 通常會 cache 在記憶體中
  - Random access 的效率改善
  - 硬碟只要讀取 FAT 就可以找到檔案 block 的位置
#### Indexed Allocation Example
- Diretory 擁有 file index block 的 address
- 每一個 file 有自己的 index block
- index block 儲存檔案資料的 block id
- 將所有 blocks 的 pointer 集中存在一個 block 中
- 優點
  - direct & random access 非常有效率
  - 沒有 external fragmentation
  - 很容易去建立檔案 (只要有 free list，就知道哪裡有空間放檔案)
- 缺點
  - 需要額外放置 index 的 blocks，這些 blocks 只能放 index，空間被浪費掉
  - 另一個問題是，當檔案太大，一個 index block 存不下時，那 index block 應該要多大？
    - linked scheme
    - multilevel index
    - combined（inode in BSD UNIX） 
    ![](img/Ch12.Indexed_Allocation_Example.jpg)  

**Linked Indexed Scheme**
- 將 index blocks 用 linked list 串起來
- 適合 small to medium size 的檔案  
![](img/Ch12.Linked_Indexed_Scheme.jpg)  
**Multilevel Scheme (two-level)**
- 第一階的 index block 儲存第二階 index blocks 的指標
- 第二階 index blocks 的指標才指向 file blocks
- 適合 large size 的檔案  
![](img/Ch12.Multilevel_Scheme.jpg)  
**Combined Scheme: UNIX inode**
- File pointer: 4B (32 bits) --> 只能存放 4GB ($2^{32}$) 的檔案
- 可以讓不同大小的檔案，使用不同類型的 index scheme，提昇檔案讀取的效率
- 假設一個 block 的 entry 數量為 12，綠色因為每個 index blocks 還要存 2 個其他 index - blokcs 的 pointer，所以檔案大小乘上 $2^{10}$  
![](img/Ch12.UNIX_inode.jpg)  

### Free-Space Management
- Free-Space list: 紀錄所有硬碟中 free 的 blocks
- Scheme
  - Linked list (same as linked allocation)
  - Grouping (same as linked index allocation)
  - Counting (same as contiguous allocation)
- 檔案系統通常用與管理檔案相同的方式管理 free space
#### Bit vector
- Bit Vector (bitmap): 一個 block 一個 bit 來表示，0 為 free; 1 為 occupied
- 優點
  - 簡單且快速 (硬體支援 bit-manipulation instruction)
- 缺點
  - 為了效能必須存在快取中，當檔案很大時 bitmap 會非常大，超過快取的空間
  - 例如一個 1TB(4KB block) 的硬碟需要 32MB 的 bitmap  
  ![](img/Ch12.Bit_vector.jpg)  
#### Grouping & Counting
- Linked List
  - 所有的 free blocks 利用 linked-list 串在一起
  - 將第一個 free block pointer 存在 disk 和快取中的特別位置
  - 遍歷 list 效率低
  - 可參考 FAT 的作法所有的 link-pointers 存在一個 table 當中
- Grouping
  - 與 linked-index allocation 一樣
  - 將 n free blocks 的 address 放在 1st block
  - 前（n-1） pointers 是 free blocks
  - 最後一個 pointers 指向另一個 grouping block
- Counting
  - 與 contiguous allocation 一樣
  - 儲存第一個 free block 的指標和 contiguous free blocks 的數量
 
## Ch10 Mass Storage Structure
### Disk Structure
- 硬碟被 addressed 為一維的 logical blocks 陣列
  - logical block 為最小的轉換單位，一個 block 有很多個 sector
- Logical block 以 sequential 的方式映射到硬碟當中
  - Sector 0 為最外圈 1st track 的 1st sector
  - 從最外圈到最內圈  
  ![](img/Ch10.Disk_Mechanism.jpg)  
#### Sectors per Track
- Constant linear velocity (CLV)
  - 每一個 track bits 的密度相等
  - 外圈有更多的 sectors
  - 保持相同的 data rate
    - 所以內圈的旋轉速度要加快
  - 應用： CD-ROM & DVD-ROM
- Constant angular velocity (CAV)
  - 保持相同的旋轉速度
  - 內圈的 bit density 較高
  - 保持相同的 data rate
  - 應用：硬碟
#### Disk I/O
- 硬碟逆用 I/O bus 跟電腦連接
  - EIDE, ATA, SATA (Serial ATA), USB, SCSI 等等
  - I/O bus 利用 controller 來控制
    - Host controller (computer end)
    - Disk controller (disk drive 內建)

### Disk Scheduling
#### Introduction
- Disk-access 時間主要分成三個部份
  - seek time: 將讀寫頭移到需求的環上 (因為移動磁頭是機械式的行為， seek time 是影響硬碟讀取時間最主要的成份，scheduling 要盡可能讓磁頭在讀取多個檔案時能沿著一個方向移動)
  - rotational latency: 旋轉讀寫頭到需求的 sector 上
  - read time: 資料轉移的時間
- Disk bandwidth
  - bytes transferred/(完成最後一個請求的時間 - 開始讀取的時間)
#### Disk Scheduling
- 最小化 seek time
- 處理硬碟 I/O 請求的算法
  - FCFS（first-come, first-served）
  - SSTF（shortest-seek-time-first）
    - 普遍使用，但不是最佳的算法，且可能有 starvation 的問題
    - 適合 loading 輕的情況
  - SCAN
    - 在高負載的情況下表現佳
    - 沒有 starvation 的問題
  - C-SCAN（circular SCAN）
  - SCAN 的延伸，讓等待時間更為平均
  - LOOK and C-LOOK
#### FCFS (First-Come-First-Served)
- 對磁頭移動來說，只關心 track 的 ID
- 假設現在的 request queue 為 98, 183, 37, 122, 14, 124, 65, 67，Head pointer 為 53，總移動量為 640  
![](img/Ch10.FCFS.jpg)  
#### SSTF (Shortest-Seek-Time-First)
- SSTF 排程為 SJF 的一種，可能導致 starvation 的問題
- 總移動量為 236  
![](img/Ch10.SSTF.jpg)     
#### SCAN Scheduling  
- 磁頭先往一邊走再往另一邊走
- A.K.A elevator algorithm
- 各個請求被讀取的時間容易不平均 (ex: 若磁頭往左邊移動但新請求剛好再最右側)
- 總移動量為 236  
![](img/Ch10.SCAN.jpg)
#### C-SCAN Scheduling
- 磁頭只往一個方向移動
- 同方讀到底之後，會直接回到另一頭 (Round-Robin)
- 各個讀取時間比 SCAN 算法更為平均
- 總移動量 382  
![](img/Ch10.C_SCAN.jpg)  
#### C-Look
- 會看是否有新的請求與目前磁頭移動方向同向，有的話繼續延該方向移動
- 若同方向沒有新的請求，則磁頭不會回到最開頭，會從另一個方向最靠近目前位置的請求開始
- 若決定好移動位置，但此時有個介於之間的請求進來 (如 53 -> 65，決定了之後 64 才進來)，必須要等到下一次才會被讀取  
![](img/Ch10.C_LOOK.jpg)  
### Disk Management
#### Disk Formatting
- Low-level formatting (or physical formatting):
  - 硬碟出廠時，已完成格式化作業
  - 將 disk （magnetic recording material）切分成數個 dist controller 可讀寫得 sectors
  - 每個 sector = header + data area + trailer
    - header & trailer 存放 sector 的數量及 ECC (error-correcting code)
    - ECC 藉由所有在 data area 的 bytes 計算，用 trailer 的 ECC 檢查硬碟有無壞軌
    - data area size 通常為 512B, 1KB or 4KB
- OS 接著做以下兩個步驟來使用硬碟
  - 將硬碟 partition 成一個或多個 groups of cylnders
  - logical formatting (建立 file system)
#### Boot Block
- Bootstrap program
  - 初始化 CPU, registers, device, controllers, memory 並且啟動 OS
  - 第一階段 bootstrap code 存放在主機板的 ROM (Read Only Memory)
  - 完整的 bootstap 放在 boot block of boot disk (系統碟)
- Booting from a Disk in Windows 2000
  1. 執行 ROM 中的 bootstrap code
  2. 讀取 MBR (Master boot recode) 的 boot code
  3. 利用 partition table 找到 boot 的 partition 位置
  4. 讀取 boot sector/block 並繼續開機作業  
  ![](img/Ch10.Windows_Booting.jpg)  
- Bad Blocks
  - Simple disks like IDE disks
    - 人工執行格式化程式來標記 bad block 對應的 FAT entry
    - 壞軌的區域不允許用來 allocation
  - Sophisticated disks like SCSI disks
    - disk controller 會維護一個 bad blocks list
    - List 在硬碟的生命週期會持續更新
  - Sector sparing (forwarding): 將壞軌區的資料 map 到備用的區域
    - 可能會影響到 disk-scheduling 的表現 (Physical 位置不連續)
    - 在硬碟 formatting 時，每個 cylinder 會留少數備用的 sectors
  - Sector slipping: 若要避免 physical 位置不連續的問題，則必需將整個資料做搬移
- Swap-Space Management
  - Swap-space: 虛擬記憶體使用硬碟空間 (swap-space) 作為主記憶體的延伸
  - UNIX: 允許使用多個 swap spaces
  - Location
    - 一般檔案系統的一部分 (e.g. NT)
      - 效率較差
    - 獨立的 partition (raw partition)
      - 效率較好
      - 但 Size 固定
    - 允許上面兩種方式 (e.g. Linux)
- Swap Space Allocation
  - 1st version: process 創建時放在記憶體，並複製一份完整的內容到連續的硬碟空間。主要問題是所有內容都複製，造成空間浪費
  - 2nd version: 只複製要被 swap 的 pages 到 swap space
    - Solaris 1:
      - text segments(程式碼) 可以從檔案系統讀取，所以當 swap out 的時候直接捨棄
      - 只有 anonymous memory (stack, heap, etc) 這些動態配置的記憶體空間，在虛擬記憶體創建時，會存到 swap space
    - Solaris 2:
      - 更進一步只有當動態配置的記憶體空間需要被 swap out 時，才動態配置 swap-space，而非虛擬記憶體創建時就建立副本  
      ![](img/Ch10.Linux_Swapping.jpg)  

### RAID
- RAID (Redundant Arrays of Inexpensive Disks)
  - 用多個便宜的硬碟組成
  - 利用 redundancy 提昇可靠度
  - 藉由 parallelism 提昇效能
- RAID 排成不同 levels
  - Striping: 將檔案切割放到多個硬碟上
  - Mirror (Replication): 備份
  - Error-correcting code (ECC) & Parity bit: 產生 redundant 的 code 來做 Error detection 甚至檔案修補
#### RAID 0 & RAID 1
- RAID 0: non-redudant striping
  - 透過平行讀取增加效率
  - I/O 的 bandwidth 與 striping 的數量成正比
    - 若有 N 顆硬碟，則讀寫的頻寬提昇 N 倍
- RAID 1: Mirrored disks
  - 藉由 redundancy 提昇可靠度
    - 讀取的頻寬與硬碟數量成正比
    - 寫入的頻寬保持不變，甚至比單顆硬碟還差一些  
    ![](img/Ch10.raid0.jpg)![](img/Ch10.raid1.jpg)  
#### RAID 2: Hamming code
- 以 Hamming code 將資料編碼後分割成獨立的位元
- 在資料中加入錯誤修正碼
- e.g.: Hamming code(7, 4)
  - 4 data bits (在 4 個 disks 上) + 3 paritu bits (在 3 個 disks 上)
  - 每一個 parity bits 為 3 個 data bits 的線性編碼
- 當有一個 disk 失效時，可以復原資料
  - 可以偵測到兩個 disks(bits) 的錯誤
  - 但只能還原 1 bit 的錯誤
- 比 RAID 1 更好的空間使用率 (RAID 1 100% redundancy v.s. RAID 2 75 % redundancy)  
![](img/Ch10.raid2.jpg)  
#### RAID 3 & 4: Parity Bit
- 雖然 RAID 2 可以偵測硬碟錯誤，但 disk controller 就能夠偵測 sector 的讀取是否正確
  - 因為 low level 的 controller 會提供哪個 sector 壞掉，所以 1 個 parity bit 就足夠修正 1 個 disk 的失效
- RAID 3: Bit-level striping; RAID 4: Block-level striping
  - 優點是更好的空間使用率 (33% overhead)
  - 缺點是計算與儲存 parity bit 的 cost  
  ![](img/Ch10.raid3.jpg)  
- RAID 4: 因為 controller 不需要從多顆硬碟重建 block，有更高的 throughput
  - RAID3 一個 byte 會跨多顆硬碟，速度取決於最慢的那個 bit; 而 RAID 4 可以一次從單顆硬碟讀取多個 bits 的資料  
  ![](img/Ch10.raid4.jpg)  
#### RAID 5: Distributed Parity
- RAID 2~4 都把 parity 集中在單獨的一群硬碟中，對使用者來說這些硬碟等於不存在
- RAID 5 則將資料及 parity 放在一起
- 可以避免過度使用單一顆硬碟
- RAID 3 & 4 讀的時候可以平行處理，但寫的時候因為 parity bits 集中擺放，等於只有一顆硬碟的效率  
![](img/Ch10.raid5.jpg)  
- RAID 5 讀取的 BW 增進 N 倍
- RAID 5 寫入的 BW
  - 法一: 
    1. 讀取所有未變動的資料 (N-2) bits, 
    2. 重新計算 parity bit, 
    3. 將變動及 parity bit 寫回硬碟
    - BW = N / ((N-2) + 2) = 1 倍 --> 完全沒變
  - 法二: 
    1. 只讀取變動及 parity 的 bit, 
    2. 利用更新前後的差異計算新的 partiy, 
    3. 將變動及 parity bit 寫回硬碟
    - BW = N / (2+2) = N/4 倍 --> 有加速效果
- RAID 6: P+Q Dual Parity Redundancy
  - 類似 RAID 5, 但儲存更多的 redundant 資訊來防止多顆硬碟失效的情況
  - 使用 ECE code (i.e. Error Correction Code) 而非 single parity bit
  - Parity bits 通常也 strips 到多顆硬碟中
  - 複雜度及 overhead 高，除非資料真的很重要，一般不會採用這種實作   
  ![](img/Ch10.raid6.jpg)  
#### Hybrid RAID
- RAID 0+1: Stripe then replicate
- RAID 1+0: Replicate then stripe  
![](img/Ch10.raid01.jpg)  
![](img/Ch10.raid10.jpg)  

## References
- [周志遠教授作業系統講義](https://ocw.nthu.edu.tw/ocw/index.php?page=course_news_content&cid=141&id=999)
- [作業系統 Operating System 筆記](https://hackmd.io/@Chang-Chia-Chi/OS)
- [Operating Systems Course Notes Main Page](https://www.cs.uic.edu/~jbell/CourseNotes/OperatingSystems/)
- [作業系統OS](https://hackmd.io/@Pl-eQT9CQaS0jhExKqL8_w/BkhOSR4jW)
- [Operating System Concept 9th edition](https://www.os-book.com/OS9/slide-dir/index.html)
- [Mr. Opengate.](https://mropengate.blogspot.com/2017/09/operating-system-concepts.html) 
- https://hackmd.io/@YiZjennnnn/process_scheduling
- [MarkDown語法大全](https://hackmd.io/@mrcoding/ryZE7k8cN)

鈺淇

2024/07/04 08:35:38

圖片好像突然無法顯示了..

tim108108

2024/08/08 15:46:10

因為Hackmd 檔案有最大限制所以我把搬到 https://blog.d0ngy3.com/p/os/

2024/08/14 08:12:41

收到，謝謝!

顏宏宇

2024/11/16 06:56:25

您好，blog中有些圖檔也是掛掉，想請問會去做修正嗎？

dss