
<style> H2{color:#BF0060 !important;} H3{color:#009393 !important;} p{color:Black !important;} li strong {color:#4682b4 !important;} .alert-info{ background-color:#e4edf6 ; color: black; } .alert-warning{color: black;} td{font-weight:bold;} </style>

# Distributed File System

## **Basics**
* **2 major goals of a distributed file system**
    * Network transparency: users are not aware of where files are located
    * High availability

### Naming
* **3 Approaches**
    1. Concatenate the host name to the file name
        * Conflicts with network transparency and is not location-independent
    2. Mounting
        * Mounting info can be stored at
            * clients: each client can mount a server's file system at a different place
            * server: the server is responsible for updating the info when files are moved
    3. Have a single global directory
        * Limited to one server

### **File models**
* **Unstructured File**
    * The OS does not know the file format or fields; the application interprets the content itself
    * UNIX
* **Structured File**
    * The OS knows the file structure
    * Special-purpose use
* **Immutable File**
    * Maintains history
    * Easier to support file caching and replication
    * Disk access increases, so usually only the recent history is kept
* **Mutable File**
    * An update overwrites the old content

### **File Access Model**
* 2 factors: access method and data units
* **Access method**
    * 3 access methods
        1. ==Remote Service Model==
            * User process ^read(file,0,100)^⇨ Client ^msg^⇨ Server
            * `Network overhead`
        2. ==Data-Caching Model==
            * The client maintains a cache
            * Suppose the request is for 100 bytes: the server returns the entire file, which is stored in the cache ⇨ the user process gets data immediately if it is cached
            * `Concurrency control`: two machines may write to their own local caches
        3. ==Hybrid Method==
            * Uses data caching normally and switches to remote service when a write is needed
* **Data Units (unit of transfer)**
    * 4 possible levels

    | | pros | cons | example |
    | ------ | ---- | ---- | --------|
    | File | `efficiency`<br>`scalability`<br>`reliability`<br>`low disk access overhead` | Greater storage on client side | CFS, AFS-2 |
    | Block | `less storage on client side` | Poor performance when the entire file is requested | LOCUS, Sun's NFS |
    | Byte | `Max flexibility` | Difficult for cache mgt | Cambridge File Server |
    | Record | `Best for structured files` | | RSS |

### **Semantics of File Sharing**
* **Problem**
    * For a distributed system with caching clients:
        * A client writes to its cache, and another client reads before the server has been updated, so it reads stale data
* **Solution**
    * 4 ways of dealing with shared files in a DS

    |Method|Remark|
    |------|------|
    |1. Unix Semantics|Every operation on a file is instantly visible to all, and thus difficult to implement|
    |2. Session Semantics|No changes are visible to others until the file is closed|
    |3. Immutable Files|No updates are possible; simplifies sharing and replication|
    |4. Transaction|All changes occur atomically <br>(similar to a lock: other operations cannot be interleaved)|

## **Caching**

### **Cache Location**
* 3 places to reduce 3 kinds of latency
* Any cache kept in memory has a reliability problem: data not yet written to disk is lost after a crash

1. ==Server's Main Mem==
    * Eliminates <span style="background:#ADD8E6">disk access</span> (of the server)
    * Supports Unix file-sharing semantics
    * Problems of scalability (more clients) and reliability (server crash)
2. ==Client's Disk==
    * Eliminates the <span style="background:#ADD8E6">network access</span> (and the server's disk access); a local disk access is still required
    * Useful for the file-level transfer model
3. ==Client's Mem==
    * Eliminates the <span style="background:#ADD8E6">disk access, network access</span>
    * Best performance
    * Not suitable for file-level sharing (a whole file may not fit in memory)

### **File cache vs. Memory cache (in CPU)**

||File cache|Mem cache|
|-|-|-|
|size|up to a full file|L1 to L4: 4 KB to 128 KB|
|delay|communication + storage access|local bus + memory access|

### **Writing Policy**
* **Write-Through Policy**
    * Updates are immediately sent to the server
    * Unix-like semantics
* **Delayed-Writing Policy**
    * Mark the modified entries; all updated entries are gathered and sent to the server together, which improves performance but hurts reliability
    * Write on ejection from the cache
    * Write on close (session semantics)

### **Cache Consistency (CC)**
* **Terms**
    * Server-initiated CC
        * The server detects the inconsistency and informs the client's cache manager
    * Client-initiated CC
        * The client validates its data with the server on its own initiative
* **Scenario 1: Concurrent-Write Sharing**
    * Multiple readers and at least one writer
    * Solutions:
        * Locking
        * Do not allow file caching while someone is writing
        * Before a write, the server notifies all clients to invalidate their caches
* **Scenario 2: Sequential-Write Sharing**
    * A opens a file that has recently been modified by B, so A has outdated blocks
        * Solution: associate files with timestamps so that the server can detect the inconsistency
    * Data in B is still waiting to be flushed to the server (delayed-writing policy)
        * Solution: whenever a new client opens the file for reading, all other clients must flush their modified cache entries

## **Replication**

### Multicopy update protocol
* **Quorum-based protocol**

### **Fault Tolerance**
* **Failure**
    * Server/client crash ⇨ loss of state info
    * Transient faults, e.g. a power voltage drop
* **Stateful File Server**
    * Requires crash recovery
* **Stateless File Server**

## **CODA File System**
* **Overview**
    * Centrally administered by the Vice file system
    * Client: Virtue, which has a VFS
    * Cache manager: Venus, which has an RPC stub to communicate with the server
* **Communication**
    * RPC2: an enhanced RPC system
        * Periodically reports liveness
        * Supports side effects: printf, video play
        * Supports multicast: sends invalidation messages in parallel
* **Sharing Files**
    * The unit of sharing is a transaction
    * read and write are treated as a session
    * Updates are sent back only when the file is closed ⇨ Unix semantics cannot be achieved

### **Transactional semantics**
* **Serializable**
    * Operations are serializable because sessions can be ordered
* **Supports disconnection** (allows network partition)
    * Venus (cache manager) knows the necessary locks at the start of a session
    * Conflict across partition
        :::info
        * Solution 1: use 2PL
            * A solution exists if the transactions are serializable; ==using 2PL always guarantees serializability==
        * Solution 2: use version numbers
            * On reconnect, updates are sent back to the server for processing
            * An update is accepted when: the client's last version number + the number of updates the server has successfully applied during this session = the current version number + 1
        :::
    * Disconnected operations
        * While connected, Venus is in the Hoarding state; on disconnection it enters the Emulation state
        * On reconnect it enters the Reintegration state
    * Cache mgt
        * To support disconnection, CODA caches entire files
        * Priority: user-defined, recency in the access history, hierarchical (cache the whole file path)

### **2PL**
* If A unlocks only after B has finished, there is no consistency problem. To maintain consistency, a transaction must acquire all the locks it needs before it unlocks any data object
* The solution is therefore to implement 2PL
    * growing phase
    * shrinking phase
* **Test by serialization graph**
    * Graph
        * An arc from T~i~ to T~j~ means that T~j~ locks a data object after T~i~ has unlocked it
    * ==Topological sort== on the graph
        * Steps: repeatedly remove the nodes with indegree = 0
            * If one node points to you, your indegree = 1
        * The result is not unique
        * If the graph has a cycle ⇨ not serializable
        * No cycle ⇨ the topological ==order is a serial order== for the transactions

### **SS2PL**
* A failure case of the plain 2PL scheme
    * If T1 releases A early, T2 can read A
    * T1 later rolls back, so T2 has read invalid data and must roll back too
    * This leads to cascading rollbacks
* Strong strict 2PL was proposed to fix this
    * A transaction releases all of its locks at once, and only after it commits

### **Replica Control**
* **Scenario**
    * Using the read-one, write-all policy (read from any server, but write to all of them)
    * Conflict across partition
        :::info
        * Solution: resolve manually, but the conflict can be detected with version vectors. For a file replicated on 3 servers, one partition holds [2,2,1] and the other holds [1,1,2]; neither vector dominates the other, so the updates conflict
        :::

### **Cache Consistency**
* **The server is responsible**
    * The server tracks all clients that hold a cached copy; an update by a client triggers a callback
    * Upon modification, the server notifies these clients by sending an ==invalidate== message

## **NFS**

### **Architecture (Unix as an example)**
* VFS (virtual file system) lives in the kernel
    * Transparent access to remote files
    * Applications are unaware of it and use the usual system calls
    * VFS then decides whether a request goes to the ==NFS Client== or to the local file system
* NFS client and server communicate via RPC
    * RPC can run over TCP/UDP and, like a port, is open
* A machine running NFS can act as a client and as a server, and NFS is
    * OS-independent
        * Client and server may run different OSes
* **The file identifier in NFS is the File Handle**
    * The file handle is the communication standard, but each OS implements it differently
    * VFS is responsible for converting a file handle to a local file identifier
    * p79
        * The NFS client mounts first ⇨ the server returns a handle (the entry point of the file system)
        * The NFS client issues lookup, create, and mkdir ops; the server replies with a file handle
        * When the NFS client operates on a file, it passes the file handle as an argument so that the server knows which file to act on

### **NFS Access control**
* **Problem**
    * The server is stateless, so the client must send authentication (user id for access permission) with every RPC
* **Solution**
    * Kerberos
        * The client first confirms its identity with the Kerberos authentication server, then requests a ticket from the Kerberos ticket-granting server

### **Mount**
* **hard, soft mount**
    * When a user-level process accesses a file in a mounted file system and the server does not respond, a hard mount retries until the server is available, while a soft mount tries only a few times
* **automounter**
    * The mount point is decided dynamically; initially it points to an empty directory
    * The automounter maintains a mount point table listing NFS servers
    * When the NFS client resolves a file path, it probes the servers in the table; the first one to respond becomes the client's mount point
    * Advantages: [to be added]

### **Pathname Translation**
* The path must be resolved component by component on the client side, because a pathname can span different mount points

### **Caching**
* **Policy**
    * Read-Ahead
        * Prefetch pages
    * Delayed Write
        * An altered page is written to disk only when it is about to be replaced
        * sync operation: flush to disk every 30 s
* The write operation needs a message confirming that it has really completed
    * **Safe version**
        * The data is already written to disk before the server replies
    * **Efficient version**
        * In the delayed-write scheme the data is written only to the memory cache. To make sure the write has really completed, the client sends a commit when it closes the file, telling the server to write to disk; the client can rely on the write only after the server replies that it is done

### **Client cache**
* The client is responsible for polling the server
* **Validate cached blocks with timestamps**
    * Each data item in the cache has two timestamps
        * TC: last validated
        * TM: last modified at the server
* **Valid**
    * Condition 1: T - TC < t (T is the current time), e.g. less than 3 s since the last validation
        * Small t: better consistency
        * Large t: better efficiency
    * Condition 2: TM~client~ = TM~server~, i.e. the last-modified times are the same

### **Write Policy**

###### tags: `OS`
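
The timestamp check from the **Client cache** section can be sketched in a few lines. This is a minimal illustration, not NFS client code: `CacheEntry`, `is_valid`, `freshness` (the interval t) and `get_server_tm` (a stand-in for asking the server for its TM) are all hypothetical names. It also assumes the two conditions are combined with "or", as in the common textbook formulation: a recently validated entry is used directly, and the server is only contacted otherwise.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CacheEntry:
    tc: float         # TC: time this entry was last validated
    tm_client: float  # TM as recorded by the client when the block was cached


def is_valid(entry: CacheEntry, now: float, freshness: float,
             get_server_tm: Callable[[], float]) -> bool:
    """Return True if the cached block may be used without refetching."""
    # Condition 1: T - TC < t, validated recently enough, no server contact.
    if now - entry.tc < freshness:
        return True
    # Condition 2: TM_client == TM_server, checked by polling the server.
    if get_server_tm() == entry.tm_client:
        entry.tc = now  # the entry has just been revalidated
        return True
    return False
```

A small `freshness` makes the client poll the server more often (better consistency); a large one avoids round trips (better efficiency), which matches the trade-off noted for t.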