To FUSE or Not to FUSE: Performance of User-Space File Systems

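###### tags: `paper`

https://www.usenix.org/system/files/conference/fast17/fast17-vangoor.pdf

# Abstract

Traditionally, file systems are implemented in the OS kernel, which avoids the performance overhead of message passing and user-space daemons. This paper examines the performance cost of FUSE, the mainstream framework for user-space file systems. The experiments show that, depending on the workload and hardware, the degradation caused by FUSE ranges from completely imperceptible to as high as 83% even when optimized, and that relative CPU utilization can increase by 31%.

# Introduction

Four reasons why user-space file systems have become more popular in recent years:

1. Some stackable file systems add specialized functionality on top of an existing file system (for example deduplication and compression).
2. For academia and R&D, this architecture allows quick experimentation and prototyping.
3. Some existing kernel-level file systems have been ported to user space (for example ZFS and NTFS).
4. More and more companies use user-space file systems to implement their products.

The paper gives four reasons for choosing FUSE as the subject of analysis:

1. Its architecture is somewhat complex.
2. Its internals have rarely been examined from the outside.
3. Because of its complex asynchrony and user-kernel communication, the FUSE source code is hard to analyze.
4. FUSE keeps growing in popularity, so analyzing its performance becomes correspondingly more valuable.

The authors developed a simple pass-through stackable file system in FUSE, layered on top of Ext4, and compared its performance with native Ext4. They ran a wide range of micro- and macro-workloads on different hardware, using both a basic and an optimized FUSE configuration. The results show that, depending on the workload and hardware, FUSE can perform as well as native Ext4, but in the worst case can be 3 times slower. Finally, the paper also builds an instrumentation system that collects detailed statistics about FUSE. With it, the authors were able to find and explain the causes behind FUSE's performance characteristics.

# FUSE design

The authors use Linux kernel 4.1.13 together with FUSE v2.9.4 for their analysis.

## High-Level Architecture

<center>
<img src="https://i.imgur.com/PJqisBm.png">
</center>
<br>

FUSE consists of a kernel part and a user-level daemon. The kernel part is implemented as a Linux kernel module. When loaded, it registers a fuse file-system driver with the VFS. This fuse driver acts as a proxy for the various specific file systems implemented by different user-level daemons.

Besides registering the file system, the FUSE kernel module also registers a <code>/dev/fuse</code> device. This device serves as the interface between the user-space FUSE daemon and the kernel. The daemon reads requests from <code>/dev/fuse</code>, processes them, and writes the replies back to <code>/dev/fuse</code>.

When a user application performs an operation on a mounted FUSE file system, the VFS routes the operation to FUSE's kernel driver. The driver allocates a FUSE request structure and puts it in its own queue. At this point, the process that submitted the operation is usually put into a wait state. The user-level FUSE daemon then picks the request from the kernel queue by reading from <code>/dev/fuse</code> and processes it.

Processing a request may require re-entering the kernel. For example, in a stackable FUSE file system the daemon submits operations to the underlying file system (say, Ext4). In a block-based FUSE file system, the daemon reads from or writes to the block device. When it is done with the request, the FUSE daemon writes the response back to <code>/dev/fuse</code>. FUSE's kernel driver then marks the request as completed and wakes up the user process that originally issued it.

Some file system operations can complete without communicating with the user-level daemon at all. For example, reads from a file whose pages are already in the kernel page cache do not need to be forwarded to the FUSE driver.

## Implementation Details

<center><img src="https://i.imgur.com/gLMFlKz.png"></center>

Several important implementation details of FUSE are discussed here:

* The user-kernel protocol
* Library and API levels
* In-kernel FUSE queues
* Splicing
* Multithreading
* Write-back cache

### User-kernel protocol

As Table 1 above shows, most requests map directly to traditional VFS operations. Only the less intuitive requests (marked in bold in Table 1) are discussed here.

#### INIT request

An INIT request is produced when a file system is mounted. At this point user space and the kernel negotiate:

1. Which protocol version to use
2. The set of capabilities supported by both sides (for example READDIRPLUS or FLOCK support)
3. Various parameter settings (for example FUSE read-ahead size and time granularity)

#### DESTROY request

A DESTROY request is produced when the file system is being unmounted. On receiving DESTROY, the daemon is expected to perform all necessary cleanup. No more requests will come from the kernel for this session, and subsequent reads from <code>/dev/fuse</code> return 0, which lets the daemon exit gracefully.

#### INTERRUPT request

The kernel emits an INTERRUPT request whenever a previously sent request is no longer needed (for example, when a user process blocked on a READ is terminated). Every request has a unique sequence# that INTERRUPT uses to identify the victim request. Sequence numbers are assigned by the kernel and are also used to locate completed requests when user space replies.

Every request also carries a node ID, an unsigned 64-bit integer that identifies the inode in both the kernel and user space.

#### LOOKUP request

Path-to-inode translation is done by the LOOKUP request. Every time an existing inode is looked up (or a new inode is created), the kernel keeps the inode in the inode cache.

#### FORGET request & BATCH_FORGET request

When an inode is removed from the dcache, the kernel passes a FORGET request to the user-space daemon. The daemon may then decide to deallocate any corresponding data structures. BATCH_FORGET allows the kernel to forget multiple inodes with a single request.

#### OPEN request

An OPEN request is generated when a user application opens a file. When replying to this request, a FUSE daemon can optionally assign a 64-bit file handle to the opened file. The kernel then returns this file handle along with every request associated with that opened file. The user-space daemon can use the handle to store per-opened-file information. For example, a stackable file system can store the descriptor of the file it opened in the underlying file system as part of FUSE's file handle.

#### FLUSH request & RELEASE request

A FLUSH request is generated every time an opened file is closed. A RELEASE request is sent when there are no more references to a previously opened file.

#### OPENDIR request & RELEASEDIR request

OPENDIR and RELEASEDIR have the same semantics as OPEN and RELEASE, but apply to directories.

#### READDIRPLUS request

Like READDIR, a READDIRPLUS request returns one or more directory entries, but it also includes metadata for each entry. This allows the kernel to pre-fill its inode cache.

#### ACCESS request

An ACCESS request is generated when the kernel evaluates whether a user process has permission to access a file. By handling this request, the FUSE daemon can implement custom permission logic. However, users typically mount FUSE with the default_permissions option, which allows the kernel to grant or deny access to a file based on its standard Unix attributes (ownership and permission bits). In this case no ACCESS requests are generated.

### Library and API levels

Conceptually, the FUSE library consists of two levels. The lower level takes care of:

* Receiving and parsing requests from the kernel
* Sending properly formatted replies
* Facilitating file system configuration and mounting
* Hiding potential version differences between the kernel and user space

The high-level FUSE API builds on top of the low-level API and lets developers skip implementing the path-to-inode mapping, so the high-level API operates directly on file paths. The high-level API also handles request interrupts and provides other convenience features. For example, developers can use the more common chown(), chmod(), and truncate() methods instead of the lower-level setattr().

### Queues

<center><img src="https://i.imgur.com/589hpGi.png"></center><br>

FUSE maintains five queues:

* Interrupts
* Forgets
* Pending
* Processing
* Background

At any point in time a request belongs to exactly one queue.

* INTERRUPT requests go to the interrupts queue
* FORGET requests go to the forgets queue
* Synchronous requests (e.g. metadata) go to the pending queue

When a file-system daemon reads from <code>/dev/fuse</code>, requests are transferred to the user daemon as follows:

* Priority is given to requests in the interrupts queue; they are transferred to user space before any other request.
* FORGET and non-FORGET requests are selected fairly: for every 8 non-FORGET requests, 16 FORGET requests are transferred. This smooths out bursts of FORGET requests while still allowing other requests to proceed.
* The oldest request in the pending queue is transferred to user space and simultaneously moved to the processing queue. The processing queue therefore holds the requests that the daemon is currently handling.
* If the pending queue is empty, the FUSE daemon blocks on the read call. When the daemon replies to a request (by writing to <code>/dev/fuse</code>), the corresponding request is removed from the processing queue.

The background queue is used for staging asynchronous requests. Typically, only read requests go to the background queue; writes go there as well, but only when the writeback cache is enabled. In that configuration, writes from user processes first accumulate in the page cache, and later the bdflush threads wake up to flush the dirty pages. While flushing pages, FUSE forms asynchronous write requests and puts them in the background queue.

Requests in the background queue gradually trickle into the pending queue. FUSE limits the number of asynchronous requests simultaneously residing in the pending queue to the configurable max_background parameter (12 by default). When fewer than 12 asynchronous requests are in the pending queue, requests from the background queue are moved to the pending queue. The intention is to limit the delay that a burst of background requests could impose on important synchronous requests.

### Splicing and FUSE buffers

In the basic setup, the FUSE daemon has to read requests from and write replies to <code>/dev/fuse</code>. Every such call requires a memory copy between the kernel and user space. This is especially harmful for WRITE requests and READ replies, because they often carry large amounts of data.

**Splicing** allows user space to transfer data between two in-kernel memory buffers without copying the data to user space. For example, a stackable file system can pass data directly to its underlying file system.

To support splicing, FUSE represents its buffers in one of two forms:

1. A regular memory region identified by a pointer in the user daemon's address space
2. Kernel-space memory pointed to by a file descriptor

If a user-space file system implements the write_buf method, FUSE splices the data from <code>/dev/fuse</code> and passes it directly to this method in the form of a buffer containing a file descriptor. FUSE splices WRITE requests that contain more than a single page of data. Similar logic applies to replies to READ requests with more than two pages of data.

### Multithreading

When two or more requests are available in the pending queue, FUSE automatically spawns additional threads. Every thread processes one request at a time. After processing a request, each thread checks whether there are more than 10 threads; if so, that thread exits.

### Write back cache and max writes

The basic FUSE write behavior is synchronous, and at most 4KB of data is sent to the user daemon per write. This causes performance problems for some workloads: when copying a large file into a FUSE file system, <code>/bin/cp</code> indirectly causes every 4KB of data to be sent to user space synchronously. The solution is to make FUSE's page cache support a write-back policy, so that writes become asynchronous and file data can be pushed to the user daemon in larger chunks of max_write size (limited to 32 pages).

# Others

* **Stackable file system**
    * A stackable (layered) file system does not store any data itself; it relies on another file system to store the data for it.
    * This architecture makes it possible to add specialized functionality to a file system, such as compression or encryption and decryption (see the examples below).
    * <center><img src="https://i.imgur.com/DXQhXqn.png"></center>
    * <center><img src="https://i.imgur.com/uIMV7gA.png"></center>
    * [Reference](https://www.fsl.cs.sunysb.edu/docs/sipek-ols2007/index.html)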
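
To make the stackable idea a little more concrete, here is a minimal sketch in plain Python. It does not use FUSE or any FUSE API: `CompressingLayer`, `write_file`, `read_file`, and `demo` are hypothetical names invented for illustration, and an ordinary directory stands in for the underlying file system. The only point it tries to show is that the layer stores nothing itself and merely transforms data on its way to and from the lower layer.

```python
import os
import tempfile
import zlib


class CompressingLayer:
    """Hypothetical stackable layer: keeps no data itself, only transforms it."""

    def __init__(self, lower_root):
        # Directory on the underlying file system that actually holds the bytes.
        self.lower_root = lower_root

    def _lower_path(self, name):
        return os.path.join(self.lower_root, name)

    def write_file(self, name, data):
        # Compress on the way down, then delegate storage to the lower layer.
        with open(self._lower_path(name), "wb") as f:
            f.write(zlib.compress(data))

    def read_file(self, name):
        # Read from the lower layer, then decompress on the way up.
        with open(self._lower_path(name), "rb") as f:
            return zlib.decompress(f.read())


def demo():
    # Returns (round trip is lossless, stored form is smaller than the payload).
    with tempfile.TemporaryDirectory() as lower:
        layer = CompressingLayer(lower)
        payload = b"FUSE " * 1000
        layer.write_file("note.txt", payload)
        stored = os.path.getsize(os.path.join(lower, "note.txt"))
        return layer.read_file("note.txt") == payload, stored < len(payload)
```

A real stackable FUSE file system would do the equivalent work inside its request handlers; the pass-through file system studied in the paper is the degenerate case where the transformation is the identity.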
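
Going back to the "Write back cache and max writes" section, the effect of larger chunks can be illustrated with simple arithmetic. The sketch below assumes a 4KB page (so 32 pages is 128KB) and assumes every WRITE request carries a full chunk; `write_requests` and `compare` are hypothetical helper names, and the numbers are a back-of-the-envelope count, not measurements from the paper.

```python
PAGE_SIZE = 4 * 1024       # assumed 4KB page, matching the 4KB writes in the notes
MAX_WRITE_PAGES = 32       # upper bound on max_write mentioned in the notes


def write_requests(file_size, chunk_size):
    """Number of WRITE requests if each one carries at most chunk_size bytes."""
    return -(-file_size // chunk_size)  # ceiling division


def compare(file_size):
    # Basic mode: one request per 4KB. Write-back mode: one per max_write chunk.
    basic = write_requests(file_size, PAGE_SIZE)
    writeback = write_requests(file_size, PAGE_SIZE * MAX_WRITE_PAGES)
    return basic, writeback
```

Under these assumptions, copying a 1 GiB file takes 262144 requests at 4KB each versus 8192 requests at 128KB each, a 32x reduction in user-kernel round trips.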