# NCTU OSDI Discussion - File System
###### tags: `OSDI`

* How does the Linux kernel mount a new file system on top of an existing one?
    * [Source](https://dri.freedesktop.org/docs/drm/filesystems/vfs.html#registering-and-mounting-a-filesystem): When a request is made to mount a filesystem onto a directory in your namespace, the VFS will call the appropriate mount() method for the specific filesystem. New vfsmount referring to the tree returned by ->mount() will be attached to the mountpoint, so that when pathname resolution reaches the mountpoint it will jump into the root of that vfsmount.
* Can a file system be regarded as a device driver, or is it just a piece of code in the kernel?
    * Yes, both are Linux kernel modules. Kernel modules are, in turn, pieces of code in the kernel.
    * A device driver is also a piece of kernel code.
    * A file system is a kernel module.
* If a file system is treated as a device driver, won't that lead to duplicated file system code?
    * Multiple file system drivers can coexist, even on the same device (on different partitions):
        * VFS -> FS#1 -> Block device driver #1 / Partition #1
        * VFS -> FS#2 -> Block device driver #1 / Partition #2
        * VFS -> FS#3 -> Block device driver #2 / Partition #1
    * A file system is a common module with a single copy of its code; the kernel has to handle sharing it correctly.
    * The HD driver only sees block IDs, so sharing is not a problem. The file system above it translates inodes and similar structures into block IDs.
    * ![](https://i.imgur.com/dpADF2O.png)
* What are the details of deleting a file? When a user deletes a file, is it only removed at the VFS or file system level, with the old data simply overwritten later when the driver receives a write to the same block?
    * It depends on the file system. Each file system provides a `delete()` operation for the kernel to call, and the implementations differ.
    * It may only remove the link to the inode, which is the fastest approach and is why recovery is possible.
    * If the machine powers off midway through a deletion while the records are still only in memory, the records on the HD are left incomplete.
    * An inode check is performed at boot.
    * See the ext3 example below:
    * ![](https://i.imgur.com/Gr8LYhI.png)
    * ![](https://i.imgur.com/0QKzLs2.png)
    * https://www.slashroot.in/how-does-file-deletion-work-linux
* Why can reads and writes starve? If requests are merged and sorted, isn't the worst case that a request is served on the next sweep?
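    * Read-read starvation: consider two read requests. One touches widely scattered blocks; the other touches contiguous blocks close to the disk head, so it gets merged and served first. The scattered request merges poorly and keeps being passed over. If the scattered request was issued first, this is unfair and its performance suffers. Strictly speaking it is not starvation, because the request is eventually served, but it is unfair.
    * Read-write starvation: there is a read queue and a write queue. Reads are served first and writes are deferred, so when there are many reads the write queue can starve.
    * Read and write operations cannot be merged and sorted together. If writes are delayed, they can starve.
* Does the file system know about or manage swap space? From the block device driver's point of view, is swapping out the same as writing a block?
    * There are two approaches: a swap partition and a swap file. A swap file lives inside the file system. A swap partition is a region set aside in advance; it is accessed with raw I/O, that is, its blocks hold memory pages mapped directly onto the HD without going through the file system. The device driver is still involved, because every access to the HD goes through it, so swap traffic competes with other I/O.
    * “Linux has two forms of swap space: the swap partition and the swap file. The swap partition is an independent section of the hard disk used solely for swapping; no other files can reside there. The swap file is a special file in the filesystem that resides amongst your system and data files.”
    * ![](https://i.imgur.com/9HUcHjO.png)
    * ![](https://i.imgur.com/2jwdBfL.png)
    * The 9th partition is the swap partition.
    * These are the two configurations.
* The OCW lecture mentioned that there is no file system designed specifically for SSDs?
    * In recent years Apple created APFS, but it performs poorly on HDDs: metadata is scattered across the disk rather than stored together, so looking up files is very slow.
    * The instructor said this has been a hot topic recently.
* Since SSDs support random access, can the block size be reduced to cut down fragmentation?
    * The instructor said this is a good question XD
    * ![](https://i.imgur.com/wYMN3b0.png)
    * ![](https://i.imgur.com/8xfXuwt.png)
    * With 4 SSDs, the results turn out to be about the same as an ordinary hard disk.
    * SSDs may benefit from read-ahead, fetching some of the following data in advance.
    * Moving data with DMA still takes time; transferring it in many small pieces is slower overall.
    * On an HD, the seek and rotation time per request is fixed, so larger reads mean fewer seeks and rotations, which is why performance levels off for larger sizes.
* The memory management lectures said it is not a good idea to move data around during idle time to fix fragmentation, because moving a program that needs to run causes problems. Is it more feasible to move data on disk during idle time? With enough free space, copy-on-write should make this reasonably safe.
    * Disks suffer from fragmentation too, and it hurts performance.
    * ![](https://i.imgur.com/Xfp3R8y.png)
    * This is a bad layout: seeks are expensive, so the disk needs to be defragmented, that is, the physical locations that the inodes map to are rearranged.
    * Yes. Check your computer when it is idle: it is usually optimizing your disk layout.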
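
Going back to the read-write starvation answer above, here is a minimal sketch, assuming a toy scheduler with one read queue and one write queue. The function name `dispatch_order` and the `max_consecutive_reads` parameter are made up for illustration and are not kernel APIs. It shows how always preferring reads pushes every write to the end, and how bounding the number of consecutive reads lets writes through.

```python
from collections import deque


def dispatch_order(reads, writes, max_consecutive_reads=None):
    """Return the order in which a toy scheduler dispatches requests.

    reads, writes: request labels already waiting in each queue.
    max_consecutive_reads: hypothetical knob; None means reads always win.
    """
    read_q, write_q = deque(reads), deque(writes)
    order = []
    consecutive_reads = 0
    while read_q or write_q:
        # A write is forced once enough reads have been served in a row.
        write_is_due = (
            max_consecutive_reads is not None
            and consecutive_reads >= max_consecutive_reads
            and bool(write_q)
        )
        if read_q and not write_is_due:
            order.append(read_q.popleft())
            consecutive_reads += 1
        else:
            order.append(write_q.popleft())
            consecutive_reads = 0
    return order


reads = ["r1", "r2", "r3", "r4"]
writes = ["w1", "w2"]
print(dispatch_order(reads, writes))
print(dispatch_order(reads, writes, max_consecutive_reads=2))
```

Without the bound, both writes are dispatched only after all four reads. With a bound of 2, a write is interleaved after every two reads. In a real system reads keep arriving, so without some bound the writes could wait indefinitely.
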
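* Why does `dev_read()` place the data in a kernel-space buffer and then copy it to user space? Wouldn't copying directly to user space save time? (Is it because user-space memory may be swapped out to disk?)
    * Yes, you can copy data blocks directly to user space to improve performance for a specific user. Think about why, by default, we read into the kernel first.
    * Any process can read a file, so the kernel keeps a cache that all of them can share.
    * User space pointers may be malicious.
* Are the superblock's `super_operations` (such as `alloc_inode` and `write_inode`) an API that every OS can use? Can Windows share these operations with Linux? (`alloc_inode` seems to be a Linux function.) My understanding is that the inode is a Linux structure and Windows has no inodes, yet when I dual-booted Linux and Windows, both could use the same hard disk.
    * ![](https://i.imgur.com/qhUT9J2.png)
    * This is an architecture diagram of the Windows design.
    * Your Linux and Windows installations share the same disk but use different partitions, right?
    * An OS needs its own file system, like root (/) in Linux, to store system files. A USB drive, by contrast, is used purely as data storage, so different OSes can share it.
* After a file system is mounted, does its superblock stay in memory without being swapped out?
    * Part of the mount information is stored in kernel data structures in memory. The details need to be checked.
    * The instructor said some of it "should" stay in the kernel permanently.
* When multiple cores write to the same file, can a race condition occur? With an I/O scheduler, it seems that protecting the write queue would be enough to prevent simultaneous writes to the file?
    * ![](https://i.imgur.com/HOHD6hM.png)
    * Usually the file simply cannot be opened; for the most part, simultaneous writes are not supported.
* The architecture diagram shows a disk cache layer below the VFS. Where does this disk cache live? If it is in the kernel, does that limit the size of files that can be read (e.g. 32-bit Linux can hold at most 1 GB of cache)? Or does it only cache small files?
    * The caches discussed for disk I/O are all buffer memory.
    * ![](https://i.imgur.com/rXZsczg.png)
    * ![](https://i.imgur.com/Mk7Q2bl.png)
    * Many of the caches here are not hardware caches.
    * An HD runs its own OS for its own use, stored on its own flash, much like an RPi. It sends and receives data and manages hardware such as the disk head. So a PC contains many OSes, each handling its own job and communicating over various buses.
* Does the disk cache also distinguish between write-through and write-back? Can it be configured? (Solved: the mode has to be changed in the BIOS; commands can only turn caching on or off.)
    * Ref: http://benjr.tw/20361
* Linux first reads disk data into kernel-space memory and then copies it to user space for the process; if another user needs it, it is copied again. Can this lead to reads and writes getting out of sync?
    * [Solved] In most cases data is flushed back to kernel space only after the buffer fills up, so inconsistency can occur. Calls such as `fcntl` and `flock` can be used for synchronization.
* Everything in a file system starts from the superblock, so how does the OS locate the superblock when it looks for the file system at boot?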
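  If the location is fixed, how is a relocation recorded when the superblock gets damaged?
    * ![](https://i.imgur.com/t64cJUf.png)
    * Disks translate logical block IDs to physical block IDs. The user doesn't need to record physical block IDs.
    * Such a disk basically cannot be used for booting, but it can still be used for other things such as storing data.
    * The HD may also maintain its own table for protection: when the OS asks for block 1, that may be just a logical block ID that maps to a different physical block.
* The virtual file system is how the user and the kernel define a file system, while the OS and the device define it by its on-disk format, e.g. FAT or NTFS. For the OS, the only difference between formats is how the `super_operations` recorded in the superblock are implemented. Is this understanding correct?
    * Yes.
    * The instructor is very fond of the VFS design.
    * ![](https://i.imgur.com/VWHE47f.png)
* If part of the disk is damaged and the data is relocated, do the block numbers recorded by the file system stay the same?
    * The disk (hardware) usually handles the physical translation.
    * That is the HD's translation table.
* After putting some files on a USB drive on one computer, unplugging it, and plugging it into another, how does the second computer see the drive's superblock, FAT table, and other data?
    * Think it through; it is a complicated but not a difficult question.
    * When a USB drive is plugged in, the OS launches its USB mechanism, detects that the device is mass storage, and hands it to the VFS. The superblock is read to find out which file system it is, and the corresponding driver is used to mount its data.
    * If someone deliberately writes a malformed file system, the OS notices that it does not read correctly and reports an error rather than risk reading or writing it.
    * First, the device driver (USB plug and play, hot plugging).
    * Second, the VFS and FS mounting process (superblock and file system).
* How does a system call find a file's address from its file name? (solved)
* Is exFAT an improvement of FAT based on the advantages of ext? (solved)
* Are file systems for SSDs being developed now? (solved)
* Is the memory page size the same as the disk block size?
    * Not necessarily, but one must evenly divide the other.
    * They do not have to be the same size; the two are chosen for different reasons.
    * The memory page size depends on usage patterns and usage volume.
    * The disk block size is mainly about performance.
    * But one must evenly divide the other for the mapping to work.
* If everything in Linux, including directories, is a file, why is the dentry needed? Couldn't an inode plus a flag marking it as a directory be enough? Or am I misunderstanding what dentries are for?
    * There is no file name in the inode.
    * The `dentry` also helps the kernel speed up lookups and reduce reads from the hard disk.
    * ![](https://i.imgur.com/VbXxxHf.png)
* The inode cache and dentry cache are software mechanisms, right? If so, is there a dedicated region in the kernel for them? Are they shared system-wide, or does each task have its own cache?
    * The inode cache, dcache, and buffer cache are all in-memory software mechanisms shared by all processes. Do not confuse them with hardware caches.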
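
The dentry answers above (no file name in the inode, and the cache saving disk reads) can be sketched with a toy model. This is only an illustration under simplifying assumptions: `ToyFS`, `ROOT_INO`, and the `disk_reads` counter are hypothetical names, and the real dcache is far more involved. Directories map names to inode numbers, and a cache keyed by (parent inode, name) remembers each resolved path component.

```python
ROOT_INO = 1


class ToyFS:
    """Toy file system: inodes carry no names; directories map names to inode numbers."""

    def __init__(self, inodes):
        # inode number -> dict (directory: name -> inode number) or bytes (file data)
        self.inodes = inodes
        self.dcache = {}      # (parent inode number, name) -> child inode number
        self.disk_reads = 0   # how many times a directory had to be read "from disk"

    def _read_dir(self, ino):
        self.disk_reads += 1
        return self.inodes[ino]

    def lookup(self, path):
        """Resolve a path one component at a time, consulting the cache first."""
        ino = ROOT_INO
        for name in (part for part in path.split("/") if part):
            key = (ino, name)
            if key not in self.dcache:
                self.dcache[key] = self._read_dir(ino)[name]
            ino = self.dcache[key]
        return ino


fs = ToyFS({1: {"A": 2}, 2: {"B": 3}, 3: {"target_file": 4}, 4: b"hello"})
print(fs.lookup("/A/B/target_file"), fs.disk_reads)
print(fs.lookup("/A/B/target_file"), fs.disk_reads)
```

The first lookup reads three directories; the second resolves the same path entirely from the cache, so the read counter does not change.
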
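* The block I/O video mentioned that the block device driver may schedule its requests to the HD in its own way, but the HD also has its own OS that may reorder the queue again. Isn't the driver just an interface for controlling the hardware? Why does the HD second-guess it and sort again? It seems redundant. When is this feature needed?
    * The device maker has many tasks to handle inside the hard drive, for example physical-to-logical block mapping, scheduling, ...
    * The driver may only see logical block IDs; only the disk really knows where the physical blocks are.
    * People tend to look at the whole system from a god's-eye view. Why are so many schedulers needed? The global level has its own arrangement, but the local situation may differ.
    * How many OSes are actually running at the same time on a real PC?
* (solved) Is a file system a set of functions in the kernel waiting to be called? That is, after a system call such as open, the file system functions are called to locate the blocks? Or in what form does it exist?
* Is file system fragmentation handled with methods similar to those used for memory, or does it not matter because disk space is cheap?
    * Disk defragmenter.
    * ![](https://i.imgur.com/atwQ5Q6.png)
* (solved) Given that the inode structure exists, why is the dentry still needed? Doesn't Linux treat directories as files?
    * [ref](https://bean-li.github.io/vfs-inode-dentry/)
* When implementing an OS on the rpi we use the FAT16/32 format. At boot, how does the rpi find `kernel8.img` on the SD card and load it at 0x80000? Is FAT handling built into the ROM?
    * The bootloader in the rpi GPU mounts a FAT32 partition on the SD card and runs `bootcode.bin` from it. In the second stage, it enables the cache and memory and loads the GPU firmware `start.elf`. `start.elf` then reads configuration files such as `config.txt` and `cmdline.txt`, splits memory between the CPU and GPU, loads the OS, and hands control to the CPU.
* A driver is a piece of code that a kernel process runs when needed. Something like the I/O scheduler must share a queue, so where does that memory come from? Or is the whole kernel recompiled when a driver is installed, so that the space the driver needs is allocated together with the kernel's global variables when the kernel is loaded?
    * The I/O scheduler is a piece of kernel code. The I/O queue is a kernel data structure.
    * The I/O scheduler is not a daemon program; it is just code that gets invoked at the appropriate times.
    * The I/O scheduler is part of the kernel, not of the hard disk device driver. It is a piece of kernel code, and the I/O queue is a data structure (memory) allocated by the block device driver in the kernel.
    * How do you install a driver in Linux? Use `insmod`.
* On multi-core systems, does a driver have to be written with parallelism in mind? Or does the OS take care of it, using a mutex to lock whatever the driver is using? If the driver handles it itself, does it have to use system calls to lock its resources?
    * The driver has to handle it itself. The OS provides mechanisms for drivers to use, such as softirqs and tasklets. The OS also has to guard against badly written drivers breaking things.
    * Yes.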
      To improve I/O performance on multi-core systems, the driver should be carefully designed: select suitable device driver and I/O mechanisms provided by the kernel and write the code in a concurrent way.
    * tasklet, runqueue, softirq
* How is a device file represented in the file system, for example `/dev/null`?
    * Please refer to Linux sysfs for details.
    * sysfs (the system file system) is organized like ordinary files, but it does not exist on a real HD; it lives in memory.
    * https://www.kernel.org/doc/ols/2005/ols2005v1-pages-321-334.pdf
    * http://beagle.s3.amazonaws.com/esc/sysfs-esc-chicago-2010.pdf
    * Try typing `mount` to see which types of file systems are mounted on your Linux machine.
* In the figure, the file holds an inode and the dentry also holds an inode. Isn't that redundant? Looking at the kernel source, `file_struct` does not seem to store an inode either.
    * kernel source: https://elixir.bootlin.com/linux/v2.6.34/source/include/linux/fdtable.h#L43
* [Not sure] When opening a file at a path such as /A/B/C/.../target_file, does the system open A, B, C, ..., and target_file one by one and create a file descriptor for each? Or does it keep only target_file and discard everything about the path? From what I have read so far, each component is looked up in turn and placed in the dentry cache?
    * Each component is looked up in turn and placed in the dentry cache.
* It was mentioned that the C and D drives may use different file systems. Do we decide the file system type, or does the hardware manufacturer?
    * The user decides, when formatting.
* Can the VFS prevent a file system from delivering the advantages it was designed for?
    * ??? A file system may destroy the benefits of the storage hardware.
    * A paper discusses this: Breaking Apart the VFS for Managing File Systems
* What is TRIM on Linux, and why do SSDs need it? (solved)
    * SSDs read and write in units of pages, but erase in units of blocks.
    * The original way of modifying data causes write amplification.
    * Reference:
        * https://zhuanlan.zhihu.com/p/34683444
        * https://zh.wikipedia.org/wiki/Trim命令
* In a monolithic kernel, does the file system have to be inside the kernel? (solved)
    * With tools like Ext2Fsd, so I think there must be a way to access blocks directly through the system.
* The File Allocation Table in FAT seems to work a bit like the dentry. How are they related?
    * They are similar concepts.
    * They are used in different layers.
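
To make the FAT versus dentry comparison above more concrete, here is a minimal sketch assuming a toy File Allocation Table stored as a Python list. In FAT, the directory entry holds the file name and the file's first cluster, while the table itself links each cluster to the next one. `cluster_chain` and `END_OF_CHAIN` are illustrative names; real FAT variants reserve specific values for the end-of-chain marker.

```python
END_OF_CHAIN = -1  # hypothetical marker; real FAT variants reserve special values


def cluster_chain(fat, first_cluster):
    """Follow a file's cluster chain through a toy File Allocation Table.

    fat: list where fat[c] is the cluster that follows cluster c.
    first_cluster: the starting cluster, as stored in the directory entry.
    """
    chain = []
    cluster = first_cluster
    while cluster != END_OF_CHAIN:
        if cluster in chain:
            raise ValueError("corrupt FAT: cluster chain loops")
        chain.append(cluster)
        cluster = fat[cluster]
    return chain


# Clusters 2 -> 5 -> 3 belong to one file; clusters 4 -> 6 to another.
fat = [END_OF_CHAIN, END_OF_CHAIN, 5, END_OF_CHAIN, 6, 3, END_OF_CHAIN]
print(cluster_chain(fat, 2))
print(cluster_chain(fat, 4))
```

Seen this way, the table answers "which blocks hold this file's data", while a dentry answers "which inode does this name refer to", which is one way to read the remark that they are used in different layers.
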
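* I don't quite understand the EST mechanism. With a directory structure of root/level1/level2/, does it mean there is one pointer to root, one to level1, and one to level2?
* The I/O scheduler reorders I/O across different files. Can that cause starvation?
    * The I/O scheduler may only see blocks.
* In Linux, after a program creates a socket, how does `struct file` record that it is a socket? When reading from or writing to a socket, how does the kernel know which handler function to call? The figure in the lecture notes does not seem to show a structure that records function pointers.
    * [Solved] Looking at the code of `struct file`, there is a member called `f_op`, which records the function pointers for the various file read and write operations.
* When a file is opened but is not in memory, is the data fetched from disk and loaded into memory via a page fault?
    * ![](https://i.imgur.com/X6R7CMP.png)
    * This shows where to look in the code.
* So the FAT flow is: read the FAT table from a fixed disk sector into memory, then compute which disk sectors hold the file?
* How does an SSD fetch data? (solved)
* What are the main differences between HDDs and SSDs in the driver and in the file system in front of it?
    * Let’s have an open discussion here.
    * The instructor thinks this is a very good question, to be discussed when there is a chance.
* Regarding the statement that without a memory-mapped file the device first moves data to the kernel and the data is then copied from the kernel to the user: does "device to kernel" mean the data is placed in the cache first, and "kernel to user" mean it is then copied from the cache to the user's memory page?
    * Yes.
* With a memory-mapped file, the data is mapped onto the cache, so accesses do not need to copy it out of the cache, and therefore the memory mapping must ensure the cached data is not flushed. Is this understanding correct?
    * “Firstly, a system call is orders of magnitude slower than a simple change to a program's local memory. Secondly, in most operating systems the memory region mapped actually is the kernel's page cache (file cache), meaning that no copies need to be created in user space.”
* We often hear about mounting a file system in Linux. What actually happens?
    * Answer: [mount point definition](http://www.linfo.org/mount_point.html)
    * The purpose of mounting is to let the OS access data on a particular file system through its own file system tree. The new file system can be seen as a subtree, and a directory on the existing partition serves as the root of that subtree; this directory is called the mount point. Note that once a directory is mounted over, the files originally in it cannot be accessed until the file system on top is unmounted.
* [solved] If the buffer cache holds written data, does there need to be a mechanism to write it back to disk?
    * https://www.tldp.org/LDP/sag/html/buffer-cache.html explains that Linux runs sync every 30 seconds to flush the buffers, but if the machine is forcibly powered off or the disk is forcibly unplugged, some data never makes it back to disk.
* Do file systems implement block access differently for SSDs, HDDs, and optical discs?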
    * Yes. The instructor mentioned in class that there may be specialized file systems for specialized disks, for example NTFS is better suited to SSDs and FAT32 to HDDs, but file systems for SSDs still have room for improvement.
    * ref: https://www.reneelab.com/fat-ntfs-performance.html
* Why does Linux need filesystem in userspace? Is it to move toward a microkernel, or is there another purpose?
    * It allows users to develop their own file systems quickly and dynamically without crashing the kernel.
* How does the I/O scheduler decide its time slots? For example, when the elevator is moving up, how should its speed be set to avoid starvation or poor throughput?
    * The instructor explained this later when covering the Linux elevator algorithm.
* Can one file system span multiple disks? That is, in Windows, My Computer shows a single drive that is actually made up of several disks without the user knowing.
    * RAID or different mount points?
    * In my experience, you can use LVM in Linux to format a file system across multiple HDDs or SSDs.
* An fd list only covers 0-255, so can at most 256 files be open?
    * No, that's the default size, but there is an `rlimit`.
    * If your `rlimit` > 255, the fd list grows like a C++ vector.
    * https://rickhw.github.io/2018/02/24/Linux/File-Descriptor_and_Open-Files/
* Why can data sometimes not be copied from one file system to another?
    * NTFS belongs to Microsoft and is undocumented, so nobody dares to write to it carelessly; third-party implementations are all hacked together. Writing to it is at your own risk.
    * “NTFS is a Microsoft proprietary file system which is not fully documented nor formally licensed. There is no quality assurance program for third party providers of NTFS. All NTFS implementations that are not Microsoft's are reverse engineered. NTFS is subject to change at Microsoft's whim.
    * As a courtesy, Mac OS will read NTFS files, but will not write them. Writing files runs the risk of destroying the drive data, a risk that Apple wisely avoids. Reading is a non-destructive act, but writing could result in irrecoverable loss of data, especially if the NTFS specification shifts without warning.
    * NTFS is a pretty sophisticated file system with lots of places to go wrong. Apple allows third parties to shoulder that risk and install drivers for people who insist on modifying NTFS volumes attached to MacOS. I find the risk unacceptable, and prefer to use more standardized network volumes for file transfers between Windows and MacOS.”
* Reordering reads and writes in the I/O scheduler improves performance, but doesn't that cause problems, such as data dependencies?
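    * The I/O scheduler has to make sure that reordering causes no problems.
* Does the Noop I/O scheduler mean simply letting the disk handle scheduling? That seems better: the disk can schedule in the way that suits its own characteristics.
* Do different file systems really affect read/write performance?
    * ![](https://i.imgur.com/DYVGm0w.png)
* Linux file systems have a file hole feature that reduces the space taken by sparse files. Does this increase internal fragmentation on the block device and waste more space?
    * No, you can still store bios with different I/O block numbers.
* Does the I/O scheduler sit in front of the device driver? On devices that support random access at no extra cost, doesn't that result in unnecessary scheduling work?
* What is the system-wide open-file table for?
* When there are many file systems, how does the kernel know which file system to go through when an open arrives? Does it look at the superblock?
* [solved] Nesting directories repeatedly also consumes blocks, so in theory a large number of directories should eat up as much disk space as a large number of 1-byte files?
* Completely Fair Queuing gives each process a priority and tries to treat them fairly. Does that mean it ignores read latency needs and on-disk ordering, unlike the earlier methods? Is it literally completely fair, or is the priority designed so that read and sequential-access needs are still met to some extent?
* The instructor said in class that a block is usually not a single sector, which reduces index and superblock overhead. But if every file read is one block = several sectors anyway, why not make the sector larger to begin with, instead of making one block = multiple sectors?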
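
For the file hole question above, a small experiment can show what a sparse file looks like from user space. This is a sketch that assumes a Unix-like system where `os.stat` reports `st_blocks`; the helper name `make_sparse_file` is made up for illustration. It seeks past a region without writing to it and then compares the apparent size with the space actually allocated.

```python
import os
import tempfile


def make_sparse_file(path, hole_size):
    """Create a file that is a hole of hole_size bytes followed by one data byte."""
    with open(path, "wb") as f:
        f.seek(hole_size)   # skip ahead without writing; this leaves a hole
        f.write(b"x")
    st = os.stat(path)
    # st_blocks counts the 512-byte units actually allocated to the file.
    return st.st_size, st.st_blocks * 512


with tempfile.TemporaryDirectory() as tmp:
    size, allocated = make_sparse_file(os.path.join(tmp, "sparse.bin"), 16 * 1024 * 1024)
    print("apparent size:", size, "bytes; allocated:", allocated, "bytes")
```

On a file system that supports holes, the allocated size is typically far smaller than the apparent size; on one that does not, the two are close to each other.
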