
# NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories
----------
###### tags: `File System` `USENIX FAST'16` `PIM`
###### Authors: `Jian Xu`, `Steven Swanson`
###### Papers: [link(2016)](https://www.usenix.org/system/files/conference/fast16/fast16-papers-xu.pdf), [link(2017)](https://cseweb.ucsd.edu/~swanson/papers/SOSP2017-NOVAFortis.pdf)
###### Slides: [link(2016)](https://www.usenix.org/sites/default/files/conference/protected-files/fast16_slides_xu.pdf), [link(2017)](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&cad=rja&uact=8&ved=2ahUKEwjW__u-ztryAhXbwosBHfqOBpAQFnoECAgQAQ&url=https%3A%2F%2Fwww.sigops.org%2Fs%2Fconferences%2Fsosp%2F2017%2Fslides%2Ffortis-sosp17-slides.pdf&usg=AOvVaw3jZiXQBoc47rUa68qHAP9V)
###### Github: [link](https://github.com/NVSL/linux-nova)
###### [reference link](https://my.oschina.net/fileoptions/blog/1821955)
---------------

[TOC]

## Abstract
* File systems originally built for SSDs and hard disks incur significant software-layer overhead on a hybrid memory system that combines NVM and DRAM. With slow devices the hardware latency dominated, so the software overhead could be ignored; now that NVM is fast, it no longer can.
* The authors propose NOVA, a file system designed for hybrid NVM/DRAM memory systems. It uses an improved log-structured design: because random access on NVM is fast, every inode gets its own log, which supports concurrency.
* File data is kept outside the log, which shrinks the log and lowers the cost of garbage collection.
* NOVA's logs provide atomicity for metadata, data, and mmap updates, and focus on simplicity and reliability. Complex metadata structures are kept in DRAM to speed up lookup operations.

## Challenges for NVMM software
* Performance
    * access latency: software-layer latency used to be ignored because the hardware latency was so much larger
    * Because NVMM offers low latency and sits on the processor's memory bus, an NVMM file system needs to bypass the DRAM page cache and access data on NVM directly through Direct Access (DAX) or eXecute In Place (XIP)
    > To do: understand the flow by which a CPU data access goes through the DRAM page cache [name=johnnycck]
    * NOVA is a DAX file system
* Write reordering
    * Modern processors and their caches may reorder store instructions for performance.
    * For DRAM the CPU can guarantee consistency, but there is no such guarantee for NVMM, so a power failure can leave the data inconsistent.
    > What does the CPU do by design to guarantee consistency? [name=johnnycck]
    * Intel proposed new instructions (clflushopt, clwb, PCOMMIT) to guarantee consistency
* Atomicity
    * POSIX-style file systems
must guarantee atomicity for many operations
    * Some operations, such as rename, only need to modify metadata
    * But many atomic updates touch multiple pieces of data; appending to a file, for example, changes both the file data and the file metadata
    * NVMM only guarantees atomicity for 64-bit writes, which makes designing atomic operations a major challenge.

## Building complex atomic operations
* Current file systems mostly rely on one of the following three techniques to support atomic operations
    * Journaling
    * Shadow Paging
    * Log-structuring

## File system for NVMM
* BPFS
    * shadow paging
    * hardware mechanism to enforce store durability and ordering
    * certain operations incur large overhead (ex: move)
* PMFS
    * lightweight DAX file system
    * Journaling for metadata updates
    * data updates are not atomic
* Ext4-DAX
    * extends Ext4 with DAX capabilities to directly access NVMM
    * Journaling for metadata updates
    * data updates are not atomic
* SCMFS
    * uses virtual memory management to make file accesses simple and lightweight
    * no consistency guarantee for metadata and file data
* Aerie
    * implements the file system interface and functionality in user space to provide low-latency access to data in NVMM
    * journals metadata but does not support data atomicity or mmap operations

## NOVA Design Overview
* NOVA is a log-structured POSIX file system. It builds on LFS and exploits the strengths of a hybrid memory system, but it differs a lot from a traditional log-structured file system
* NOVA's design is based on three observations:
    * Logs make atomicity easy but searching hard; conversely, data structures that are good for searching (ex: tree structures) are hard to build on NVMM with atomic updates
    * Log cleaning is complex mainly because the implementation needs contiguous free space; since random access on NVMM is very fast, that is no longer necessary
    * For the same reason, multiple logs can be used, which also increases concurrency
* Logs in NVMM, indexes in DRAM
    * The log and the file data live in NVMM
    * The radix tree lives in DRAM to make searching fast; its leaves point to the log
* Each inode has its own log
    * File access and recovery can be processed in parallel
    * Each log is small, so scanning is fast
    > Why does the log have to be per-inode? For concurrency, wouldn't one log per CPU be enough? [name=David]
    > It makes GC faster: each GC pass only needs to scan one inode [name=承翰]
* Use logging and lightweight journaling for complex atomic updates
    * To make a write atomic, NOVA appends the data to the log and then atomically updates the log tail to commit the write. This avoids the duplicate writes of journaling and the cascading update overhead of copy-on-write
    * NOVA's journaling only handles log tails, which keeps it lightweight
* Implement the log as a
singly linked list
    * NOVA's logs use 4 KB NVMM pages chained together as a linked list
    * Three advantages
        * Allocating log space is simple, because NOVA does not need large contiguous regions
        * NOVA can clean the log in small units (one page at a time)
        * Reclaiming a log page is simple; it only requires moving pointers
* Do not log file data
    * The inode log contains no file data
    * NOVA uses COW for modified pages and appends metadata to the log; the metadata describes the update and points to the data pages
    * Using COW for file data has three advantages
        * A shorter log, which speeds up the recovery process
        * GC is simpler and more efficient, because NOVA never has to copy file data out of the log in order to reclaim a log page
        * Reclaiming old pages and allocating new pages are both simple; they only add pages to or remove pages from the free list in DRAM

## Implementing NOVA
### NVMM data structure and space management
* NOVA divides NVMM into four parts
    * superblock and recovery inode
    * inode tables
    * journals
    * log/data pages
* superblock
    * Contains global file system information
* recovery inode
    * Stores recovery information so that NOVA can remount quickly after a clean shutdown
* inode tables
    * Contain the inodes
* journals
    * Provide atomicity for directory operations
* The remaining area holds the NVMM log and data pages
* NOVA keeps a separate inode table, journal, and NVMM free page list for each CPU to avoid global locking and scalability bottlenecks

![](https://i.imgur.com/LzP26m8.jpg)

* Inode table
    * NOVA first initializes each inode table as a 2 MB array of inodes. Every inode is 128 bytes, so given an inode number NOVA can easily locate the corresponding inode.
    * `struct nova_inode { ...
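}`
    * New inodes are assigned round-robin (across CPUs, or within one CPU?) to the inode tables so that inodes stay evenly distributed. When an inode table is full, NOVA allocates another 2 MB sub-table and chains it with a linked list. To keep the inode tables small, each inode carries a bit that says whether it is valid, so NOVA can reuse invalid inodes for new files or directories.
    * An inode holds pointers to the head and the tail of its log. The log is a linked list of 4 KB pages, and the tail always points to the last committed log entry. The first time the system accesses an inode, NOVA walks every log entry from head to tail to rebuild the data structures in DRAM.
    * `nova_get_inode_address()`
* Journal
    * A NOVA journal is a 4 KB circular buffer, operated through a pair of <enqueue, dequeue> pointers.
    * The journal exists mainly to make operations on multiple inodes atomic. NOVA first appends the updates to each inode's own log, then opens a transaction: it writes the affected log tails into the current CPU's journal and advances the enqueue pointer. Once every inode has finished its update, NOVA sets the dequeue pointer equal to the enqueue pointer, which commits the transaction.
    * For create, NOVA journals the directory's log tail pointer and the new inode's valid bit
    * After a power failure, NOVA checks every journal and rolls back all updates between that journal's dequeue and enqueue pointers
    * Each core can have only one open journal transaction at a time, but different CPUs can run transactions in parallel
    * For directory operations the kernel's VFS locks the affected inodes, so concurrent transactions never touch the same inode
* NVMM space management
    * NOVA splits NVMM into one pool per CPU and keeps the free page lists in DRAM. If the current CPU's pool has no free pages, NOVA allocates from the largest pool.
    * In DRAM, NOVA stores the free pages in a red-black tree ordered by address. On a normal shutdown NOVA saves the allocator state to the recovery inode's log; after an abnormal shutdown it scans the logs of all inodes and rebuilds the state.
    > Find out which situations radix trees and red-black trees are each suited for [name=johnnycck]
    > Why is the free list a tree, why merge, and what about data contiguity? [name=David]
    > If a file needs many pages, a single allocation is enough; with a plain free list it would take several allocate operations [name=承翰]
    * Initially an inode's log has a single page. When more pages are needed NOVA doubles the log size, but once the log length passes a threshold it allocates only a fixed number of pages each time.

### Atomicity and enforcing write ordering
* NOVA uses log structuring and journaling to provide fast atomic updates for metadata, data, and mmap. It relies on three mechanisms
    * 64-bit atomic updates
        * For some operations (ex: a file's atime), NOVA uses 64-bit in-place writes to modify the metadata directly
        * Updating the inode's log tail pointer also uses a 64-bit atomic update
    * Logging
        * NOVA uses the inode's log to record operations that modify a single inode (ex: write, msync, chmod); each log is independent of the others
    *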
      Lightweight journaling
        * For directory operations that change multiple inodes (ex: create, unlink, rename), NOVA uses lightweight journaling to achieve atomicity
        * The most complex case, a POSIX rename, affects up to four inodes, and NOVA only needs to journal 16 bytes per inode: 8 for the log tail pointer and 8 for the value
* Enforcing write ordering
    * NOVA guarantees consistency through three ordering rules
        1. Write the data and log entries to NVMM before updating the log tail
        2. Commit the journal data to NVMM before propagating the updates
        3. Commit the new versions of the data pages to NVMM before reclaiming the old versions
    * If the platform supports clflushopt, clwb, and PCOMMIT, NOVA executes the instructions shown in Figure 2
    * ![](https://i.imgur.com/53ojpLk.jpg)
    * If the platform does not support these instructions, NOVA uses a combination of movntq, clflush, and sfence to enforce write ordering

### Directory operations
* NOVA optimizes the main directory operations, such as link, symlink, and rename
* A NOVA directory consists of two parts: the inode log stored in NVMM and a radix tree kept in DRAM.
* The directory's log contains directory entries (the usual dentries) and inode update entries. A dentry holds the name of the child file or subdirectory, its inode number, and a timestamp.
    * The timestamp is used to update the directory inode's ctime and mtime
* For file operations inside a directory, such as create, delete, and rename, NOVA appends a dentry to the log. For delete, NOVA sets the dentry's inode number to 0 to distinguish it from a create.
* NOVA appends inode update entries to the directory's log to record updates to the directory's inode (ex: chmod, chown)
* To speed up dentry lookups, NOVA keeps a radix tree in DRAM. The key is the dentry name, and the leaf nodes point to the corresponding dentry in the log.
* create and delete as examples:
    * create:
        1. NOVA initializes an unused inode in the inode table for zoo
        2. It appends a dentry for zoo to the directory's log
        3. It uses the CPU's journal to atomically update the log tail and set the valid bit of the new inode
        4. It adds the file to the directory's radix tree in DRAM
    * ![](https://i.imgur.com/uv40C2c.jpg)
    > Does the inode's offset contain a physical address? [name=David]
    > No, everything is expressed as an offset and laid out sequentially [name=承翰]
    * delete:
        * In Linux, deleting a file requires two updates:
            1. Decrement the link count of the file's inode
            2. Remove the file from the enclosing directory
        * In NOVA:
            1. Append a delete dentry log entry to the directory inode's log
            2. Append an inode update entry to the file inode's log
            3. Use journaling to atomically update both log tails
            4.
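               Update the directory's radix tree in DRAM

### Atomic file operations
* A file inode's log in NOVA contains two kinds of entries: inode update entries and write entries. A write entry describes the write operation and points to the actual data pages. If a single write is too large, NOVA uses multiple write entries, appends all of them to the log, and updates the log tail once at the end.
    * File write entries include the timestamp and the file size, so write operations update this file metadata atomically
    * The DRAM radix tree maps file offsets to file write entries
    * When a very large write is split into multiple write entries, all of them are committed by that single atomic update of the log tail pointer
* The figure below shows a file write example. A pair such as <0, 1> stands for <filepageoff, num pages>, that is, the page offset and the number of pages. For example, <0, 1> means the entry starts at page offset 0 and covers one page, or 4 KB. A <2, 2> write (rewriting 2 pages starting at offset 2) proceeds as follows:
    1. Write the data to new data pages using COW
    2. Append <2, 2> to the inode log
    3. Atomically update the log tail to commit the write
    4. Update the radix tree in DRAM so that offset 2 points to the new pages
    5. Put the two old pages back on the free list
* Note that although the log tail update itself is atomic, updating the log tail (?) together with the radix tree is not guaranteed to be atomic. In a discussion with Dr. Xu, he mentioned that a read-write lock is used here; he considers it not very efficient and is thinking about optimizing it later.

![](https://i.imgur.com/DU5iQYW.jpg)

* For a read operation, NOVA atomically updates the file inode's access time and uses the radix tree to locate the required pages and copy them from NVMM into the user buffer

### Atomic mmap
* DAX-mmap maps NVMM data directly into user space, bypassing the file system's page cache. This lowers the cost, but it is still a big challenge for programmers. As mentioned above, NVMM only supports 64-bit atomic operations, and the fence and flush instructions depend on the processor, so building practical non-volatile data structures on top of these primitive mechanisms is very hard.
* To solve this, NOVA provides an atomic-mmap mechanism. It uses replica pages: the data the user wants to modify is placed in replica pages, and the user operates on those pages first. When the user calls msync, NOVA treats the modifications as a write operation: it uses movntq to copy the contents of the replica pages to data pages and then commits them atomically.
* This approach guarantees data consistency, but it is less efficient than DAX-mmap. NOVA does not currently support conventional DRAM mmap, although this may be planned for the future.
> Take a closer look at DAX. Doesn't msync then do the same thing as write?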
[name=David]

### Garbage collection
* For GC, NOVA handles stale data pages and stale log entries separately. For data, as soon as a write operation produces stale data pages (as in the example above), NOVA reclaims them immediately.
* Handling the inode log is more complex. A log entry is dead if it is not the last entry in the log and it satisfies any one of the following conditions:
    * A file write entry that no longer points to any valid data page
    * An entry that updates inode metadata and is followed by a newer entry updating the same metadata
    * An entry containing a dentry update that has been marked invalid
* GC reclaims dead entries in two modes
    * Fast GC
        * Emphasizes speed and requires no copying
        * If all entries in a log page are dead, Fast GC removes the page from the log's linked list
        * Figure 5(a) shows fast log GC
    * Thorough GC
        * If the live entries occupy less than 50% of the log space, NOVA runs Thorough GC after Fast GC finishes
        * It copies the live entries to a new log, points the inode to the new log, and reclaims the old log.
        * Does not perform ... on the log tail

![](https://i.imgur.com/UeibFBp.jpg)

> When does GC run? [name=David]
> Fast GC runs when the log space is extended
> Thorough (slow) GC runs after Fast GC finishes, when the live entries occupy less than 50% of the log space [name=johnnycck]

### Shutdown and Recovery
* After the system shuts down and restarts, NOVA is mounted again and has to rebuild its in-DRAM data structures. It does this lazily: the radix tree of an inode is not rebuilt until the system accesses that inode, which saves DRAM space and speeds up recovery.
* There are two recovery modes:
    * Recovery after a normal shutdown
        * On a normal shutdown NOVA saves the page allocation state to the recovery inode's log, so remounting only needs to restore it from that log.
        * In this mode NOVA can remount a 50 GB file system in 1.2 milliseconds
    * Recovery after a failure
        * After an abnormal shutdown (such as a power loss), NOVA must rebuild the NVMM allocator information, which requires scanning all inode logs. This is still fast, for two reasons
            * The per-CPU inode tables and the per-inode logs can be recovered in parallel.
            * The logs contain no data, so they are short
        * During recovery NOVA has to:
            1. Check the journals and roll back every transaction that has not been committed, to guarantee consistency
            2.
               Start one recovery thread per CPU and scan the inode tables concurrently, scanning the log of every valid inode. For a directory inode, NOVA only needs to walk the pages the log occupies and does not read the log's contents. For a file inode, NOVA has to read the write entries and enumerate all of the data pages.
        * During recovery NOVA builds a bitmap of the occupied pages and rebuilds the allocator from this bitmap

### NVMM Protection
* When NOVA is mounted, the kernel maps the NVMM into its address space. NOVA must make sure that it is the only software that can access the NVMM, to protect it against erroneous writes from the rest of the kernel
* NOVA uses the same protection mechanism as PMFS: at mount time the entire NVMM region is treated as read-only
* When NOVA needs to write to NVMM pages, it opens a write window by disabling the processor's write protect control (CR0.WP). While CR0.WP is clear, kernel software running in ring 0 can write to pages marked read-only
* When the NVMM write completes, NOVA resets CR0.WP to close the write window
* CR0.WP is not saved across interrupts, so NOVA disables local interrupts while the write window is open.
* Opening and closing the write window does not modify the page tables or the TLB, so the cost is low

### Notes from tracing the actual code
#### Mount
```text
/* Documentation/filesystem/nova.txt */
Nova support several module command line options:

 * metadata_csum: Enable metadata replication and checksums (default 0)

 * data_csum: Compute checksums on file data. (default: 0)

 * data_parity: Compute parity for file data. (default: 0)

 * inplace_data_updates: Update data in place rather than with COW (default: 0)

 * wprotect: Make PMEM unwritable and then use CR0.WP to enable writes as
   needed (default: 0). You must also install the nd_pmem module as with
   wprotect =1 (e.g., modprobe nd_pmem readonly=1).

For instance to enable all Nova's data protection features:

# modprobe nova metadata_csum=1\
              data_csum=1\
              data_parity=1\
              wprotect=1
```
#### Relax Mode
* At mount time you can choose to do in-place data writes, which turns off the default COW
> Implements DAX read/write functions to access file data. NOVA uses copy-on-write to modify file pages by default, unless inplace data update is enabled at mount-time. There are also functions to update and verify the file data integrity information.
>
> -- Documentation/filesystem/nova.txt

```c
/* fs/nova/super.c (call chain) */
nova_mount()
    nova_fill_super()
        nova_parse_options()
            set_opt(sbi->s_mount_opt, DATA_COW);
```
* The inode setup code checks DATA_COW to decide which file operations to install (dax or wrap)
    * Judging from the two tables below, they mainly differ in `.read_iter` and `.write_iter` (?), plus `.mmap_supported_flags` versus `.get_unmapped_area`
```c
/* fs/nova/inode.c */
case TYPE_CREATE:
    inode->i_op = &nova_file_inode_operations;
    inode->i_mapping->a_ops = &nova_aops_dax;
    if (!test_opt(inode->i_sb, DATA_COW) && wprotect == 0)
        inode->i_fop = &nova_dax_file_operations;
    else
        inode->i_fop = &nova_wrap_file_operations;
    break;
```
#### File operation
* dax
```c
/* fs/nova/file.c */
const struct file_operations nova_dax_file_operations = {
    .llseek         = nova_llseek,
    .read           = nova_dax_file_read,
    .write          = nova_dax_file_write,
    .read_iter      = nova_dax_read_iter,
    .write_iter     = nova_dax_write_iter,
    .mmap           = nova_dax_file_mmap,
    .mmap_supported_flags = MAP_SYNC,
    .open           = nova_open,
    .fsync          = nova_fsync,
    .flush          = nova_flush,
    .unlocked_ioctl = nova_ioctl,
    .fallocate      = nova_fallocate,
#ifdef CONFIG_COMPAT
    .compat_ioctl   = nova_compat_ioctl,
#endif
};
```
* wrap
```c
/* fs/nova/file.c */
/* Wrap read/write_iter for DP, CoW and WP */
const struct file_operations nova_wrap_file_operations = {
    .llseek         = nova_llseek,
    .read           = nova_dax_file_read,
    .write          = nova_dax_file_write,
    .read_iter      = nova_wrap_rw_iter,
    .write_iter     = nova_wrap_rw_iter,
    .mmap           = nova_dax_file_mmap,
    .get_unmapped_area = thp_get_unmapped_area,
    .open           = nova_open,
    .fsync          = nova_fsync,
    .flush          = nova_flush,
    .unlocked_ioctl = nova_ioctl,
    .fallocate      = nova_fallocate,
#ifdef CONFIG_COMPAT
    .compat_ioctl   = nova_compat_ioctl,
#endif
};
```
* When a write actually happens, DATA_COW is checked to pick the write path
```c
/* fs/nova/file.c */
static ssize_t nova_dax_file_write(struct file *filp, const char __user *buf,
    size_t len, loff_t *ppos)
{
    struct address_space *mapping = filp->f_mapping;
    struct inode *inode = mapping->host;

    if (test_opt(inode->i_sb, DATA_COW))
        return nova_cow_file_write(filp, buf, len, ppos); /* COW */
    else
        return nova_inplace_file_write(filp, buf, len,
            ppos); /* in-place */
}
```
#### Metadata Protection
* metadata_csum can be set to 1 at mount time (default 0) to turn on metadata protection
* metadata_csum = 1
    * When metadata is updated
        1. Copy the data to the primary (tick) and add a CRC32 checksum, then issue a persist barrier to make sure the data has reached NVMM
        2. Do the same thing for the replica (tock)
* `nova_check_inode_integrity`
```c
/* fs/nova/checksum.c */
/*
 * Check nova_inode and get a copy in DRAM.
 * If we are going to update (write) the inode, we don't need to check the
 * alter inode if the major inode checks ok. If we are going to read or rebuild
 * the inode, also check the alter even if the major inode checks ok.
 */
int nova_check_inode_integrity(struct super_block *sb, u64 ino, u64 pi_addr,
    u64 alter_pi_addr, struct nova_inode *pic, int check_replica)
{
    struct nova_inode *pi, *alter_pi, alter_copy, *alter_pic;
    int inode_bad, alter_bad;
    int ret;

    pi = (struct nova_inode *)nova_get_block(sb, pi_addr);

    ret = memcpy_mcsafe(pic, pi, sizeof(struct nova_inode));

    if (metadata_csum == 0)
        return ret;

    alter_pi = (struct nova_inode *)nova_get_block(sb, alter_pi_addr);

    if (ret < 0) { /* media error */
        ret = nova_repair_inode_pr(sb, pi, alter_pi);
        if (ret < 0)
            goto fail;
        /* try again */
        ret = memcpy_mcsafe(pic, pi, sizeof(struct nova_inode));
        if (ret < 0)
            goto fail;
    }

    inode_bad = nova_check_inode_checksum(pic);

    if (!inode_bad && !check_replica)
        return 0;

    alter_pic = &alter_copy;
    ret = memcpy_mcsafe(alter_pic, alter_pi, sizeof(struct nova_inode));
    if (ret < 0) { /* media error */
        if (inode_bad)
            goto fail;
        ret = nova_repair_inode_pr(sb, alter_pi, pi);
        if (ret < 0)
            goto fail;
        /* try again */
        ret = memcpy_mcsafe(alter_pic, alter_pi,
            sizeof(struct nova_inode));
        if (ret < 0)
            goto fail;
    }

    alter_bad = nova_check_inode_checksum(alter_pic);

    if (inode_bad && alter_bad) {
        nova_err(sb, "%s: both inode and its replica fail checksum verification\n",
            __func__);
        goto fail;
    } else if (inode_bad) {
        nova_dbg("%s: inode %llu checksum error, trying to repair using the replica\n",
            __func__, ino);
        nova_print_inode(pi);
        nova_print_inode(alter_pi);
        ret = nova_repair_inode(sb, pi, alter_pic);
        if (ret != 0)
            goto fail;

        memcpy(pic, alter_pic, sizeof(struct nova_inode));
    } else if (alter_bad) {
        nova_dbg("%s: inode replica %llu checksum error, trying to repair using the primary\n",
            __func__, ino);
        ret = nova_repair_inode(sb, alter_pi, pic);
        if (ret != 0)
            goto fail;
    } else if (memcmp(pic, alter_pic, sizeof(struct nova_inode))) {
        nova_dbg("%s: inode replica %llu is stale, trying to repair using the primary\n",
            __func__, ino);
        ret = nova_repair_inode(sb, alter_pi, pic);
        if (ret != 0)
            goto fail;
    }

    return 0;

fail:
    nova_err(sb, "%s: unable to repair inode errors\n", __func__);

    return -EIO;
}
```
* On access, `memcpy_mcsafe()` copies the primary and the replica into DRAM buffers, which detects media errors first. If none occurred, the checksums of both copies are verified
    1. If one copy has a media error or a checksum mismatch, the other copy is copied over it
    2. If both copies are valid but their contents differ, the primary (which holds the most recent update) is copied to the replica
    3. If both copies are corrupt, NOVA-Fortis returns an error.
> NOVA seems to compute crc32 directly in assembly code, but a crc32 computation is nothing more than bitwise (XOR) and arithmetic operations, so the key overhead should still be memory movement -- Johnny
```c
/* fs/nova/checksum.c */
/* Verify the log entry checksum and get a copy in DRAM. */
bool nova_verify_entry_csum(struct super_block *sb, void *entry, void *entryc)
{
    /* use memcpy_mcsafe() to detect media error */
    nova_get_entry_copy();

    /* no media errors, now verify the checksums */
    entry_csum = le32_to_cpu(entry_csum);
    alter_csum = le32_to_cpu(alter_csum);
    entry_csum_calc = nova_calc_entry_csum(entry_copy);
    alter_csum_calc = nova_calc_entry_csum(alter_copy);

    COMPARE(entry_csum & entry_csum_calc); /* PSEUDO CODE */
    COMPARE(alter_csum & alter_csum_calc); /* PSEUDO CODE */

    /* different handling of COMPARE result */
}

/* Calculate the entry checksum. */
static u32 nova_calc_entry_csum(void *entry)
{
    ...
    ...
    if (entry_len > 0) {
        check_len = ((u8 *) csum_addr) - ((u8 *) entry);
        csum = nova_crc32c(NOVA_INIT_CSUM, entry, check_len);
        check_len = entry_len - (check_len + NOVA_META_CSUM_LEN);
        if (check_len > 0) {
            remain = ((u8 *) csum_addr) + NOVA_META_CSUM_LEN;
            csum = nova_crc32c(csum, remain, check_len);
        }

        if (check_len < 0) {
            nova_dbg("%s: checksum run-length error %ld < 0",
                __func__, check_len);
        }
    }
    ...
    ...
}
```
#### File Data Protection
* Each 4 KB file page is a stripe, and it is divided into `PR-sized` (or larger) stripe segments (strips), where PR is the poison radius
* A parity strip is stored for each file page, and checksums are stored for both the data strips and the parity strip
* read
    * Copy the strip of data into DRAM with `memcpy_mcsafe` and compute its checksum
    * If the computed checksum matches either of the two stored checksums, the data is considered correct, and the stored checksum that did not match is updated
    * If the computed checksum matches neither of the two stored checksums, or a media error occurred, the strip is recovered from the parity and then compared against the checksum to confirm that the recovery succeeded. If more than one strip is corrupted, the page cannot be recovered and the read fails
```c
/* fs/nova/file.c */
do_dax_mapping_read();

/* fs/nova/checksum.c */
/* Verify checksums of requested data bytes starting from offset of blocknr.
 *
 * Only a whole stripe can be checksum verified.
 *
 * blocknr: container blocknr for the first stripe to be verified
 * offset:  byte offset within the block associated with blocknr
 * bytes:   number of contiguous bytes to be verified starting from offset
 *
 * return: true or false
 */
bool nova_verify_data_csum(struct super_block *sb,
    struct nova_inode_info_header *sih, unsigned long blocknr,
    size_t offset, size_t bytes)
{
    /* similar to the inode comparison above */
}
```
* write
    * Use COW to make the update atomic, and compute the checksum and parity
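
The strip checksum and parity scheme above can be sketched in plain user-space C. This is only an illustration under several assumptions, not NOVA-Fortis code: the 512-byte `STRIP_SIZE`, the `page_protection` struct, and the functions `crc32c`, `protect_page`, and `verify_and_repair_page` are all hypothetical names. It keeps a single checksum per strip instead of two stored copies, uses a bitwise software CRC32C, and leaves out the `memcpy_mcsafe()` media-error handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FILE_PAGE_SIZE 4096u                          /* one stripe = one file page */
#define STRIP_SIZE     512u                           /* assumed strip size */
#define NUM_STRIPS     (FILE_PAGE_SIZE / STRIP_SIZE)  /* 8 data strips per page */

/* Hypothetical per-page protection record (single checksum copy per strip). */
struct page_protection {
    uint32_t csum[NUM_STRIPS];    /* checksum of each data strip */
    uint8_t  parity[STRIP_SIZE];  /* XOR of all data strips */
    uint32_t parity_csum;         /* checksum of the parity strip */
};

/* Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78. */
uint32_t crc32c(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Write path: compute per-strip checksums and the XOR parity strip. */
void protect_page(const uint8_t *page, struct page_protection *pp)
{
    memset(pp->parity, 0, STRIP_SIZE);
    for (size_t s = 0; s < NUM_STRIPS; s++) {
        const uint8_t *strip = page + s * STRIP_SIZE;

        pp->csum[s] = crc32c(strip, STRIP_SIZE);
        for (size_t i = 0; i < STRIP_SIZE; i++)
            pp->parity[i] ^= strip[i];
    }
    pp->parity_csum = crc32c(pp->parity, STRIP_SIZE);
}

/*
 * Read path: verify every strip. A single bad strip is rebuilt from the
 * parity and the remaining strips. Returns 0 if the page is intact or was
 * repaired, -1 if it cannot be recovered.
 */
int verify_and_repair_page(uint8_t *page, const struct page_protection *pp)
{
    int bad = -1;
    uint8_t rebuilt[STRIP_SIZE];

    for (size_t s = 0; s < NUM_STRIPS; s++) {
        if (crc32c(page + s * STRIP_SIZE, STRIP_SIZE) != pp->csum[s]) {
            if (bad >= 0)
                return -1;        /* more than one corrupted strip */
            bad = (int)s;
        }
    }
    if (bad < 0)
        return 0;                 /* every strip matched its checksum */

    if (crc32c(pp->parity, STRIP_SIZE) != pp->parity_csum)
        return -1;                /* the parity strip itself is damaged */

    /* rebuilt = parity XOR (all strips except the bad one) */
    memcpy(rebuilt, pp->parity, STRIP_SIZE);
    for (size_t s = 0; s < NUM_STRIPS; s++) {
        if (s == (size_t)bad)
            continue;
        for (size_t i = 0; i < STRIP_SIZE; i++)
            rebuilt[i] ^= page[s * STRIP_SIZE + i];
    }

    /* confirm the recovery against the stored checksum */
    if (crc32c(rebuilt, STRIP_SIZE) != pp->csum[bad])
        return -1;

    memcpy(page + (size_t)bad * STRIP_SIZE, rebuilt, STRIP_SIZE);
    return 0;
}
```

In this sketch a writer would call `protect_page()` on the newly written COW page, and a reader would call `verify_and_repair_page()` before copying data to the user buffer. Flipping bytes in one strip is repaired in place, while damage to two strips makes the function return -1, which mirrors the failed read described above.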