# The Page Cache and Page Writeback (the disk cache)

The Linux kernel implements a disk cache called the page cache. Its goal is to minimize disk I/O time. Two ideas are involved: temporal locality, and caching itself.

Key points:

* Temporal locality
    * Data accessed once is likely to be accessed again soon, which is why cache hits matter.
* When dirty pages must be written back to the disk.

The difference between a page and a page frame:

> Short version: "page" means "virtual page" (i.e. a chunk of virtual address space) and "page frame" means "physical page" (i.e. a chunk of physical memory). That's it, pretty much. It's important to keep the two concepts distinct because at any given time, a page may not be backed by a page frame (it could be a zero-fill page which hasn't been accessed, or paged out to secondary memory), and a page frame may back multiple pages (sometimes in different address spaces, e.g. shared memory or memory-mapped files).

# Approaches to Caching

The page cache's size is dynamic. It can grow (consuming memory but improving performance) or shrink (releasing memory and relieving memory pressure). The storage device being cached is called the backing store (the disk).

Scenario: when a process issues a read() system call, the kernel first checks whether the requested data is in the page cache. If it is (a cache hit), the data is served straight from RAM (the cache lives in RAM). If not (a cache miss), block I/O is scheduled to fetch the data from disk; once the data arrives, the cache is updated so that subsequent accesses go through the cache.

# Write Caching

What happens when a process write()s to disk? Caches take one of three strategies:

1. No-write
    * The cache simply does not cache write operations.
    * A write operation against a piece of data stored in the cache is written directly to disk,
    * invalidating the cached data and requiring it to be read from disk again on any subsequent read.
    * Rarely used: invalidating the cache is costly, and write operations are not cached at all.
2. A write operation updates both the in-memory cache and the on-disk file.
    * Called a write-through cache,
    * because the write goes through the cache to the disk.
    * The benefit is that the cache stays coherent,
    * with no need to invalidate it.
3. Write-back (everything is handled page by page; Linux's mechanism, the flusher threads, is discussed later)
    * Processes write directly into the cache.
    * The disk is not updated immediately.
    * Written pages are marked dirty
    * and added to a dirty list.
    * Periodically, pages on the dirty list are written back to disk;
    * this is the writeback.
    * Pages that have been written back are marked clean again.
    * "Dirty" is a misleading term;
    * it simply means "unsynchronized pages".
    * This is the best strategy.

# Cache Eviction

Deciding which page to kick out of the cache.

LRU: familiar from the dinosaur book (the OS course).

The Two-List Strategy (an LRU variant)

* Active list
    * Hot pages.
* Inactive list
    * Candidates for eviction.

LRU implementation:

![](https://i.imgur.com/fTOW5us.png)

A concrete example:

![](https://i.imgur.com/iGxQ64v.png)

![](https://i.imgur.com/ty52fG8.png)

# The Linux Page Cache Structure

# The address_space Object

A page in the page cache may comprise several noncontiguous physical disk blocks (a single file can be scattered all over the disk). This makes it hard to check whether a given piece of data is already cached, and indexing the data by device name and block number alone is impossible. Therefore, the Linux page cache uses a new object to manage entries in the cache and page I/O operations:

![](https://i.imgur.com/t01fEHR.png)

* i_mmap
    * A priority search tree of all shared and private mappings in this address space.
    * A priority search tree is a clever mix of heaps and radix trees.
    * One cached file corresponds to one address_space struct,
    * which can contain many VMAs.
    * One physical page can map to multiple virtual pages.
    * i_mmap lets the kernel quickly find all the mappings of a cached file.

# address_space Operations

The a_ops field points to the address space operations table:

![](https://i.imgur.com/hDUY500.png)

These function pointers implement the page I/O operations for the cached object. For example, ext3's table is defined in fs/ext3/inode.c (sketched below).

* Reading pages into the cache:
    * readpage()
* Writing updated data in the cache back to disk:
    * writepage()

The textbook explains both paths in detail; see the book.
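To make the a_ops table concrete, here is a minimal sketch of how a filesystem might fill one in, loosely modeled on ext3's table in fs/ext3/inode.c. The "myfs" name and its functions are hypothetical placeholders; only the struct and field names come from the (2.6-era) kernel.

```c
#include <linux/fs.h>        /* struct address_space_operations */
#include <linux/writeback.h> /* struct writeback_control */

/* Hypothetical page I/O routines for an imaginary "myfs" filesystem
 * (bodies omitted; a real filesystem implements them, as ext3 does
 * in fs/ext3/inode.c). */
static int myfs_readpage(struct file *file, struct page *page);
static int myfs_writepage(struct page *page, struct writeback_control *wbc);

/*
 * Every cached file's address_space points to a table like this via
 * its a_ops field; the kernel calls through it for all page I/O.
 */
static const struct address_space_operations myfs_aops = {
	.readpage  = myfs_readpage,  /* read a page in from the backing store */
	.writepage = myfs_writepage, /* write a dirty page back to disk */
};
```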
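To tie this back to the read() scenario from "Approaches to Caching", here is a simplified sketch of the hit/miss logic, assuming a 2.6-era kernel. The helper cache_read_page() is hypothetical; find_get_page(), page_cache_alloc_cold(), and add_to_page_cache_lru() are real kernel functions, but the actual path (do_generic_file_read()) adds locking, readahead, and error handling omitted here. The lookup inside find_get_page() is the radix tree search described next.

```c
#include <linux/fs.h>
#include <linux/pagemap.h>

/* Hypothetical helper: return the cached page at the given file offset,
 * reading it from disk first if necessary. */
static struct page *cache_read_page(struct address_space *mapping,
				    struct file *file, pgoff_t index)
{
	struct page *page;

	page = find_get_page(mapping, index);   /* search the page cache */
	if (page)
		return page;                    /* cache hit: serve from RAM */

	/* Cache miss: allocate a page frame and insert it into the cache. */
	page = page_cache_alloc_cold(mapping);
	if (!page)
		return NULL;
	if (add_to_page_cache_lru(page, mapping, index, GFP_KERNEL)) {
		page_cache_release(page);       /* insertion failed */
		return NULL;
	}

	/* Schedule block I/O to fill the page from the backing store. */
	mapping->a_ops->readpage(file, page);
	return page;
}
```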
Radix Tree

![](https://i.imgur.com/INLGow6.png)

Before 2.6 the page cache used a hash table rather than a radix tree, which had several drawbacks:

![](https://i.imgur.com/d0Q2ATC.png)

# The Flusher Threads

Writes need special handling in the page cache. When data in the page cache is newer than the data on disk, we call that data dirty; it must be synchronized, so the kernel looks for opportunities to write it back to disk. Dirty page writeback occurs in three situations:

* When the system as a whole runs low on memory (there is a watermark-like threshold). To reclaim memory, the kernel must evict pages from the page cache, but dirty pages cannot simply be dropped: they must first be written back to disk, and only once they are clean can they be evicted.
    * First, the flusher threads need to flush dirty data back to disk when the amount of free memory in the system shrinks below a specified level. The goal of this background writeback is to regain memory consumed by dirty pages when available physical memory is low. The memory level at which this process begins is configured by the dirty_background_ratio sysctl. When free memory drops below this threshold, the kernel invokes the wakeup_flusher_threads() call to wake up one or more flusher threads and have them run the bdi_writeback_all() function to begin writeback of dirty pages.
* When dirty data sits unprocessed for too long, the system periodically cleans it up.
    * A flusher thread periodically wakes up.
    * This guards against crashes: if the system suddenly dies, the dirty pages evaporate, so they need to be flushed periodically.
    * At system boot, a timer is initialized to wake up a flusher thread, which runs the wb_writeback() function.
* When a user process invokes sync() or fsync() itself (a user-space example appears at the end of these notes).

> The kernel uses flusher threads to monitor and handle all three of these situations.

![](https://i.imgur.com/hjvPLy0.png)

# Avoiding Congestion with Multiple Threads

The old approach used a single thread called bdflush to handle flushing. Under heavy page writeback, a single thread can block on one congested device queue (the structure that feeds I/O requests to a disk) while the other device queues sit comparatively idle. With multiple disks in the system it gets even worse. And since disk throughput is terrible compared to RAM, the kernel cannot afford to have a single thread stuck waiting on one disk's I/O; multiple threads are needed so the other disks can get moving too.

pdflush allowed the number of threads to be decided dynamically, and each thread tried to stay busy grabbing data from the per-superblock dirty list and writing it back to disk.

What if every pdflush thread had to write to the same congested queue? In that case multithreading is no better than a single thread, so the pdflush strategy was to write back the queues that are not congested and, as far as possible, avoid joining a long line.
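For the third trigger, writeback forced from user space, here is a minimal runnable example: write() only dirties pages in the page cache, while fsync() blocks until that file's dirty pages have reached the disk (the filename demo.txt is just a placeholder).

```c
/* Force writeback of one file's dirty pages with fsync(2). */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char msg[] = "hello, page cache\n";
	int fd = open("demo.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) { perror("open"); return EXIT_FAILURE; }

	/* write() returns once the data is in the page cache (pages dirty)... */
	if (write(fd, msg, strlen(msg)) < 0) { perror("write"); return EXIT_FAILURE; }

	/* ...fsync() blocks until those dirty pages are written to disk. */
	if (fsync(fd) < 0) { perror("fsync"); return EXIT_FAILURE; }

	close(fd);
	return 0;
}
```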
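The thresholds behind the first two triggers (memory too low, dirty data too old) are exposed as standard sysctls under /proc/sys/vm. A small sketch that prints them; the paths are standard Linux sysctls, though the exact set available varies by kernel version.

```c
/* Print the vm writeback sysctls that govern the flusher threads. */
#include <stdio.h>

int main(void)
{
	const char *knobs[] = {
		"/proc/sys/vm/dirty_background_ratio",    /* background writeback starts */
		"/proc/sys/vm/dirty_ratio",               /* writers forced to flush */
		"/proc/sys/vm/dirty_expire_centisecs",    /* how old "too old" is */
		"/proc/sys/vm/dirty_writeback_centisecs", /* periodic wakeup interval */
	};
	char buf[64];

	for (size_t i = 0; i < sizeof(knobs) / sizeof(knobs[0]); i++) {
		FILE *f = fopen(knobs[i], "r");
		if (!f)
			continue;	/* knob absent on this kernel */
		if (fgets(buf, sizeof(buf), f))
			printf("%-42s %s", knobs[i], buf);
		fclose(f);
	}
	return 0;
}
```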