A Wear-leveling-aware Fine-grained Allocator for Non-volatile Memory
===

###### Xianzhang Chen †, Qingfeng Zhuge §, Qiang Sun †, Edwin H.-M. Sha §, Shouzhen Gu §, Chaoshu Yang †

###### tags: `DAC`

---

### Abstract

* ##### Problem
  ###### NVMs are vulnerable to frequent fine-grained updates because of the low endurance of NVM cells
  ###### -> *Most existing NVM allocators ignore the wear-leveling problem of NVM cells, especially within a page*
* ##### Proposed method
  ###### Wear-leveling Aware Fine-grained Allocator (WAFA). WAFA divides pages into basic memory units to support fine-grained updates

---

### Background and Motivation

##### **CPS (cyber-physical system) applications show two critical characteristics in data requests**

* ###### The new data in updating requests are generally fine-grained, i.e., their sizes are smaller than a common memory page (4KB).
  ###### -> It is necessary to provide fine-grained space management for storing small data
* ###### The data stores of these applications constantly write new data to free space rather than updating data in place.
  ###### -> The space allocator needs to allocate less-worn space for new data requests.

---

##### **Existing fine-grained memory allocators can be classified into two types according to their design objectives**

* ###### One type is designed for managing conventional DRAM.
    * ###### Examples: glibc malloc, jemalloc, and Hoard
    * ###### They have no wear considerations and are thus unsuitable for NVM management
* ###### The other type is designed for managing the upcoming NVM.
    * ###### Most existing NVM allocators, such as nvm_alloc and Makalu, do not consider wear-leveling for fine-grained updates

---

* ###### WAlloc uses a Less Allocated First Out (LAFO) strategy
    * ###### WAlloc needs to compute the average write counts of memory areas and adjust the priority of the areas according to their write counts
  ###### -> Overhead
* ###### NVMalloc sets a period called "don't-allocate time" for each memory page.
    * ###### A memory page is allowed to be written at most twice in a period, but the wear-leveling of sub-pages is ignored
  ###### -> NVMalloc may heavily wear a page by repeatedly using a minority of its sub-pages

---

### WAFA

##### **How WAFA works - 1**

* ###### <font color="#f00">WAFA divides the memory pages into multiple units of the minimum allocation size</font>
* ###### WAFA groups the pages into multiple buckets according to the maximum consecutive free units in each page
* ###### WAFA allocates the units using the *Clockwise Best-Fit (CBF)* strategy
* ###### Free and reforming operations

---

###### ![](https://i.imgur.com/9CuPyHp.jpg =50%x)

* ###### The minimum allocation size equals the size of a last-level cache line (64B); smaller sub-pages would cause write amplification by disturbing the bits of other sub-pages
* ###### Once an NVM page is divided, the information about the units of the page is persisted in the last unit for failure recovery; this unit is called the "page metadata"

---

### Terms

Scale : a unit in a divided page; the units are arranged like the scales of a clock

Run : WAFA allocates the best-fit units in a clockwise order; a cycle of this allocation process is called a run

Segment : the consecutive free units in the rest of the run

----

### Fields in the page metadata

$C_{FU}$ : one byte, the total count of *free units* in the page

$S_{FF}$ : one byte, records the current "scale" (the ID of the first free unit in the rest of the run)

$S_{MS}$ : one byte, stores the size of the maximum segment

###### A C sketch of this layout follows on the next slide.
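----

### Page metadata: a C sketch

###### A minimal sketch of the metadata layout described above, not code from the paper: the 4KB page, 64B unit, 8-byte bitmap, three 1-byte fields, and two 8-byte Pre/Next bucket pointers (introduced on a later slide) all come from the slides, while the struct and field names (`page_meta_t`, `c_fu`, ...) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096                         /* common memory page           */
#define UNIT_SIZE 64                           /* last-level cache line        */
#define UNITS_PER_PAGE (PAGE_SIZE / UNIT_SIZE) /* 64 units; the last one holds
                                                  the page metadata            */

/* Hypothetical layout of the page metadata persisted in the last unit of a
 * divided page (and cached in DRAM at runtime). The on-NVM layout would be
 * packed; sizeof(page_meta_t) may be larger here due to struct padding. */
typedef struct page_meta {
    uint64_t bitmap; /* one bit per unit: 1 = allocated, 0 = free              */
    uint8_t  c_fu;   /* C_FU: total count of free units in the page            */
    uint8_t  s_ff;   /* S_FF: ID ("scale") of the first free unit in the
                        rest of the run                                        */
    uint8_t  s_ms;   /* S_MS: size of the maximum segment                      */
    uint64_t pre;    /* Pre:  previous page in the same bucket (8 bytes)       */
    uint64_t next;   /* Next: next page in the same bucket (8 bytes)           */
} page_meta_t;

int main(void) {
    /* 8 + 1 + 1 + 1 + 8 + 8 = 27B of state per 4KB page, matching the
     * per-page DRAM metadata-cache size quoted later (27B / 4096B = 0.66%). */
    assert(sizeof(uint64_t) * 3 + sizeof(uint8_t) * 3 == 27);
    return 0;
}
```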
---

### WAFA

##### **How WAFA works - 2**

* ###### WAFA divides the memory pages into multiple units of the minimum allocation size
* ###### <font color="#f00">WAFA groups the pages into multiple buckets according to the maximum consecutive free units in each page</font>
* ###### WAFA allocates the units using the *Clockwise Best-Fit (CBF)* strategy
* ###### Free and reforming operations

---

![](https://i.imgur.com/hNuYarr.jpg =60%x)

* ###### The pages are partitioned into 64 buckets according to the size of their current maximum consecutive free units (to quickly locate a page that satisfies a requested size)
* ###### The buckets are kept on a fixed NVM page for indexing the divided pages and the free intact pages
* ###### Pre and Next are two pointers (8 bytes each) that connect the pages in the corresponding bucket

---

### Avoiding heavy writes on the metadata of NVM pages

* ###### A set of temporary buckets caches the page metadata in DRAM during runtime; each temporary bucket points to a metadata cache
* ###### For each page of a bucket on NVM, we maintain a copy of its metadata in the metadata cache of the corresponding temporary bucket
* ###### For each page metadata, we only need to store its bitmap, $C_{FU}$, $S_{MS}$, $S_{FF}$, Pre, and Next in the metadata cache; thus, each 4KB page in the WAFA allocator needs 27B in DRAM (8B bitmap + 3 × 1B fields + 2 × 8B pointers)

-> *The DRAM space overhead of WAFA is about 0.66% (27B / 4KB) of the total NVM size managed by the allocator.*

---

### WAFA

##### **How WAFA works - 3**

* ###### WAFA divides the memory pages into multiple units of the minimum allocation size
* ###### WAFA groups the pages into multiple buckets according to the maximum consecutive free units in each page
* ###### <font color="#f00">WAFA allocates the units using the *Clockwise Best-Fit (CBF)* strategy</font>
* ###### Free and reforming operations

---

### CBF

##### *The main idea of CBF is twofold*

* ###### First, we allocate the best-fit consecutive free units for a memory allocation request, through which the allocated memory cells have the highest chance of being written after allocation
* ###### Second, the units are allocated in a clockwise order, so writes rotate over all units of a page instead of concentrating on a few

---

### CBF

###### The detailed allocation steps using CBF are shown in Algorithm 1

![](https://i.imgur.com/o4zO5Le.jpg =350x300) ![](https://i.imgur.com/hNuYarr.jpg =40%x)

* ###### $N_r$ is the necessary size (in units) of the segment for the allocation request
* ###### The while loop tries to allocate the required segment from the best-fit bucket; CBF obtains a free segment from a larger bucket when the current bucket is empty

---

* ###### CBF sets the ID of the allocated memory space, namely $A_r$, to the ID $Pos(P_i)$ of the first memory unit of the selected segment and updates the total number of free units in page $P_i$
* ###### In lines 6-7, if the maximum segment is fully used (i.e., $S_{MS}(P_i)==0$) and the page still has free unit(s), WAFA searches for the maximum segment in the rest of the run and updates the corresponding metadata
* ###### Finally, the page is moved to the tail of Bucket $S_{MS}(P_i)$; a C sketch of this allocation path follows
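----

### CBF allocation: a C sketch

###### A minimal, self-contained sketch of the CBF allocation path, paraphrased from the bullets above rather than copied from the paper's Algorithm 1. The bucket lists, the `cbf_alloc` signature, and the plain O(n) bitmap scan (the paper accelerates it with a 2-byte-slice lookup table, described on a later slide) are assumptions; a single `s_ff` field also stands in for both the paper's $S_{FF}$ and $Pos$.

```c
#include <stdint.h>
#include <stdio.h>

#define NBUCKETS 64   /* bucket k holds pages whose maximum segment is k units */
#define NUNITS   63   /* 64 units per 4KB page; the last unit holds metadata   */

typedef struct page {            /* in-DRAM copy of the page metadata */
    uint64_t bitmap;             /* bit i = 1 -> unit i is allocated  */
    uint8_t  c_fu, s_ff, s_ms;
    struct page *next;           /* singly linked here for brevity    */
} page_t;

static page_t *bkt_head[NBUCKETS + 1], *bkt_tail[NBUCKETS + 1];

static void bucket_push_tail(int k, page_t *p) {
    p->next = NULL;
    if (bkt_tail[k]) bkt_tail[k]->next = p; else bkt_head[k] = p;
    bkt_tail[k] = p;
}

static page_t *bucket_pop_head(int k) {
    page_t *p = bkt_head[k];
    if (p) { bkt_head[k] = p->next; if (!bkt_head[k]) bkt_tail[k] = NULL; }
    return p;
}

/* Longest run of free units in the rest of the run (units s_ff..NUNITS-1);
 * returns its size and stores its first unit ID in *pos. */
static int max_segment_in_rest(const page_t *p, int *pos) {
    int best = 0, cur = 0;
    for (int i = p->s_ff; i < NUNITS; i++) {
        if ((p->bitmap >> i) & 1) { cur = 0; continue; }
        if (++cur > best) { best = cur; *pos = i - cur + 1; }
    }
    return best;
}

/* Clockwise Best-Fit: allocate n_r consecutive units. Returns A_r, the ID
 * of the segment's first unit, or -1 if no divided page fits (the real
 * allocator would then divide a fresh page). */
static int cbf_alloc(page_t **out, int n_r) {
    int k = n_r;
    page_t *p = NULL;
    while (k <= NBUCKETS && !(p = bucket_pop_head(k)))
        k++;                          /* best-fit bucket first, then larger ones */
    if (!p) return -1;

    int a_r = p->s_ff;                /* allocate at the current "scale"         */
    for (int i = 0; i < n_r; i++)
        p->bitmap |= 1ULL << (a_r + i);
    p->c_fu -= n_r;                   /* update the total number of free units   */
    p->s_ff += n_r;                   /* the run advances clockwise              */
    p->s_ms  = k - n_r;               /* leftover of the selected segment (a
                                         simplification of Algorithm 1)          */

    /* lines 6-7: maximum segment fully used but free units remain ->
     * rescan the rest of the run for a new maximum segment */
    if (p->s_ms == 0 && p->c_fu > 0) {
        int pos = p->s_ff;
        p->s_ms = max_segment_in_rest(p, &pos);
        p->s_ff = pos;
    }
    bucket_push_tail(p->s_ms, p);     /* tail of Bucket S_MS(P_i) */
    *out = p;
    return a_r;
}

int main(void) {
    page_t pg = { .bitmap = 0, .c_fu = NUNITS, .s_ff = 0, .s_ms = NUNITS };
    bucket_push_tail(pg.s_ms, &pg);
    page_t *p;
    int a_r = cbf_alloc(&p, 4);       /* e.g., a 256B request = 4 units */
    printf("A_r = %d, S_FF = %d, S_MS = %d, C_FU = %d\n",
           a_r, p->s_ff, p->s_ms, p->c_fu);
    return 0;
}
```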
---

### Analysis on the Algorithm

* ###### The time complexity of step 2 (the while loop over the buckets) is *O(|Buckets|)*
* ###### In step 6, to get the maximum segment in the rest of the run, we adopt an algorithm that finds the maximum segment in O(n) time by caching the statuses of the parts of the bitmap

-> *As a result, the time complexity of CBF is O(|Buckets| + n)*

----

##### In the implementation, we only need 64 buckets. The 8-byte bitmap is divided into four 2-byte slices. Each 2-byte slice can be regarded as an integer ranging from 0 to 65535. We construct an auxiliary array to store all 65536 statuses of the slices, including the sizes of three parts of each slice.

##### Since a slice is 16 bits long, the size of each part can be represented in at most 4 bits. The array is placed on NVM since it is read-only. Thus, the space overhead of the auxiliary array is 96KB (65536 entries × 3 parts × 4 bits), which is negligible for memory management.

$T(n) = 4T(n/4) + O(1)$ ? (this recurrence solves to $O(n)$)

---

### WAFA

##### **How WAFA works - 4**

* ###### WAFA divides the memory pages into multiple units of the minimum allocation size
* ###### WAFA groups the pages into multiple buckets according to the maximum consecutive free units in each page
* ###### WAFA allocates the units using the *Clockwise Best-Fit (CBF)* strategy
* ###### <font color="#f00">Free and reforming operations</font>

---

### Free Operation

* ##### For a free operation, WAFA simply updates the page metadata related to the freed memory units in the metadata cache
    * ###### (1) The allocator finds the related page metadata by calculating it from the address of the freed units
    * ###### (2) The allocator resets the related bits of the bitmap in the corresponding metadata cache
    * ###### (3) The allocator updates $C_{FU}$ in the metadata cache

---

##### After executing many allocation and free operations, the divided pages in the system show the following states:

* ###### Many pages have completed a run and regained many free units via free operations
* ###### Many pages may contain many free units but are stuck in the last several units of the run
* ###### The maximum segment in a run may be enlarged by free operations, but $S_{MS}$ still stores the old value

---

### Reforming Operation

##### *Recalculates the maximum segment of the divided pages and backs up the metadata cache in DRAM to the corresponding page metadata in NVM*

###### -> Performed by a daemon thread to minimize the impact of reforming on the performance of allocation and free operations

---

* ##### A reforming operation is accomplished as follows (a C sketch of the free and reforming paths comes after the formula below):
    * ###### (1) The daemon thread calculates $TF_k.avg$ of bucket k
    * ###### (2) If $TF_k.avg > k$, for each page in bucket k, the thread calculates the maximum segment of the page. The $S_{MS}$ and Pos of the pages are reset according to the maximum segments
    * ###### (3) Then, WAFA adds the reformed pages to temporary buckets according to their $S_{MS}$
    * ###### (4) The thread merges each temporary bucket into the bucket of the allocator with the same $S_{MS}$

----

Let $NP_k$ be the total number of pages in bucket k; the average number of free units $TF_k.avg$ of bucket k is:

$TF_k.avg = \sum_{q=1}^{NP_k} C_{FU}(P_q) / NP_k$
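----

### Free and reforming: a C sketch

###### A minimal sketch of the free path and one reforming pass, reusing the hypothetical `page_t` and bucket lists from the CBF sketch. The trigger condition $TF_k.avg > k$ and steps (1)-(4) follow the slides; the function names, the integer-only average test, and the whole-bitmap rescan are assumptions.

```c
#include <stdint.h>

#define NBUCKETS 64
#define NUNITS   63

typedef struct page {                  /* as in the CBF sketch */
    uint64_t bitmap;
    uint8_t  c_fu, s_ff, s_ms;
    struct page *next;
} page_t;

static page_t *bkt_head[NBUCKETS + 1], *bkt_tail[NBUCKETS + 1];

static void bucket_push_tail(int k, page_t *p) {
    p->next = NULL;
    if (bkt_tail[k]) bkt_tail[k]->next = p; else bkt_head[k] = p;
    bkt_tail[k] = p;
}

/* Free operation: purely a DRAM metadata-cache update. Step (1), locating
 * the page metadata from the freed address, is assumed done by the caller. */
static void wafa_free(page_t *p, int first_unit, int n_units) {
    for (int i = 0; i < n_units; i++)              /* (2) reset bitmap bits */
        p->bitmap &= ~(1ULL << (first_unit + i));
    p->c_fu += n_units;                            /* (3) update C_FU       */
}

/* Maximum segment over the whole bitmap; *pos receives its first unit ID. */
static int max_segment(const page_t *p, int *pos) {
    int best = 0, cur = 0;
    for (int i = 0; i < NUNITS; i++) {
        if ((p->bitmap >> i) & 1) { cur = 0; continue; }
        if (++cur > best) { best = cur; *pos = i - cur + 1; }
    }
    return best;
}

/* One reforming pass over bucket k, run by the background daemon thread. */
static void reform_bucket(int k) {
    /* (1) TF_k.avg = sum of C_FU over the bucket / NP_k; compare with k
     * in integers: avg > k  <=>  sum > k * NP_k                           */
    long np = 0, sum = 0;
    for (page_t *p = bkt_head[k]; p; p = p->next) { np++; sum += p->c_fu; }
    if (np == 0 || sum <= (long)k * np)
        return;

    page_t *tmp[NBUCKETS + 1] = { 0 };             /* temporary buckets     */
    page_t *p = bkt_head[k];
    bkt_head[k] = bkt_tail[k] = NULL;
    while (p) {
        page_t *nxt = p->next;
        int pos = 0;
        p->s_ms = max_segment(p, &pos);            /* (2) reset S_MS ...    */
        p->s_ff = pos;                             /*     ... and Pos       */
        p->next = tmp[p->s_ms]; tmp[p->s_ms] = p;  /* (3) temporary bucket  */
        p = nxt;
        /* the real allocator would also flush the cached metadata back to
         * the page metadata on NVM at this point                          */
    }
    for (int s = 0; s <= NBUCKETS; s++)            /* (4) merge back        */
        while (tmp[s]) { page_t *q = tmp[s]; tmp[s] = q->next; bucket_push_tail(s, q); }
}

int main(void) {
    page_t pg = { .bitmap = 0x0FULL, .c_fu = NUNITS - 4, .s_ff = 4, .s_ms = 10 };
    bucket_push_tail(pg.s_ms, &pg);
    wafa_free(&pg, 0, 4);   /* free units 0..3                                  */
    reform_bucket(10);      /* avg free (63) > 10 -> page reformed to bucket 63 */
    return 0;
}
```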
---

### Evaluation

* ###### Environment : Linux kernel 4.4.4 with dedicated APIs
* ###### Compared with glibc malloc (denoted by malloc), nvm_alloc, and NVMalloc

----

* ###### The four allocators are evaluated with two types of workloads
    * ###### Memcached workloads
        * ###### Composed of 60K inserts and 40K random deletes
        * ###### Each operation involves a 10B key and a 256B value
    * ###### YCSB workloads
        * ###### Four smart-home scenarios, with one thousand records generated for each scenario
        * ###### The sizes of the records range from 4B to 32B
        * ###### The mixed four scenarios are repeatedly stored and destroyed one million times

---

### Analysis on the results

* ##### Effect on wear leveling - total write counts of pages
    * ###### Use the maximum allocation count of the units in a page as the wear count of the page
    * ###### Calculate the write count of every divided page and use the sum of the write counts of all the divided pages as the *total write counts*

----

![](https://i.imgur.com/IEWcIkO.png =50%x)

* ###### Memcached workloads : Compared with NVMalloc and nvm_alloc, WAFA reduces the total write counts by 17.9% and 38.2%, respectively
* ###### YCSB : WAFA shows 81.1% and 40.1% improvement over NVMalloc and nvm_alloc, respectively

---

* ##### Effect on wear leveling - write distribution of memory units
    * ###### Evaluating the write count distribution of the 64B memory units
    * ###### We set the free ratio to 80% for the basic workloads

----

![](https://i.imgur.com/fChqBVN.png)

----

### Memcached

|                         | WAFA        | NVMalloc     | nvm_alloc                                                                |
| ----------------------- | ----------- | ------------ | ------------------------------------------------------------------------ |
| Write counts (per unit) | less than 5 | less than 20 | 99% less than 24; four units are written 6211 times (highly imbalanced)   |
| Standard deviation      | 0.182       | 1.41         | 15.9                                                                      |

----

### YCSB

|                         | WAFA    | NVMalloc  | nvm_alloc |
| ----------------------- | ------- | --------- | --------- |
| Write counts (per unit) | 1 to 40 | 1 to 1043 | 1 to 912  |
| Standard deviation      | 2.93    | 101.1     | 97.5      |

---

* ##### Effect on Performance

![](https://i.imgur.com/sOnyWNj.png =50%x)

* ###### Memcached : Compared with NVMalloc and nvm_alloc, WAFA saves 48.6% and 42.3% of the execution time, respectively
* ###### YCSB : Compared with NVMalloc and nvm_alloc, the performance improvement of WAFA is 16.5% and 40.5%, respectively

---

### Conclusion

* ###### We studied the wear-leveling problem of NVM under fine-grained memory allocation requests
* ###### In WAFA, the pages are divided into basic units that are allocated in rotation for allocation requests (the CBF algorithm)
* ###### The experimental results verify that WAFA achieves a significant reduction in total write counts over existing memory allocators