---
title:
image:
description:
tags: Paper, NVM
---

# [Scalable high performance main memory system using phase-change memory technology](https://dl.acm.org/doi/10.1145/1555815.1555760)

>[introduction PDF](https://iscaconf.org/isca2009/pres/03.pdf)

## [Storage Memory Paper Pool](https://hackmd.io/@TsenEN/rkToMCQMD)

This link collects many papers about Non-volatile Memory (NVM) and Processing in Memory (PIM).

## ABSTRACT

The memory subsystem accounts for a significant share of the cost and power budget of a computer system, especially in DRAM-based main memory systems: DRAM must keep refreshing to avoid losing data even when it is idle, so it consumes power even when not being accessed. PCM needs no refresh and has higher density, so it has great potential to challenge DRAM's place in the memory hierarchy.

In this paper the authors:
1. Analyze a PCM-based hybrid main memory system using an architecture-level model of PCM.
2. Explore the trade-offs for a main memory system consisting of PCM storage coupled with a small DRAM buffer (the buffer can absorb write operations). Such an architecture has the latency benefits of DRAM and the capacity benefits of PCM.
3. ==Propose simple organizational and management solutions for the hybrid memory that reduce the write traffic to PCM== (writes are PCM's weak point).

:::info
Evaluations for a baseline system of 16 cores with 8GB DRAM show that, on average, PCM can reduce page faults by 5X and provide a speedup of 3X. The hybrid memory organization reduces the write traffic to PCM, boosting its lifetime from 3 years to 9.7 years.
:::

## 1. INTRODUCTION

:point_right:[CPU Basics: Multiple CPUs, Cores, and Hyper-Threading Explained](https://reurl.cc/x0LK4b)

Computer systems now tend to ==scale with multiple cores.== As a result, the number of concurrently running applications (or threads) increases, which in turn ==increases the combined working set of the system==, and the memory system must be capable of supporting this growth in the total working set. However, as the memory system grows, a significant portion of the total system power and cost is spent in the memory system. Therefore, we need to study ==new memory technologies that can provide more memory capacity at low cost and low energy consumption.==

Two promising technologies fulfill these criteria:
1. Flash
2. Phase-Change Memory (PCM)

They can provide a much higher capacity for the memory system than DRAM can within the same budget.

<img style = "display:block; margin:auto;" src="https://i.imgur.com/HfzkX5n.jpg"></img>

Figure 1 shows the typical access latency (in cycles, assuming a 4GHz machine) of different memory technologies.

Flash: Flash-based disk caches have already been proposed to bridge the gap between DRAM and hard disk and to reduce the power consumed in HDDs. However, Flash is far slower than DRAM (about $2^8$ times), so it is still important to increase DRAM capacity to reduce the accesses to the Flash-based disk cache.

PCM: The access latency of PCM is much closer to DRAM, and coupled with its density advantage, PCM brings high density as well as low cost and low power consumption. Furthermore, PCM cells can sustain 1000x more writes than Flash cells, so PCM's lifetime is higher than Flash's.

Several challenges must be overcome before PCM can become part of the main memory system:
1. ==PCM is much slower than DRAM.==
    * This adversely impacts system performance.
2. ==PCM write endurance is not as good as DRAM.==
    * The write traffic to these devices must be reduced.
    * The short lifetime may significantly limit the usefulness of PCM for commercial systems.

To be independent of the choice of a specific PCM prototype, the paper uses an abstract memory model that is D times denser than DRAM and S times slower than DRAM (S ≈ 4, D ≈ 4). A main memory system using PCM can reduce page faults by 5X, and hence execute applications with much larger working sets.

The paper argues that PCM is unlikely to be a drop-in replacement for DRAM, ==and shows that by having a small DRAM buffer in front of the PCM memory, we can make the effective access time and performance closer to a DRAM memory.== ==It studies the design issues in such a hybrid memory architecture and shows how a two-level memory system can be managed==, and develops an analytical model of the impact of write traffic on the lifetime of PCM, relating the "bytes per cycle" written to the average lifetime of PCM for a given endurance (maximum number of writes per cell). ==The paper shows that architectural choices and simple enhancements can reduce the write traffic by 3X==, which increases the average lifetime from 3 years to 9.7 years. (I personally think this is a huge contribution to PCM hybrid systems.)

## 2. BACKGROUND AND MOTIVATION

Critical computing applications are becoming more data-centric than compute-centric. One of the major challenges in the design of large-scale, high-performance computer systems is maintaining the performance growth rate of the system memory. The disk is five orders of magnitude slower than the rest of the system, ==making frequent misses in system main memory a major bottleneck to system performance.== Furthermore, main memory consisting entirely of DRAM is already hitting power and cost limits. Given these problems, exploiting Phase-Change Memory (PCM) and Flash becomes crucial for building larger-capacity memory systems in the future while remaining within the overall system cost and power budgets.

### 2.1 What is Phase-Change Memory?

PCM is a type of non-volatile memory that exploits the property of chalcogenide glass to switch between two states, amorphous and crystalline, with the application of heat using electrical pulses.

The amorphous phase:
* Low optical reflectivity and ==high electrical resistivity.==
* Melt-quenching the material is called the RESET operation, and it makes the material amorphous.
* The RESET operation is controlled by ==high-power pulses==, which place the memory cell in a high-resistance state ==that logically stores a 0.==

The crystalline phase (or phases):
* High reflectivity and ==low resistance.==
* Heating the material above the crystallization temperature (but below the melting temperature) is called the SET operation.
* The SET operation is controlled by moderate-power, ==long-duration electrical pulses==; this returns the cell to a low-resistance state ==that logically stores a 1.==

The difference in resistance between the two states is typically about five orders of magnitude and can be used to infer the logical states of binary data. The data stored in the cell is retrieved by sensing the device resistance with a very low-power pulse. The current-voltage curve for PCM is shown in Figure 3.

<img style = "display:block; margin:auto;" src="https://i.imgur.com/wWUak7u.jpg"></img>

### 2.2 Why PCM for Main Memory?
Because the Table 1 data in the paper is now quite dated (2009), here I use the numbers provided by a [2017 paper](https://reurl.cc/Q37NLM) instead.

Table 1: Characteristics of NVMs According to State-of-the-art Studies

<img style = "display:block; margin:auto;" src="https://i.imgur.com/BHvEmuK.png"></img>

PCM is denser than DRAM, and two logical bits can even be stored in one physical cell. This means four states with different degrees of partial crystallization are possible. For more material comparisons, you can refer to this paper note I wrote: [Emerging NVM: A Survey on Architectural Integration and Research Challenges](https://reurl.cc/Q37NLM)

Here is where PCM needs to improve:
* ==The write latency of PCM is about an order of magnitude slower than its read latency.== (The read latency of PCM is similar to NOR Flash, which is only about 4X slower than DRAM.) However, write latency is typically not on the critical path and can be tolerated using buffers.

### 2.3 Abstract Memory Model

In order not to restrict the evaluation of PCM to currently available prototypes, the paper adopts a generic memory model that combines existing DRAM memory with a PCM memory. The PCM technology is described as $\{D, S, W_{max}\}$, which means the PCM is $D$ times denser than DRAM, has a read latency $S$ times slower than DRAM, and can endure a maximum of $W_{max}$ writes per cell. The paper assumes $\{D = 4, S = 4, W_{max} = 10\ \text{Million}\}$.

The next section describes the organization and system performance challenges of such a hybrid main memory system comprising different technologies.

## 3. HYBRID MAIN MEMORY

<img style = "display:block; margin:auto;" src="https://i.imgur.com/S0Oygvs.jpg"></img>

The paper uses Figure 4 to explain why it ends up with $(d)$.

$(a)$: This is a traditional system. ==It is slow because the disk is five orders of magnitude slower than the rest of the system==, so if the working set is too large, accesses to the disk are inevitable.

$(b)$: ==Flash-based disk caches bridge the gap between the disk and DRAM==, and can even completely replace the disk; for example, the MacBook Pro laptop (2020) has DRAM backed by a 1TB Flash drive. The disk is heavy and not shockproof, so some systems have only Flash-based storage without hard disks. However, ==because there is still a two-orders-of-magnitude difference in access latency between DRAM and the next level of storage, a large amount of DRAM main memory is still needed to avoid going to the disks.==

$(c)$: PCM can be used instead of DRAM to increase main memory capacity. However, the relatively ==higher latency of PCM compared to DRAM significantly decreases system performance==, and PCM cannot support a large number of writes because ==its write endurance is low compared to DRAM==.

Therefore, to get the best of both capacity and latency, Figure 4 $(d)$ shows the hybrid system the authors foresee emerging for future high-performance systems. ==They show that a relatively small DRAM buffer (3% of the size of the PCM storage) can bridge most of the latency gap between DRAM and PCM.==

### 3.1 Hybrid Main Memory Organization

<img style = "display:block; margin:auto;" src="https://i.imgur.com/nqNWp2u.jpg"></img>

In the hybrid main memory organization, the PCM storage is managed by the Operating System (OS) using a Page Table. ==The DRAM buffer== is organized like a hardware cache that ==is not visible to the OS== and is managed by the DRAM controller.

:::info
This is an important technique. When you access PCM, you actually access this buffer (the operating system is unaware of it). This improves access performance and reduces the number of writes to PCM, thereby increasing the PCM lifetime.
:::
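The following back-of-the-envelope Python sketch is my own illustration (not from the paper) of why a small, OS-invisible DRAM buffer brings the hybrid memory's effective access time close to DRAM. The latency ratio S = 4 follows the abstract model of Section 2.3; the DRAM latency in cycles and the hit rates are made-up numbers for illustration only.

```python
# Illustrative sketch (mine, not from the paper): average memory access time
# when DRAM-buffer hits are served at DRAM speed and misses go to PCM,
# which is S times slower than DRAM under the abstract model of Section 2.3.

def effective_access_time(t_dram: float, s: float, dram_hit_rate: float) -> float:
    """Average access time for a given DRAM-buffer hit rate."""
    t_pcm = s * t_dram                      # PCM read is S x DRAM latency
    return dram_hit_rate * t_dram + (1.0 - dram_hit_rate) * t_pcm

if __name__ == "__main__":
    T_DRAM_CYCLES = 200                     # assumed DRAM latency in CPU cycles
    for hit_rate in (0.0, 0.90, 0.99):      # 0.0 ~ PCM-only, higher ~ hybrid
        t = effective_access_time(T_DRAM_CYCLES, s=4, dram_hit_rate=hit_rate)
        print(f"DRAM-buffer hit rate {hit_rate:.0%}: {t:.0f} cycles per access")
```

With these assumed numbers, a 90%+ hit rate in the buffer already pulls the average access time most of the way from PCM-only (4x DRAM) back toward DRAM, which is the intuition behind Figure 4 (d).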
### 3.2 Lazy-Write Organization

When a page fault is serviced, the page fetched from the hard disk (HDD) is written only to the DRAM cache. Although allocating a page table entry at the time of the page fetch from HDD automatically allocates space for this page in the PCM, ==the allocated PCM page is not written with the data brought from the HDD.==

To track the pages present only in the DRAM, and not in the PCM, ==the DRAM tag directory is extended with a "presence" (P) bit.==
* When the page from HDD is stored in the DRAM cache, the P bit in the DRAM tag directory is set to 0.
* In the "lazy write" organization, a page is written to the PCM only when it is evicted from the DRAM storage and either the P bit is 0 or the dirty bit is set.

:::warning
Here we need to consider whether the updated data has been safely written to NVM. If the updated data is still in DRAM and the power is suddenly cut off, the correct data will be lost and stale data will remain in NVM. There are several solutions:
1. Add a small battery. When the power is suddenly cut off, the battery powers the DRAM long enough to write the data back to the NVM.
2. The [NOVA file system](https://www.usenix.org/system/files/conference/fast16/fast16-papers-xu.pdf): whenever data is updated, a log is written into the PCM, and the log can be used to redo or undo operations when the system is restarted.

These solutions still have costs:
1. The battery approach adds hardware and a flush procedure on power failure.
2. The logging approach needs to write the log to PCM, which reduces PCM lifetime.
:::

* If, on a DRAM miss, the page is fetched from the PCM, then the P bit in the DRAM tag directory entry of that page is set to 1.
* When a page with the P bit set is evicted from the DRAM, it is not written back to the PCM unless it is dirty.

==The paper assumes that the tags of both the write queue and the DRAM buffer are made of SRAM== so that these structures can be probed with low latency. A write queue of 100 pages is sufficient to avoid stalls due to the write queue being full.

The "lazy write" architecture avoids the first write in the case of dirty pages. For example, consider the daxpy kernel: Y[i] = Y[i] + X[i]

Baseline: 2 writes for the pages of Y and 1 for the pages of X
* Had the pages for Y been installed in the PCM at fetch time, they would have been written twice: once on fetch and again on writeback at the time of eviction from DRAM.

Lazy write: 1 write for the pages of Y and 1 for the pages of X
* The PCM gets a copy of the pages of Y only after they have been read, modified, and evicted from DRAM. The number of PCM writes for the read-only pages of X remains unchanged in both configurations.

### 3.3 Line-Level Writes

Here we need to discuss a technique proposed by the paper, ==Line-Level WriteBack (LLWB)==. ==LLWB aims to reduce the amount of data written to PCM== by writing to the PCM memory in smaller chunks instead of a whole page: if writes to a page can be tracked at the granularity of a processor's cache line, the number of writes to the PCM page can be minimized by writing only the "dirty" lines within a page. To do so, the DRAM tag directory shown in Figure 5 is extended to hold a "dirty" bit for each cache line in the page.

When a dirty page is evicted from the DRAM, if the P bit is 1 (i.e., the page is already present in the PCM), only the dirty lines of the page are written to the PCM. When the P bit of a dirty page chosen for eviction is 0, all the lines of the page have to be written to the PCM.
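To make the eviction rules of Sections 3.2 and 3.3 concrete, here is a minimal Python sketch of the decision the DRAM controller makes when a page leaves the DRAM buffer. It is my own sketch, not the paper's implementation; the names are assumptions, and the 16-lines-per-page figure anticipates the example used in Section 3.4.

```python
# Minimal sketch (mine) of the lazy-write + line-level writeback (LLWB)
# eviction policy from Sections 3.2-3.3. A tag-directory entry carries a
# P bit and one dirty bit per line, as described for Figure 5.

from dataclasses import dataclass, field
from typing import List

LINES_PER_PAGE = 16

@dataclass
class DramPage:
    p_bit: int                      # 1 = page already present in PCM
    dirty: List[bool] = field(default_factory=lambda: [False] * LINES_PER_PAGE)

def lines_to_write_on_eviction(page: DramPage) -> List[int]:
    """Return the line indices that must be written to PCM on eviction."""
    if page.p_bit == 1:
        # Page already lives in PCM: write back only the dirty lines (LLWB),
        # or nothing at all if the page is clean.
        return [i for i, d in enumerate(page.dirty) if d]
    # P bit == 0: PCM never received this page, so every line must be
    # written on eviction, dirty or not (the "lazy" first write).
    return list(range(LINES_PER_PAGE))

# A page fetched from PCM (P=1) with two modified lines writes back 2 lines.
page = DramPage(p_bit=1)
page.dirty[0] = page.dirty[5] = True
assert lines_to_write_on_eviction(page) == [0, 5]

# A clean page fetched from disk (P=0) still costs a full-page write.
assert lines_to_write_on_eviction(DramPage(p_bit=0)) == list(range(LINES_PER_PAGE))
```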
### 3.4 Fine-Grained Wear-Leveling for PCM

Here we have to consider a problem caused by LLWB; see Figure 6.

<img style = "display:block; margin:auto;" src="https://i.imgur.com/WsFTw0o.jpg"></img>

For both db1 and db2 there is significant non-uniformity in which lines of a page are written back. For example, in db1, ==Line 0 is written twice as often as the average line. This means Line 0 may suffer an endurance-related failure in half the time.== The lifetime of PCM can be increased if the writes can be made uniform across all lines in the page. So the paper proposes a technique, ==Fine-Grained Wear-Leveling (FGWL), which uses a random rotate value generated by a Pseudo-Random Number Generator (PRNG) and stored in the WearLevelShift (W) field.==

Rotate value: For a system with 16 lines per page, the rotate amount is between 0 and 15 lines. If the rotate value is 0, the page is stored in the traditional manner. If it is 1, then Line 0 of the address space is stored in Line 1 of the physical PCM page, each line is stored shifted by one, and Line 15 of the address space is stored in Line 0. When a PCM page is read, it is realigned. The pages are written from the Write Queue to the PCM in the line-shifted format.

On a page fault, when the page is fetched from the hard disk, the PRNG is consulted to get a random 4-bit rotate value, and this value is stored in the WearLevelShift (W) field associated with the PCM page as shown in Figure 5. The value remains constant until the page is replaced, at which point the PRNG is consulted again for the new page allocated in the same physical space of the PCM. A PCM page is occupied by different virtual pages at different times and is replaced often (several times an hour). Therefore, over the lifetime of the PCM page (in years), the random rotate value associated with each page will be uniformly distributed over 0 through 15.

<img style = "display:block; margin:auto;" src="https://i.imgur.com/8ezW85w.jpg"></img>

Figure 7 shows the write traffic per dirty DRAM page with FGWL, for db1 and db2.
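Below is a minimal Python sketch (my own, not the paper's hardware) of the FGWL rotation described above: lines are stored shifted by a per-page random rotate value between 0 and 15 and realigned on a read. The function names are assumptions.

```python
# Sketch (mine) of FGWL line rotation from Section 3.4.

import random

LINES_PER_PAGE = 16

def new_rotate_value(rng: random.Random) -> int:
    """PRNG consulted on a page fault: a 4-bit rotate amount for the new page."""
    return rng.randrange(LINES_PER_PAGE)

def shift_for_store(lines: list, rotate: int) -> list:
    """Address-space line i is stored in physical line (i + rotate) % 16."""
    stored = [None] * LINES_PER_PAGE
    for i, data in enumerate(lines):
        stored[(i + rotate) % LINES_PER_PAGE] = data
    return stored

def realign_on_read(stored: list, rotate: int) -> list:
    """Undo the rotation when the PCM page is read back."""
    return [stored[(i + rotate) % LINES_PER_PAGE] for i in range(LINES_PER_PAGE)]

# Round-trip check with rotate value 1: Line 0 lands in physical Line 1,
# Line 15 wraps around into physical Line 0, and a read realigns the page.
page = [f"L{i}" for i in range(LINES_PER_PAGE)]
stored = shift_for_store(page, rotate=1)
assert stored[1] == "L0" and stored[0] == "L15"
assert realign_on_read(stored, rotate=1) == page
```

Because the rotate value is re-drawn every time the physical PCM page is re-allocated, any "hot" line of a virtual page lands on a different physical line each time, which is what evens out the per-line wear over years.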
### 3.5 Page Level Bypass for Write Filtering

==Not all applications benefit from more memory capacity.== For example, ==streaming applications== typically access a large amount of data but have poor reuse. ==Such applications do not benefit from the capacity boost provided by PCM.==

Since PCM serves as the main memory, it is necessary to allocate space in PCM when a page table entry is allocated for a page. However, the actual writing of such pages to the PCM can be avoided by leveraging the lazy-write architecture. The paper calls this ==Page Level Bypass (PLB).== When a page is evicted from DRAM, PLB invalidates the Page Table Entry associated with the page and does not install the page in PCM. ==The paper assumes that the OS enables/disables PLB for each application using a configuration bit. If the PLB bit is turned on, all pages of that application bypass the PCM storage.==

## 4. EXPERIMENTAL METHODOLOGY

### 4.1 Workloads

<img style = "display:block; margin:auto;" src="https://i.imgur.com/AJfYh8T.jpg"></img>

MAPKI = memory accesses per thousand instructions.

### 4.2 Configuration

<img style = "display:block; margin:auto;" src="https://i.imgur.com/EjhX2vO.jpg"></img>

## 5. RESULTS AND ANALYSIS

### 5.1 Page Faults vs. Size of Main Memory

<img style = "display:block; margin:auto;" src="https://i.imgur.com/5i9ygLA.jpg"></img>

The main thing to pay attention to here is the characteristics of each data set. Increasing the size of main memory benefits workloads such as db1, db2, qsort, bsearch, gauss, and kmeans, which frequently reuse a data set larger than 16GB. These benchmarks suffer a large number of page misses unless the memory size is 32GB. Daxpy and vdotp do not reuse a page after the initial accesses, so their number of page faults is independent of main memory size. ==Overall, a 4X increase (8GB to 32GB) in the memory capacity due to the density advantage of PCM reduces the average number of page faults by 5X.==

### 5.2 Cycles Per Memory Access

<img style = "display:block; margin:auto;" src="https://i.imgur.com/9SCyFTK.jpg"></img>

The comparison can be divided into two groups.

{8GB DRAM vs. 32GB PCM}: ==If the data set resident in main memory is frequently reused, then the page-fault savings of the PCM system may be offset by the increased memory access penalty on each access.== For db2, qsort, bsearch, kmeans, and gauss, the increased capacity of PCM reduces average memory access time because of fewer page faults. For db1, even though the increased capacity of PCM reduced page faults by 46%, the average memory access time increases by 59% because the working set is frequently reused.

{32GB DRAM vs. 1GB DRAM + 32GB PCM}: Having a 1GB DRAM buffer along with PCM makes the average memory access time much closer to that of the expensive 32GB DRAM system. ==This is a major breakthrough because DRAM is quite expensive and power-hungry compared to PCM==, and it also points out that PCM alone as main memory is not a good design.

### 5.3 Normalized Execution Time

<img style = "display:block; margin:auto;" src="https://i.imgur.com/9nPCi5i.jpg"></img>

The reduction in average memory access time correlates well with the reduction in execution time. For five out of the eight benchmarks, execution time reduces by more than 50% with the hybrid memory system. Thus, the hybrid configuration provides a performance benefit similar to increasing the memory capacity by 4X using DRAM, while incurring only about 13% area overhead; a DRAM-only system would require 4X the area.

### 5.4 Impact of Hard-Disk Optimizations

<img style = "display:block; margin:auto;" src="https://i.imgur.com/rJ1PMPK.jpg"></img>

Regardless of the HDD optimizations, hybrid main memory provides significant speedups, because the larger PCM-backed capacity avoids most of the slow HDD accesses.

### 5.5 Impact of PCM Latency

<img style = "display:block; margin:auto;" src="https://i.imgur.com/H12MEnx.jpg"></img>

Figure 12 shows that a PCM-only system suffers a much larger access delay and a greatly reduced lifetime. This is the paper's main point: the hybrid system is better than a PCM-only system.

### 5.6 Impact of DRAM-Buffer Size

<img style = "display:block; margin:auto;" src="https://i.imgur.com/UU5u0ST.jpg"></img>

The DRAM buffer reduces the effective memory access time of the PCM-based memory system relative to an equal-capacity DRAM memory system. Except for bsearch, all benchmarks have an execution time with a 1GB DRAM buffer very close to that of the 32GB DRAM system. Thus, a 1GB DRAM buffer provides a good trade-off between performance benefit and area overhead.

### 5.7 Storage Overhead of Hybrid Memory

<img style = "display:block; margin:auto;" src="https://i.imgur.com/97REBUn.jpg"></img>

The paper assumes that the tags of the DRAM buffer are made of SRAM in order to have fast lookup latency.
Relative to the 8GB of DRAM main memory in the baseline, the total overhead is still less than 13% (pessimistically assuming SRAM cells take 20 times the area of DRAM cells [2]).

### 5.8 Power and Energy Implications

<img style = "display:block; margin:auto;" src="https://i.imgur.com/dHOFnGK.jpg"></img>

In the hybrid memory, PCM consumes only 20% of the power. This is because the DRAM buffer filters out most of the PCM accesses and the proposed mechanisms further reduce write accesses to PCM. These results confirm that the PCM-based hybrid memory is a practical, power-performance-efficient architecture for increasing memory capacity.

## 6. IMPACT OF WRITE ENDURANCE

For an F Hz processor that operates for Y years, the PCM main memory must last for $Y \cdot F \cdot 2^{25}$ processor cycles, given that there are ≈ $2^{25}$ seconds in a year. Let the PCM be of size S bytes, written at a rate of B bytes per cycle, and let $W_{max}$ be the maximum number of writes that can be done to any PCM cell. Assuming writes can be made uniform over the entire PCM capacity, we have

<font size = 5>$(1)\ \frac{S}{B} \cdot W_{max} = Y \cdot F \cdot 2^{25}$

$(2)\ W_{max} = \frac{B \cdot Y \cdot F \cdot 2^{25}}{S}$</font>

Thus, a 4GHz system with S = 32GB written at 1 byte/cycle must have $W_{max} ≥ 2^{24}$ to last 4 years. Table 5 shows the average bytes per cycle (BPC) written to the 32GB PCM storage.

<img style = "display:block; margin:auto;" src="https://i.imgur.com/IdSec5F.jpg"></img>

The PCM-only system has a BPC of 0.317 (an average lifetime of 7.6 years), but that is primarily because the system takes much more time than DRAM to execute the same program. Therefore, the lifetime of the PCM-only system is not useful, and is in fact misleading.
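As a sanity check of the endurance formula, here is a small Python computation (mine) plugging in the paper's example parameters: a 4GHz processor, 32GB of PCM, 1 byte written per cycle, and a 4-year target. The function names are my own.

```python
# Worked check (mine) of Equations (1) and (2) from Section 6.

import math

SECONDS_PER_YEAR = 2 ** 25        # approximation used by the paper

def required_wmax(bytes_per_cycle: float, years: float, freq_hz: float, size_bytes: float) -> float:
    """Equation (2): W_max = B * Y * F * 2^25 / S."""
    return bytes_per_cycle * years * freq_hz * SECONDS_PER_YEAR / size_bytes

def lifetime_years(size_bytes: float, bytes_per_cycle: float, wmax: float, freq_hz: float) -> float:
    """Equation (1) solved for Y: average lifetime for a given endurance."""
    return (size_bytes / bytes_per_cycle) * wmax / (freq_hz * SECONDS_PER_YEAR)

GIB = 2 ** 30
wmax = required_wmax(bytes_per_cycle=1, years=4, freq_hz=4e9, size_bytes=32 * GIB)
print(f"required endurance ~= 2^{math.log2(wmax):.1f} writes per cell")   # ~2^23.9, i.e. about 2^24
print(f"round trip: {lifetime_years(32 * GIB, 1, wmax, 4e9):.1f} years")  # 4.0 by construction
```

The required per-cell endurance comes out to roughly $2^{24}$ writes, matching the paper's example.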