---
title:
image:
description:
tags: Paper, NVM
---

# [Emerging NVM: A Survey on Architectural Integration and Research Challenges](https://reurl.cc/Q37NLM)

## [Storage Memory Paper Pool](https://hackmd.io/@TsenEN/rkToMCQMD)

This link collects many papers on non-volatile memory (NVM) and processing in memory (PIM).

## Why Read This Paper

To understand which materials are currently available and which problems remain open. Grasp the main outline first, and the details will fall into place.

## Materials

* Magnetoresistive random access memory (MRAM)
* Phase Change random access memory (PCM)
* Resistive random access memory (ReRAM)
* Ferroelectric random access memory (FeRAM)

## 1 INTRODUCTION

Volume:
* There is too much data, so more devices are needed to store it.
* Digital data is growing at an alarming rate, driven for example by the Internet of Things (11 billion devices are connected to the Internet to date), social networking, video playback, and transaction processing. All of this data requires more storage devices.

Performance:
* The memory hierarchy has a bottleneck at the storage level.
* Data processing involves both the CPU and memory, and handling data efficiently yields better results. However, there is a large gap between CPU speed and the existing memory architecture, and the bottleneck lies in the storage system.

Energy:
* Storage systems account for up to nearly half of the system's energy consumption, and DRAM's data-preservation mechanism (refresh) is particularly energy-hungry.
* In the United States, data centers consume more than 1.5% of the national energy, and papers predict that this share will keep growing every year. Storage systems are estimated to account for 20 to 40% of total system consumption. Other studies report that DRAM accounts for 30 to 50% of the consumption of an HPC (High Performance Computing) node, depending on DRAM configuration and capacity.

Scalability:
* Devices cannot be shrunk much further.
* DRAM has made impressive progress in size and capacity over the past years, but many studies indicate that it will hit a wall within 5 to 10 years. A DRAM cell must be large enough to be sensed reliably, which directly conflicts with shrinking the feature size. As a result, the cost of DRAM grows exponentially with density (e.g., in 2008, using 1 DIMM 8GB cost $212/GB while 2 × 4GB DIMMs cost $50/GB, and 4 × 2GB DIMMs cost $15/GB (Mogul et al. 2009)).

The objective of this survey is threefold:

1. Gain a deeper understanding of the characteristics of NVMs.
2. Study the horizontal and vertical integration of NVMs into the memory hierarchy.
3. Present the challenges involved in building NVM-based systems.

Related surveys:

* [Mittal et al. (2015b)](https://reurl.cc/9Xpje8): the cache levels of the memory hierarchy.
* [Mittal and Vetter (2016)](https://reurl.cc/WL1555): software optimizations.
* [Xia et al. (2015)](https://reurl.cc/2gm2pa): all PCM-related techniques.

## 2 BACKGROUND

### 2.1 Overview on Memory Hierarchy

<img style = "display:block; margin:auto;" src="https://i.imgur.com/FUa5Pho.jpg"></img>

Architectural point of view

* Vertically
    * Shift the current memory hierarchy downward and replace some of its devices.
    * [Emerging Standalone NVM – Moving Beyond the Hype](https://reurl.cc/n05Lnl)
        1. **A DRAM replacement in the optimistic case**
        2. **A Flash replacement**
        3. **A new memory tier between DRAM and storage**
* Horizontally
    * The NVM sits beside the current memory hierarchy as a complement (to DRAM); the system has to provide two APIs to operate it.
    * It can share the integrated bus of the same level.
    * The SNIA "NVM Programming Model (NPM)" mentioned in this paper is implemented today by [PMDK](https://pmem.io/pmdk/).

### 2.2 Flash Memory, a Pioneer NVM

#### Properties:

* High capacity
* Low idle power
* Faster access than HDDs (especially for random reads)
* Some drawbacks: read/write performance asymmetry, a limited lifetime, write/erase operations granularity asymmetry, and so on.

#### 2.2.1 Flash Memory Integration

1. As an extension of the memory system (horizontal integration).
2. As an HDD cache storing frequently accessed code and data, which reduces the number of I/Os that reach the HDD (vertical integration).
3. As secondary storage, where flash SSDs replace or complement HDDs (horizontal integration again).
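
The two horizontal integrations mentioned above sit at different levels (main memory versus secondary memory). When flash is integrated with DRAM, the integration is transparent to applications, because dedicated hardware manages the flash so that it complements the DRAM.

* [Design and implementation of flash based NVDIMM]( https://reurl.cc/7o30m9 )
* [Bridging the I/O performance gap for big data workloads: A new NVDIMM-based approach]( https://reurl.cc/Grbj7x )

:::info
Note: when NVM sits alongside DRAM, it differs from DRAM from a storage standpoint. DRAM does not care about data consistency (data may be lost), but NVM does: data must never be half-written (a write either completes or does not happen, i.e., it must be atomic), so consistency is required. The levels above the NVM also need a flush to write their data down to it.
:::

The second horizontal option (flash as secondary storage) needs no new interface: it is exposed to the system through the traditional storage system interface and accessed through standard file system interfaces, just like an HDD. Put simply, it plugs straight into the motherboard (because it is PCIe compliant), needs no extra tool to communicate, and still improves performance.

:::info
To summarize: the API mentioned above is a software-level means for user-mode programmers to manage NVM, whereas here the NVDIMM is controlled directly by hardware (the DIMM controller). Note that any non-volatile material qualifies. Under hardware control, it looks just like DRAM to the programmer.
:::

In terms of cost, however, flash is far more expensive than HDDs and has a limited lifetime, so it still cannot replace disks.

#### 2.2.2 Flash Memory Properties and Management

Compared with traditional magnetic storage systems, NAND flash memory has many unique properties, which has led researchers to propose many dedicated solutions for integrating it.

#### Properties:

* [Flash Memory Integration](https://reurl.cc/q8OgWD).
* **Random reads:** Flash memory has an advantage over HDDs for read access. HDD read performance depends heavily on the physical location of the data and is especially poor for random reads, so operating systems developed many strategies to improve performance (e.g., I/O scheduler, page cache) as well as disk scheduling algorithms (e.g., FCFS, SCAN, C-LOOK). At user level, many applications also suppose an underlying HDD and thus try to maximize sequential requests while minimizing random ones, such as database management systems (e.g., sorting algorithms) and big data applications (e.g., MapReduce).
* **I/O performance asymmetry:** Flash memory generally reads fast and writes slowly, although some designs (e.g., a write buffer) can make writes appear faster than reads. Writes also suffer from write amplification, which is a research topic of its own (flash write amplification).
* **Sequential writes in a block:** Because electrical interference can cause write errors, writes must be performed sequentially within one block, especially for MLC (Multi-Level Cell) technology, and error correction codes (ECC) are implemented in hardware to cope with such errors.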
* [SLC, MLC, TLC](https://www.transcend-info.com/Support/FAQ-393)
* **Low power consumption:** Flash consumes less power than HDDs.
* **I/O interface:** SSD I/O interfaces are similar to those of HDDs, so the OS needs no modification, but this may mislead people into thinking that flash memory must be used the same way as HDDs.
* **Shock resistance:** Traditional HDDs contain precision mechanical parts and do not tolerate shocks; flash memory does not have this weakness.

#### Designing flash-based solutions means facing the following issues:

**Erase-before-write limitation:** Data cannot be overwritten in place as on a hard disk. To modify data in place, the whole block must be erased first (flash can only be erased in larger units, blocks, each made of several pages), which is very time-consuming.

* Current solution: [mapping scheme & garbage collector](https://reurl.cc/m9GbEW)
    * Updated data is written out of place and the old data is invalidated; a garbage collector later reclaims the invalidated data.

**Write/Erase granularity asymmetry:** Writes are performed on pages, while erase operations are realized on blocks. A flash memory block consists of a given number of pages, with page sizes between 2 and 8 KB (powers of 2).

**Limited number of Erase/Write cycles:**

* Single-level cell (SLC): the average number of write/erase (W/E) cycles is approximately 10^5^
* Multi-level cell (MLC): the average number of write/erase (W/E) cycles is approximately 10^4^
* Triple-level cell (TLC): the average number of write/erase (W/E) cycles is approximately 5000
* Wear leveling techniques are used to balance writes across blocks.

### 2.3 NVM, Why and How?

NVM can be inserted into the memory hierarchy at three places: processor caches, main memory, or storage systems.

#### 2.3.1 As a Storage System

* Vertically: such as PCIe-based flash devices.
* Horizontally: similar to flash memory (some constraints must be managed, which requires additional work).

#### 2.3.2 As a Main Memory

1. The NVM can directly replace DRAM or form a hybrid main memory (placed next to DRAM). Because the materials differ, extra services must be provided according to their peculiar characteristics.
2. Integrated vertically with DRAM: DRAM acts as a cache for the NVM, which reduces the primary memory access latency (compared to HDDs).
3. Horizontally: the NVM can share the DRAM bus. Data placement can be managed by hardware (so the OS is not involved) or by the OS (either abstracted away from the application layer, or by exposing some instructions that let the applicative layer guide placement decisions).

#### 2.3.3 As a Processor Cache

A first-level cache requires low latency and high endurance, and few NVMs satisfy both. NVM does, however, meet the high-density requirement of last-level caches (mainly to avoid data movement); see [NVMs integrated in last-level cache](https://reurl.cc/VXjAaR). A last-level cache must nevertheless absorb many write operations.

Table 1: Characteristics of NVMs According to State-of-the-art Studies

<img style = "display:block; margin:auto;" src="https://i.imgur.com/BHvEmuK.png"></img>

[Table 2] See the paper. The entries worth noting are the vertical integration of MRAM in processor caches and the horizontal integration of PCM in main memory. Techniques for PCM are also relevant to ReRAM.

## 3 PCM (PRAM)

#### Advantages:

1. small-sized cells
2. excellent scalability
3. fast random access
4. moderate write throughput
5. good retention time
6. [erase-less programming (no need for an erase operation to update data)](https://reurl.cc/WLrE7D)

### 3.1 Basic Concepts

#### 3.1.1 PCM Technology

PCM stores a value as a resistance level; heating the material with an electric current changes its phase and therefore its resistance. The amorphous state (reached with a short, high-amplitude RESET pulse) has high resistance (0), and the crystalline state (reached with a long, low-amplitude SET pulse) has low resistance (1), so write latency is usually dominated by the SET pulse.

The ratio between the resistance of the material in a SET and a RESET phase is between 10^2^ and 10^4^. Because this resistance range is so wide, MLC and TLC cells can be manufactured, but programming the intermediate values may require several iterations of programming and verification, involving a few SET operations.

<img style = "display:block; margin:auto;" src="https://i.imgur.com/cHV5PtF.png"></img>

#### 3.1.2 Write Endurance

PCM has a serious problem: endurance. A cell can sustain a limited number of write operations (generally 10^8^), because repeated heating causes thermal expansion and contraction that degrades the electrode-storage contact, which reduces the reliability of the cell's programming current. PCM is still more durable than flash memory (around 10^5^ cycles) but worse than DRAM (10^15^). This endurance property is a key consideration when integrating PCM into the memory hierarchy. Another factor affecting write endurance is the optimal RESET current: according to [Xue et al.(2011)](https://reurl.cc/e83WzW) and [Kim and Ahn (2005)](https://reurl.cc/OqAEMD), using a RESET current that deviates from the optimal value leads to quick degradation of the memory cell, thus decreasing its endurance.

### 3.2 Integration Options

As noted above, the write endurance and high write latency are the main issues of PCM, so when it is placed in the memory
hierarchy, some positions can naturally be ruled out (e.g., the first levels of the processor cache; see [Mittal et al. 2015b](https://reurl.cc/Q3LWaM)).

<img style = "display:block; margin:auto;" src="https://i.imgur.com/dUC4K9P.png"></img>

#### 3.2.1 PCM in Main Memory

##### Figure 3

1. as a replacement of the DRAM
2. horizontally at the same level as the DRAM
3. vertically with the DRAM

When PCM sits at the same level as DRAM, data placement must be addressed. Ideally, write-intensive workloads/data go to DRAM, while important data that must not be lost goes to NVM; NVM can also be used to reduce power consumption. For example, [Wei et al.(2015)](https://reurl.cc/r8D3qx) place file system metadata in NVM (a scheme called FSMAC) to improve efficiency, while NOVA goes a step further, keeping index structures in DRAM while both data and metadata are stored in NVM. PCM can also be used for checkpointing main memory and for fault tolerance.

* Placement policies:
    1. specific API
        * [Kannan et al. (2016)](https://reurl.cc/j51QO1)
    2. OS
        * [Lee et al. (2014)](https://reurl.cc/Grx5Ov)
        * [Salkhordeh and Asadi (2016)]()
        * [Wei et al. (2015)]()
    3. memory controller
        * [Dhiman et al. (2009)]()
        * [Qureshi and Srinivasan (2009)](https://reurl.cc/Xkj9be)
        * [Lee et al. (2009)]()

When DRAM acts as a cache for PCM, the biggest advantage is that DRAM absorbs a large share of the write operations, compensating for PCM's weakness. [Qureshi and Srinivasan (2009)](https://reurl.cc/Xkj9be) show that a small DRAM buffer combined with PCM storage reduces page faults by 5x, speeds up the system by 3x, and also increases PCM lifetime considerably (3x).

PCM performance depends on the write patterns. [Awad et al. (2016)](https://reurl.cc/ygrl9M) point out that PCM integration calls for major changes to OS mechanisms (e.g., page prefetching and page replacement policies). [Wu et al. (2016)](https://reurl.cc/VXDqAn) use a small part of the DRAM to accelerate NVM accesses while placing the other part horizontally alongside PCM. [Park et al.
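(2015)](https://reurl.cc/9XGQj8) attempt to replace DRAM with PCM through LPDDR2 and try to optimize the row buffer.

#### 3.2.2 PCM in Storage Systems

The most common point of comparison is NAND flash memory (really the only comparable technology), and PCM clearly outperforms it thanks to its ability to perform byte-wise random access and direct in-place updates.

PCM-based storage systems already have prototypes: [Akel et al.(2011)](https://reurl.cc/GrxRO3) present Onyx, a PCM-based storage array interfaced in PCIe, which was shown to outperform state-of-the-art SSDs for some specific workload patterns (irregular and read dominated).

* Many studies focus on the vertical integration of PCM within storage systems. [Sun et al.(2010)](https://reurl.cc/bR2j13) use PCM as a cache for the SSD, with the following benefits:
    1. It absorbs log updates, which addresses flash memory's write amplification weakness and extends SSD lifetime.
    2. Performance improves considerably, especially for write operations, because PCM supports in-place updates and is therefore fast.
    3. Energy also improves: PCM simply consumes less power than DRAM.
    4. [Liu et al. (2011)](https://reurl.cc/bR28Q6) PCM also removes the need for the SSD, on an unexpected power failure, to save data currently in transition (in temporary buffers) to the non-volatile NAND media, which prevents data loss.

    Many storage systems use PCIe as their interface, and some authors point out that this will be a major performance bottleneck (in terms of latency, for instance).
* From the OS point of view, many layers of the storage system software stack were devised with HDD in mind, so some microseconds of latency can be tolerated. Once NVMs are added, many aspects must be reconsidered (e.g., how to abstract the underlying hybrid architecture for the OS, how to connect the different storage systems, and which protocols and policies to use). [Park et al.(2010)](https://reurl.cc/XkjvGj) propose two possible integrations of PCM in Linux (the paper describes how to integrate hybrid memory systems in Linux):
    1. Having both DRAM and PCM in main memory and performing data placement according to the I/O type. This approach saves power (compared to a full DRAM implementation). Page placement is decided according to the segments of the running program.
    2. As complementary storage with traditional storage systems. Here, data placement is based on putting small-sized random accessed data in the PCM.

### 3.3 Open Research Questions

This part mainly discusses the problems PCM faces when replacing DRAM and their solutions.

#### 3.3.1 Optimizing the Write Latency

Write latency stems from the PCM material itself, so optimizing it likely means optimizing at other levels:

1. Reducing the written bits (only the bits that need to change are modified) [Dong et al. (2015)](https://reurl.cc/R1X11z)
2. Exploiting SET and RESET operation properties to increase the level of parallelism of the written bits [Yue et al. (2013)](https://reurl.cc/d5W5XD)
    * This exploits the speed and power differences between writing a 0 and writing a 1. Writing a one takes longer than writing a zero but draws less current.
    * The write is split into stages: in the write-0 stage, all zeros are written at an accelerated speed, and in the write-1 stage, all ones are written with increased parallelism, without violating the power constraints.
    * A new coding scheme further speeds up the write-1 stage by increasing the number of bits that can be written to PCM in parallel.
3. Hiding the write latency for read operation optimization (already achieved): when a read operation arrives, long write operations are paused and resume only after the read completes [Qureshi et al. (2010)](https://reurl.cc/14m4eD)

#### 3.3.2 Optimizing the Endurance

The biggest problem when replacing DRAM is write endurance, but many techniques already address it. Solutions for optimizing the endurance fall into three subcategories:

1. Improving endurance by modifying the electrical characteristics: [Jiang et al. (2012)](https://reurl.cc/XkVk1R) reduce the RESET current to increase PCM lifetime.
2. Minimizing the number of write operations: inserting a DRAM buffer to absorb frequent writes, avoiding the writing of unmodified memory bits, avoiding the writing of useless data from the cache, and data compression [Sun et al. (2011)](https://reurl.cc/e8ODGQ)
3. PCM endurance can be improved through wear leveling techniques. There are two main classes of wear leveler:
    * those that rely on memory block states, switching hot (frequently updated) and cold data.
    * wear levelers that randomly swap data independently from their access pattern, doing so periodically.

#### 3.3.3 Write Disturbance

If the inter-cell distance in PCM is too small, the high temperature required by a RESET operation can spread through thermal conduction to neighboring cells and inadvertently put them into the SET state. This effect could limit the technology scaling under 20 nm. Current solutions:

1. imposing a minimal inter-cell space
2. designing write strategies and data encoding techniques

#### 3.3.4 Energy Saving

PCM's biggest advantage is that it consumes almost no power when idle (compared to DRAM), but a write costs more energy than in DRAM. There are two ways to address this:

1. Reduce the number of writes
2. Reduce the energy spent per write

To achieve the first, one can avoid updating all bits and avoid unnecessary writes. For the second, one can exploit the write asymmetry (SET and RESET) to reduce energy, because RESET costs much more energy than a SET operation. In MLC PCM, the writing of a cell line begins with a RESET operation followed by a variable number of SET operations. [Jiang (2014)](https://reurl.cc/m930AM) recommends scheduling the RESET operations of different cells at different times as much as possible, which reduces the size of the charge pumps (the circuits that supply the RESET energy); smaller pumps waste less energy. This optimization relies on the fact that the number of SET operations to perform varies with the target value; as a result, some RESET operations can easily be shifted in time without affecting the duration of the cell line write.

## 4 MRAM

MRAM (Magnetic or Magnetoresistive RAM) is currently one of the more promising and mature NVM technologies, with very attractive properties: high density, fast read, low leakage, and high endurance. STT-RAM is a variant of MRAM.

### 4.1 Basic Concepts

#### 4.1.1 MRAM Technology

MRAM relies on the magnetic properties of special materials. Information is carried by a [Magnetic Tunnel Junction (MTJ)](https://reurl.cc/Q3baZ9) rather than by electric charge. The magnetic orientation can be sensed or controlled using electrical signals.

The MTJ implements the storage function with a sandwich structure: two ferromagnetic layers separated by an oxide (tunnel) barrier layer. One ferromagnetic layer has a fixed magnetization direction and is called the <font color =red> reference (or pinned) layer</font>; the other, whose direction can change, is called the <font color =red>free layer</font>.

* If the two layers of the MTJ have the same magnetization direction, the junction is in the low-resistance state, called the parallel state, which represents the logical 0. If the directions are opposite, it is in the high-resistance state, called the anti-parallel state, which represents the logical 1.

<img style = "display:block; margin:auto;" src="https://i.imgur.com/DufNAfB.png"></img>

[Kultursay et al. (2013)](https://reurl.cc/0OXa3A) Spin-Transfer Torque RAM (STT-RAM)

The direction of the free layer is set through the bit line (BL) and the source line (SL): applying a large positive voltage difference between BL and SL writes a 0, and a negative one writes a 1. To guarantee the switch, the current must be maintained for a period (2 ns to 12 ns) called the write pulse width, with an amplitude corresponding to the threshold current (100 μA to 1000 μA).

STT-RAM differs from the original MRAM: the original MRAM uses an external magnetic field to change the orientation, whereas STT-RAM uses a current of polarized electrons. A read applies a small voltage between SL and BL so that a current flows through the MTJ; its value depends on the present resistance of the MTJ (V = IR), and comparing it against a reference current yields 0 or 1.

#### 4.1.2 MRAM Endurance

##### The endurance of STT-RAM is about 10^15^ write cycles, but several issues remain:

1. MTJ cells still suffer from heat-related problems, which can cause data loss, so shrinking the process further requires overcoming this issue.
2. High write currents stress the memory cells, degrading the junction and eventually compromising its integrity.
3. Read operations also need a current, which can alter the magnetization as well (read disturb).

* The write current is proportional to the MTJ area, so shrinking the cell naturally lowers the write current. The read current, however, must remain large enough for the sense amplifiers to detect, so there is a trade-off.

### 4.2 Integration Options

<img style = "display:block; margin:auto;" src="https://i.imgur.com/BHvEmuK.png"></img>

Although STT-MRAM write endurance is on par with DRAM and SRAM, that is only the ideal value. Endurance can be improved in some ways (e.g., compromising on the retention time), but even then STT-RAM write latency remains too slow compared with DRAM and SRAM. Most recent research targets the primary memory or cache memory level.

#### 4.2.1 MRAM in On-Chip Cache

##### Two ways to integrate it into the cache:

1. in all levels of the cache hierarchy
2. in the last levels of the cache (LLC)

The second option exists mainly because, as mentioned, STT-RAM write latency is too slow to sustain the access rate of a first-level cache. Although write latency is a major drawback, STT-RAM consumes very little energy. [Senni et al. (2014)](https://reurl.cc/14mW3W) and [Komalan et al.(2014)](https://reurl.cc/m93Qrl) report that, under the same tested workloads, performance is not much worse than SRAM while energy consumption drops substantially. (Static consumption is greatly reduced, but dynamic energy is higher, presumably because writes take longer.) In that work the L1 instruction cache is replaced by STT-RAM, and the miss status holding registers are extended to cope with the NVM write latency.

* Multi-level caches can mitigate read disturbance issues. With STT-RAM as the L2 cache, [Wang et al. (2015)](https://reurl.cc/D632gQ) propose a Read-And-Restore strategy: after an L1 miss, a restore buffer refreshes the source L2 block. Based on the observation that read disturbance occurs when the read current is opposite to the write current, they also build a two-read sequence to detect disturbed memory cells and correct them. Both techniques mainly improve performance and reduce energy.
* [Li et al. (2014)](https://reurl.cc/A8OZ73) envision integrating STT-RAM and SRAM in a one-level cache system. The system migrates data to the most suitable location according to its access pattern (mainly to reduce write operation latency), and a compiler-assisted approach is proposed for minimizing the migration overhead.
* [Wang et al. (2015)](https://reurl.cc/D632gQ) propose using STT-RAM as a scratchpad memory for real-time applications.

##### MRAM in Last-Level Cache (LLC)

* Most LLC work on MRAM concerns STT-RAM. [Wu et al.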
(2009)](https://reurl.cc/WL19vx) evaluate many different architectures and classify them into two types of hybrid caches:
    * inter-cache level hybrid cache architectures (LHCA)
    * intra-cache or region-based hybrid cache architecture (RHCA)
* In LHCA, each level is built from a single memory technology (vertical integration); RHCA is the opposite (horizontal integration). The study shows that, compared with a three-level SRAM cache, LHCA improves IPC (instructions per cycle) by 7%, and under the same area constraint RHCA improves IPC by 12%.
* [Wang et al. (2014)](https://reurl.cc/N6ReNn) detect the access pattern at the LLC to try to reduce write operations.
* [Samavatian et al. (2014)](https://reurl.cc/bREpqy) integrate STT-RAM into a GPU (as L2 cache) and exploit the properties of STT-RAM to compensate for the high energy and latency of write operations.

#### 4.2.2 MRAM in Main Memory

[Kultursay et al. (2013) paper](https://reurl.cc/14mRdG)
[Kultursay et al. (2013) ppt](https://reurl.cc/Z7bx0Q)

This paper seems important and takes up quite a lot of space in the survey. The authors propose to completely replace the main memory with STT-RAM, citing two drawbacks of DRAM: scalability and high power consumption.

* The authors first compare DRAM with non-optimized STT-RAM, and DRAM still comes out ahead because of STT-RAM's write operation latency and energy consumption. They then propose two optimizations:
    1. A decoupled structure of sense amplifiers and row buffers (which are used to select the memory cells to write), reducing the number of write operations and thus the latency and energy.
    2. Bypassing the row buffer on write operations, mainly because writes hit in the row buffer far less often than reads.

    With both optimizations, results on a given set of workloads and a multicore processor show that STT-RAM achieves performance comparable to DRAM while significantly reducing energy consumption (about 60%).
* [Yang et al. (2013)]() instead integrate STT-RAM horizontally with DRAM and place frequently written data in DRAM to reduce energy consumption.
* Although there are ways for STT-RAM to replace DRAM, the MRAM chip is not compatible with current DRAM interfaces, so [Wang et al. (2014)]() propose an LPDDR3-compatible MRAM architecture.
* [Jin et al. (2014)]() draw the reader's attention to the size of an STT-RAM cell, which may hinder density improvements, and note that the write power of STT-RAM is too high for many applications. The authors propose modifying some critical parameters (e.g., thermal stability factor, critical current, retention time) to reduce the STT-RAM cell energy, making it a better replacement for DRAM.

#### 4.2.3 MRAM in Storage Systems

Both PCM and STT-RAM integration in storage systems were investigated by [Lee et al.
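(2014)](). This study highlights the effect of integrating these NVMs on the kernel I/O software stack, and also analyzes and compares the efficiency of kernel mechanisms (such as read-ahead) and access methods (such as direct and synchronous I/Os).

[Kang et al. (2015)](https://reurl.cc/Z718RW) mainly study the vertical integration of NVM. The study focuses on PCM, but the techniques also work on STT-RAM. The main idea is that MRAM can serve as a cache in storage systems; since most data in the cache is replaced shortly after being inserted, they propose to use retention relaxation for data kept in the NVM.

### 4.3 Open Research Questions

#### 4.3.1 Relaxing Non-volatility

* (Smullen et al. 2011; Jog et al. 2012; Sun et al. 2011; Guo et al. 2010): the idea is to use STT-RAM regions (horizontal integration) or levels (vertical integration) with different retention periods, which leads to different write latency and energy characteristics.

#### 4.3.2 Minimizing or Avoiding Writes

* The main idea is to reduce the number of bits updated on the STT-RAM. This can be done by reading the data before writing it and modifying only the bits that differ, or by performing the comparison in a cache before the update.

#### 4.3.3 Improving the Memory Lifetime

* Improving the memory lifetime by using wear-leveling: wear-leveling techniques extend memory lifetime while coping with the temporal and spatial locality of workloads.

#### 4.3.4 Addressing 0/1 Asymmetry

* Because MTJ switching time is asymmetric, a 1-to-0 transition takes longer than a 0-to-1 transition, so presetting all cells to 0 before the operation reduces the effective write latency.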
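
As a rough illustration of the argument in 4.3.4, the sketch below models the write latency of one word with and without presetting. It is only a toy model under stated assumptions: the latency constants `T_0_TO_1` and `T_1_TO_0` and both function names are hypothetical and not taken from the survey, all bits of a word are assumed to switch in parallel (so the slowest required transition sets the latency), and the preset to 0 is assumed to happen off the critical path.

```python
# Hypothetical per-transition latencies in nanoseconds (illustrative only).
T_0_TO_1 = 5.0   # assumed faster transition
T_1_TO_0 = 10.0  # assumed slower transition, as stated in 4.3.4


def write_latency(old: int, new: int, width: int = 8) -> float:
    """Latency of overwriting `old` with `new` in a `width`-bit word.

    Assumes all bits switch in parallel, so the slowest transition
    that is actually needed determines the latency.
    """
    mask = (1 << width) - 1
    changed = (old ^ new) & mask                # bits whose value differs
    needs_0_to_1 = bool(changed & new)          # changed bits that end at 1
    needs_1_to_0 = bool(changed & ~new & mask)  # changed bits that end at 0
    latency = 0.0
    if needs_0_to_1:
        latency = max(latency, T_0_TO_1)
    if needs_1_to_0:
        latency = max(latency, T_1_TO_0)
    return latency


def preset_write_latency(new: int, width: int = 8) -> float:
    """Latency when the word was preset to all zeros beforehand,
    so only 0-to-1 transitions remain on the critical path."""
    return write_latency(0, new, width)


if __name__ == "__main__":
    old, new = 0b11001100, 0b10101010
    print("in-place overwrite:", write_latency(old, new))
    print("after preset to 0: ", preset_write_latency(new))
```

In this model, an in-place overwrite that needs at least one 1-to-0 transition pays the slower latency, while the same write after a preset pays only the faster one. The cost of the preset itself (extra writes, energy, and wear) is deliberately left out of the model.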