# [Paper] RC-NVM: Dual-Addressing Non-Volatile Memory Architecture Supporting Both Row and Column Memory Accesses

###### tags: `research-GraphRC`

## Slides

[link](https://docs.google.com/presentation/d/1Nod-33jaI3vAFOe47u_qGkhPJmi6bO7PC4ZuV98r6m4/edit?usp=sharing)

## What is NVM ?

- Adv: non-volatility, high storage density, and low standby power
- Most NVMs are based on simple two-terminal switching elements, e.g. ReRAM, MRAM, PCM, 3D XPoint
- Possible substitute for DRAM

## DB Access pattern

- OLTP (On-Line Transaction Processing)
    - ==Row memory access==
    - Mix of reads and writes to a **few rows** at a time, which are often **latency-critical**
    - For **transactional processing**
- OLAP (On-Line Analytical Processing)
    - ==Column memory access==
    - Bulk sequential scans spanning a **few columns**, such as computing the **sum of a specific column**
    - For **data warehouse** systems
- OLXP (Hybrid)
    - IMDBs make it possible to process the hybrid OLXP access pattern
    - SQL example
    - ![](https://i.imgur.com/cB6dDGR.png)

## What is IMDB ?

- IMDB (In-Memory Database) is a database system that stores a significant part, if not the entirety, of its data in **main memory** to achieve high performance.
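The row- vs. column-oriented access patterns that OLTP and OLAP impose (and that an IMDB must serve from the same data) can be sketched with a toy table. This is an illustrative sketch only; the table, column names, and values are my own examples, not from the paper.

```python
# Toy table with columns (id, qty, price), stored both ways.
rows = [(1, 2, 9.5), (2, 7, 3.0), (3, 1, 4.2)]  # row-major: OLTP-friendly
cols = {"id": [1, 2, 3], "qty": [2, 7, 1], "price": [9.5, 3.0, 4.2]}  # column-major: OLAP-friendly

# OLTP: latency-critical point access to a single row.
def lookup(row_id):
    return next(r for r in rows if r[0] == row_id)

# OLAP: bulk scan of one column, e.g. SUM(qty). In the column-major
# layout this is one sequential pass; in the row-major layout the same
# scan strides across every tuple.
sum_colmajor = sum(cols["qty"])
sum_rowmajor = sum(r[1] for r in rows)

assert lookup(2) == (2, 7, 3.0)
assert sum_colmajor == sum_rowmajor == 10
```

Keeping both layouts in sync is exactly the double-copy / consistency problem that motivates RC-NVM's dual addressing.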
- Explicit separation between OLTP and OLAP suffers from
    - Waste of memory (double copy)
    - Low data freshness for real-time analytics (data consistency)

## Solution to the OLXP access pattern (previous work)

- SW solution
    - Hybrid layout: hot tuples in row-major layout (for OLTP), cold tuples in column-major layout (for OLAP)
    - Disadv: **overhead** is needed to transfer tuples between the two formats and to maintain the **consistency** of the whole table
- HW solution
    - **GS-DRAM** accelerates strided accesses by allowing the memory controller to fetch multiple values of a strided access pattern from different chips with a single read/write command
        - Disadv: not flexible
    - **RC-DRAM** (dual-addressing DRAM) adds one more transistor per cell to traditional DRAM
        - Disadv: area overhead
- **Ideal solution**: crossbar NVM that supports RC access with high density

## RC-NVM architecture

I made a slide to clearly demonstrate the RC-NVM architecture. You can find it [here](https://docs.google.com/presentation/d/1Nod-33jaI3vAFOe47u_qGkhPJmi6bO7PC4ZuV98r6m4/edit?usp=sharing)

- This is a **1T1R crossbar NVM** architecture with the following advantages
    - High integration density
    - 3D stacking
    - Symmetry of the crossbar is used to implement **dual-addressing memories**
- ![](https://i.imgur.com/ZFFB3Ca.png)

### Hardware overview (assumption)

Note that this is the assumed hardware architecture of RC-NVM based on the paper. This configuration is based on MY OWN assumption and might not align with the original design.

- 1 bit for Channel
    - 2 Channels
- 2 bits for Rank
    - 4 Ranks in 1 Channel
- 3 bits for Bank
    - 8 Banks in 1 Rank
- 3 bits for Subarray
    - 8 Subarrays in 1 Bank
- 10 bits for Row, 10 bits for Column
    - 1024 Mats in a Subarray
- 3 bits for Byte
    - 8 Bytes in 1 Mat

### RC-NVM Bank

- > A set of adjacent mats in a bank is organized into a single entity called a **subarray**
- > A bank contains multiple subarrays which can provide some parallelism within the same bank, especially in RC-NVM based memory.
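The assumed bit layout above (channel → rank → bank → subarray → row/column → byte) can be sketched as a simple address decoder. The field names, ordering, and widths follow my own assumption stated earlier, not the paper's definitive encoding.

```python
# Assumed 32-bit address layout, MSB first:
# channel(1) | rank(2) | bank(3) | subarray(3) | row(10) | column(10) | byte(3)
FIELDS = [("channel", 1), ("rank", 2), ("bank", 3), ("subarray", 3),
          ("row", 10), ("column", 10), ("byte", 3)]

def decode(addr):
    """Split a 32-bit address into the assumed fields, MSB first."""
    out, shift = {}, 32
    for name, width in FIELDS:
        shift -= width
        out[name] = (addr >> shift) & ((1 << width) - 1)
    return out

# The all-ones address selects the last byte of the last mat.
assert decode(0xFFFFFFFF) == {"channel": 1, "rank": 3, "bank": 7,
                              "subarray": 7, "row": 1023,
                              "column": 1023, "byte": 7}
```

The widths sum to exactly 32 bits, matching the 32-bit addresses used in the address-space discussion later in the note.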
- > A **logic subarray** in the **logic bank** is the basic access unit of both row-oriented and column-oriented accesses.
- ![](https://i.imgur.com/TtPQMVm.png)

### RC-NVM Logical Bank

- > A **logic bank** consists of a set of banks from every chip **in a rank**, which are **operated in lockstep**
- ![](https://i.imgur.com/I43jamt.png)
- The minimum **directionless granularity is 8 bytes**, which is exactly the data transferred simultaneously on a **64-bit memory bus**
    - These **8 bytes** come from **8 chips**, each of which provides **8 bits** from eight independent **mats** located in a **subarray**
- Then, a typical row-oriented (column-oriented) 64-byte cache line is composed of eight 8-byte units on a row (column) in a logic subarray.
- The data on the **intersection of the row and column** may be duplicated in both the row and column buffer (**coherence issue**)
    - Restriction: the row and column buffer **cannot be activated for the same subarray** simultaneously
    - If a **switch** between row and column accesses **in a subarray** occurs, the controller first precharges the active buffer and **flushes the data back**, and then activates the other buffer

### RC-NVM Chip

- ![](https://i.imgur.com/T3Gkocw.png)
- Considering that adjacent data tend to have the **same access pattern** in terms of orientation, two successive banks in the address space are arranged on a **diagonal**
    - Buffer conflicts between banks are minimized, preserving bank parallelism

### Modern DRAM architecture

- ![](https://i.imgur.com/mzsQULb.png)
- Detailed [slides](http://www.softnology.biz/pdf/lecture05-dram.pdf)

#### Hierarchy

- Channel: single/dual memory channel
- DIMM (Dual In-line Memory Module)
- ==Rank==: rank 0 (front), rank 1 (back)
    - 1 rank delivers **8 bytes**
    - ![](https://i.imgur.com/JKviJVM.png)
- Chip: 1 rank = 8 chips
    - Each chip delivers **1 byte**
- ==Bank==: 1 chip = 8 banks
    - Each bank delivers **1 byte**
- Subarray: mats that share a GWL (global wordline)
- Mat:
![](https://i.imgur.com/oQ1iZaR.png)

#### Diagram

- ![](https://i.imgur.com/H9T2eMo.png)

## RC-NVM request scheduling

- To exploit the extra column buffers, **different subarrays in a bank can be activated simultaneously**
- Row and column accesses are never issued to the **same subarray simultaneously**, which prevents the data synonym problem
- Typically, each **bank** has a separate queue
- ![](https://i.imgur.com/0R7Mdzh.png)
- The scheduler consists of two levels of arbiters
    - Lv1: activation arbiter
    - Lv1: read/write arbiter
        - Shared by all banks
    - Lv2: address arbiter
        - Grants the shared address resources
    - Lv1 arbiters: ordered scheduling policy
    - Lv2 arbiter: prioritizes operations from the read/write arbiter over others
- ![](https://i.imgur.com/T3Gkocw.png)
- Instead of one queue per bank, a set of banks **sharing row/column buffers** are grouped together to coordinate request scheduling
    - **Four adjacent banks** share a request queue
- To support RC-NVM request scheduling, three new fields, **orientation (O), bank ID (Bank) and subarray ID (Subary)**, are added to the queues
    - (O): row/column access
    - (Bank): identifies a bank, since a schedule queue is shared by 4 adjacent banks
    - (Subarray): when two requests target the same subarray, they have to be served sequentially, which is called a **subarray conflict**
- Row and column accesses to the **same subarray** are treated as **regular subarray conflicts**
    - The **data synonym problem** in a subarray is thus solved by serializing all row- and column-oriented accesses to the same subarray

## RC-NVM address space

- Traditional DRAM is 1D, and the normal computer address space is 1D
- RC-NVM is 2D and needs to be mapped to the 1D computer address space
- ![](https://i.imgur.com/2Y9voJr.png)
- For the same data (location) in RC-NVM, the only difference is the order of the row bits and column bits in the total 32-bit address.
- When a row-oriented address is incremented, the column bits increase.
  This represents the case of scanning a physical row. Similarly, for a column-oriented address, increasing the address represents the case of scanning a column.

## Page table

- Huge-page technique
    - **1 GB page size**
    - The lower 30 bits of the VA map directly to the PA
    - Reduces TLB misses
- The basic access unit is a **subarray** for both row-oriented and column-oriented accesses
- As long as the subarray bits, which include the row and column bits, are allocated inside the **30 least significant bits**, the application can always explicitly control the data to be accessed
- ![](https://i.imgur.com/6alZlWS.png)

## Cache architecture

- **Cache synonym**: different VAs can map to the same PA
    - Different virtual pages can share the same physical frame
    - Shared libraries, copy-on-write
- Dual addresses: data in RC-NVM can be accessed with two different addresses, so two copies of **every 8 bytes** may exist in the cache simultaneously
- ![](https://i.imgur.com/rz9fxJm.png)
- 1 extra orientation bit per cache line is added

### Cache synonym in a single-core processor

- One **extra status bit** is added for each 8 bytes (i.e., the granularity of data synonym) to indicate whether its duplicated copy is cached
    - For a **64-byte cache block**, **8 extra status bits** are needed, called **crossing bits**
- Methodology: keep duplicated data updated at the same time.
  Extra operations are required
- Steps:
    - When a cache block is loaded into the cache, the cache controller needs to check whether any of the cache blocks that may cross with this one already exist in the cache
    - When a row-oriented cache block is loaded into the cache, 8 column-oriented cache blocks are checked

### Cache synonym and coherence in a multi-core processor

- ==Cache synonym is always solved first, then cache coherence protocols are applied==
- Example
    - Whenever a **write** operation happens, the **crossed cache blocks are updated** accordingly (cache synonym solved)
    - After that, cache coherence protocols start to work to keep the data consistent across multiple cores and memory levels
- Note that the ==cache coherence operations only involve the cache blocks in the same address space== (either row-oriented or column-oriented)
- In the proposed cache-synonym-solving protocol there is no extra overhead for a cache read operation; **the overhead of a write operation is moderate**
- ![](https://i.imgur.com/rz9fxJm.png)

### Parallel check mechanism

- ![](https://i.imgur.com/vUXs3hS.png)
- Note that the HighR and HighC parts are swapped in the column-oriented address mapping to keep the same order as the row-oriented address mapping
- With this address mapping scheme, **all potential blocks that may cross with a particular cache block will be cached in the same set due to the same index**
- ![](https://i.imgur.com/sY0CW9C.png)
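The same-set property above can be sketched with a simplified model. I assume (as in the hardware-overview section) 10-bit row and column coordinates, split into a 7-bit high part and a 3-bit low part, with a 64-byte block covering eight 8-byte units; the function names and the exact index formula are my own illustration, not the paper's definitive design.

```python
BLOCK = 8  # 8 units x 8 bytes = one 64-byte cache block

def row_block(row, col):
    """Row-oriented block: fixed row, the low 3 column bits vary."""
    return ("R", row, col // BLOCK)  # (orientation, coord, high part of the other coord)

def crossing_col_blocks(row, col):
    """The 8 column-oriented blocks that intersect a row-oriented block."""
    base = (col // BLOCK) * BLOCK
    return [("C", c, row // BLOCK) for c in range(base, base + BLOCK)]

def set_index(block):
    """Index built only from the HighR/HighC parts, which every crossing
    block shares; the low bits go into the tag instead. The parts are
    swapped for column-oriented blocks, mirroring the HighR/HighC swap
    in the paper's column-oriented address mapping."""
    orient, coord, high_other = block
    high_own = coord // BLOCK
    hr, hc = (high_own, high_other) if orient == "R" else (high_other, high_own)
    return (hr << 7) | hc  # 14 index bits from the high parts only

# Every block that may cross a given row-oriented block lands in its set.
rb = row_block(42, 100)
assert {set_index(b) for b in crossing_col_blocks(42, 100)} == {set_index(rb)}
```

Because the index is drawn only from bits common to a block and all of its potential crossing blocks, the crossing-bit check can probe a single cache set in parallel instead of searching the whole cache.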