# [Paper] RC-NVM: Dual-Addressing Non-Volatile Memory Architecture Supporting Both Row and Column Memory Accesses
###### tags: `research-GraphRC`
## Slides
[link](https://docs.google.com/presentation/d/1Nod-33jaI3vAFOe47u_qGkhPJmi6bO7PC4ZuV98r6m4/edit?usp=sharing)
## What is NVM ?
- Adv: non-volatility, high storage density, and low standby power
- Most NVMs are based on simple two-terminal switching elements, e.g. ReRAM (also called RRAM), MRAM, PCM, and 3D XPoint
- Possible substitute for DRAM
## DB Access pattern
- OLTP (On-Line Transactional Processing)
- ==Row memory access==
- Mix of reads and writes to a **few rows** at a time, which are often **latency-critical**
- For **transactional processing**
- OLAP (On-Line Analytical Processing)
- ==Column memory access==
- Bulk sequential scans spanning a **few columns**, such as computing the **sum of a specific column**
- For **data warehouse** system
- OLXP (Hybrid)
- IMDBs make it possible to serve the hybrid OLXP access pattern
- SQL example (figure omitted): e.g., an OLTP query updates a few rows of a table, while an OLAP query aggregates over a single column
## What is IMDB ?
- IMDB ( In-Memory Database ) is a database system that stores a significant part, if not the entirety, of data in **main memory** to achieve high performance.
- Explicit separation between OLTP and OLAP suffers from
- Waste of memory (double copy)
- Low data freshness for real-time (data consistency)
## Solution to OLXP access pattern (previous work)
- SW solution
- Hybrid layout: hot tuples in row-major (for OLTP) layout, cold tuples in column-major (for OLAP) layout
- Disadv: **Overhead** needed to transfer tuples between two different formats and maintain the **consistency** of the whole table
- HW solution
- **GS-DRAM** accelerates strided accesses by allowing the memory controller to fetch multiple values of a strided access pattern from different chips with a single read/write command
- Disadv: not flexible
- **RC-DRAM** (dual-addressing DRAM) adds one more transistor per cell to traditional DRAM
- Disadv: Area overhead
- **Ideal solution**: a crossbar-architecture NVM that supports RC access with high density
## RC-NVM architecture
- I made a slide to clearly demonstrate the RC-NVM architecture. You can find it [here](https://docs.google.com/presentation/d/1Nod-33jaI3vAFOe47u_qGkhPJmi6bO7PC4ZuV98r6m4/edit?usp=sharing)
- This is a **1T1R crossbar NVM** architecture with the following advantages:
- High integration density
- 3D stacking
- The symmetry of the crossbar is used to implement **dual-addressing memories**
### Hardware overview (assumption)
Note that this is the assumed hardware architecture of RC-NVM based on the paper. This configuration is based on MY OWN assumption and might not be aligned with the original design.
- 1 bit for Channel
- 2 Channels
- 2 bits for Rank
- 4 Ranks in 1 Channel
- 3 bits for Bank
- 8 Banks in 1 Rank
- 3 bits for Subarray
- 8 Subarrays in 1 Bank
- 10 bits for Row, 10 bits for Column
- 1024 Mats in a Subarray
- 3 bits for Bytes
- 8 Bytes in 1 Mat
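The assumed address breakdown above can be sketched as a small decoder. The field order and widths here follow MY OWN assumed configuration, not anything specified in the paper:

```python
# Hypothetical decoder for the assumed 32-bit RC-NVM address layout.
# Field order (MSB first) and widths are my own assumption.
FIELDS = [
    ("channel", 1),
    ("rank", 2),
    ("bank", 3),
    ("subarray", 3),
    ("row", 10),
    ("column", 10),
    ("byte", 3),
]  # widths sum to 32 bits

def decode(addr: int) -> dict:
    """Split a 32-bit physical address into the assumed fields."""
    out, shift = {}, 32
    for name, width in FIELDS:
        shift -= width
        out[name] = (addr >> shift) & ((1 << width) - 1)
    return out

print(decode(0xFFFFFFFF))  # every field at its maximum value
```

With 10 row bits and 10 column bits, the decoder confirms that a subarray spans a 1024 x 1024 grid of 8-byte mat positions under this assumed layout.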
### RC-NVM Bank
- > A set of adjacent mats in a bank is organized into a single entity called **subarray**
- > A bank contains multiple subarrays which can provide some parallelism within the same bank, especially in RC-NVM based memory.
- > A **logic subarray** in the **logic bank** is the basic access unit of both row-oriented and column-oriented accesses.
### RC-NVM Logical Bank
- > A **logic bank** consists of a set of banks from every chip **in a rank**, which are **operated in lockstep**
- The minimum **directionless granularity is 8 bytes**, which is exactly the data transferred at a time on a **64-bit memory bus**
- This **8 bytes** comes from **8 chips**, each of which provides **8 bits** from eight independent **mats** located in a **subarray**
- Then, a typical row-oriented (column-oriented) 64-byte cache line is composed of eight 8-byte units on a row (column) in a logic sub-array.
- The data on the **intersection of the row and column** may be duplicated in both row and column buffer (**coherence issue**)
- Restrict that the row and column buffer **cannot be activated for the same subarray** simultaneously
- If a **switch** between row and column accesses **in a subarray** occurs
- The controller needs to first precharge the active buffer and **flush the data back**, and then activate the other buffer
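The buffer-switch rule above can be sketched as a tiny state machine. This is my own model of the described behavior (names and the returned operation strings are made up for illustration):

```python
# Sketch of the row/column buffer switch rule in one subarray: only one
# orientation may be active at a time, so switching orientation forces a
# precharge + write-back before the other buffer is activated.
class Subarray:
    def __init__(self):
        self.active = None  # None, "row", or "column"

    def access(self, orientation: str) -> list:
        ops = []
        if self.active is not None and self.active != orientation:
            # flush the currently active buffer back to the crossbar array
            ops.append("precharge+flush " + self.active)
            self.active = None
        if self.active is None:
            ops.append("activate " + orientation)
            self.active = orientation
        ops.append("read/write via " + orientation + " buffer")
        return ops

sa = Subarray()
sa.access("row")     # activate row buffer, then access
sa.access("row")     # buffer hit: access only
sa.access("column")  # flush row buffer first, then activate column buffer
```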
### RC-NVM Chip
- Considering that adjacent data tend to have the **same access pattern** in terms of orientation, two successive banks in the address space are arranged along a **diagonal**
- This minimizes buffer conflicts between banks and preserves bank-level parallelism
### Modern DRAM architecture
- Detail [slides](http://www.softnology.biz/pdf/lecture05-dram.pdf)
#### Hierarchy
- Channel: Single/Dual memory channel
- DIMM (Dual In-line Memory Module)
- ==Rank==: rank 0 (front), rank 1(back)
- 1 rank delivers **8 bytes**
- Chip: 1 rank = 8 chips
- Each chip delivers **1 byte**
- ==Bank==: 1 chip = 8 banks
- Each bank delivers **1 byte**
- Subarray: a group of Mats that share a global wordline (GWL)
- Mat: the basic tile of the array, a small 2D grid of memory cells
## RC-NVM request scheduling
- To exploit extra column buffers, **different subarrays in a bank can be activated simultaneously**
- Avoid issuing row and column accesses to the **same subarray simultaneously**, to avoid the data synonym problem
- Typically, each **bank** has a separate queue
- The scheduler consists of two levels of arbiters
- Lv1: Activation arbiter
- Lv1: Read/write arbiter
- Shared by all banks
- Lv2: Address arbiter
- Grants the shared address resources
- Lv1 arbiters: ordered scheduling policy
- Lv2 arbiter: prioritizes operations from the read/write arbiter over others
- Instead of one queue per bank, a set of banks **sharing row/column buffers** are grouped together to coordinate request scheduling
- **Four adjacent banks** share a request queue
- To support RC-NVM request scheduling, three new fields, **orientation (O), bank ID (Bank) and subarray ID (Subarray)**, are added to the queues
- (O): row / column access
- (Bank): identifies the target bank, since a schedule queue is shared by 4 adjacent banks
- (Subarray): when two requests target the same subarray, they must be served sequentially; this is called a **subarray conflict**
- Treat row and column accesses to the **same subarray** as **regular subarray conflicts**
- **Data synonym problem** in a subarray can be solved by serializing all row- and column-oriented accesses to the same subarray
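The conflict rule above can be sketched with the three queue fields. The `Req` record and `conflicts` helper are my own illustration of the described semantics:

```python
# Sketch of subarray-conflict detection in a shared request queue: two
# requests conflict iff they target the same bank and subarray; a row
# access and a column access there are serialized the same way, which
# also resolves the data synonym problem.
from collections import namedtuple

Req = namedtuple("Req", "orientation bank subarray")  # the new queue fields

def conflicts(a: Req, b: Req) -> bool:
    return a.bank == b.bank and a.subarray == b.subarray

r1 = Req("row", bank=0, subarray=3)
r2 = Req("column", bank=0, subarray=3)  # same subarray: must serialize
r3 = Req("column", bank=0, subarray=5)  # different subarray: may overlap
assert conflicts(r1, r2) and not conflicts(r1, r3)
```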
## RC-NVM address space
- Traditional DRAM is 1D, and the normal computer address space is also 1D
- RC-NVM is 2D, so it needs to be mapped onto the 1D computer address space
- For the same data (location) in RC-NVM, the only difference is the order of the row bits and column bits in the total 32-bit address.
- When a row-oriented address is increased, the column bits increase, which corresponds to scanning along a physical row. Similarly, increasing a column-oriented address corresponds to scanning along a column
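The two orderings can be illustrated with a toy mapping. The 10-bit field widths follow the hardware assumption earlier in this note:

```python
# Toy illustration of the dual address mappings: the same cell has a
# row-oriented address (row bits above column bits) and a column-oriented
# address with the two 10-bit fields swapped.
ROW_BITS = COL_BITS = 10

def row_oriented(row: int, col: int) -> int:
    return (row << COL_BITS) | col   # incrementing scans along a row

def col_oriented(row: int, col: int) -> int:
    return (col << ROW_BITS) | row   # incrementing scans along a column

assert row_oriented(5, 7) + 1 == row_oriented(5, 8)  # next cell in row 5
assert col_oriented(5, 7) + 1 == col_oriented(6, 7)  # next cell in column 7
```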
## Page table
- Huge-page technique
- **1 GB page size**
- The lower 30 bits of the VA pass through to the PA as the page offset
- Reduce TLB miss
- The basic access unit is a **subarray** for both row-oriented and column-oriented accesses
- As long as the subarray bits, which include the row and column bits, are allocated inside the **30 least significant bits**, the application can always explicitly control the data to be accessed
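A minimal sketch of this 1 GB huge-page translation (the page-table shape and entries here are hypothetical):

```python
# Sketch of 1 GB huge-page translation: the upper VA bits go through the
# page table, while the low 30 bits -- which contain the subarray, row,
# and column fields -- pass through unchanged, so the application keeps
# explicit control over which data is accessed.
PAGE_SHIFT = 30  # 1 GB pages

def translate(va: int, page_table: dict) -> int:
    vpn = va >> PAGE_SHIFT
    offset = va & ((1 << PAGE_SHIFT) - 1)  # app-controlled layout bits
    return (page_table[vpn] << PAGE_SHIFT) | offset

pt = {0x1: 0x7}  # hypothetical VPN -> PFN entry
assert translate((0x1 << 30) | 0x1234, pt) == (0x7 << 30) | 0x1234
```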
## Cache architecture
- **Cache synonym**: Different VAs can map to the same PA
- Different virtual pages can share the same physical frame
- Shared libraries, copy-on-write
- Dual Addresses: Data in RC-NVM can be accessed with two different addresses, two copies of **every 8 bytes** may exist in the cache simultaneously
- 1 extra orientation bit per cache line is added
### Cache synonym in a single-core processor
- We add one **extra status bit** for each 8 bytes (i.e., the granularity of data synonym) to indicate whether its duplicated copy is cached
- For a **64-byte cache block**, **8 extra status bits** are needed, called **crossing bits**
- Methodology: Keep duplicated data updated at the same time. Extra operations are required
- Steps:
- When a cache block is loaded into the cache, the cache controller needs to check whether any of the cache blocks that may cross with it already exist in the cache
- e.g., when a row-oriented cache block is loaded, the 8 column-oriented cache blocks crossing it are checked
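The set of candidate blocks can be computed directly. The 8-unit geometry follows the 64-byte-line / 8-byte-unit description above; the alignment assumption and naming are my own:

```python
# Sketch of the crossing-block check: a row-oriented 64-byte line holds
# eight 8-byte units, and each unit belongs to exactly one column-oriented
# line (assumed 8-row aligned), so 8 candidate blocks are probed on a fill.
UNITS = 8  # 8-byte units per 64-byte cache line

def crossing_column_blocks(row: int, col_block: int) -> list:
    """Column-oriented lines that may cross the row-oriented line
    covering units (row, col_block*8 .. col_block*8 + 7)."""
    row_block = row // UNITS  # assumed alignment of column-oriented lines
    return [("col", row_block, col_block * UNITS + i) for i in range(UNITS)]

blocks = crossing_column_blocks(row=10, col_block=2)
assert len(blocks) == 8 and blocks[0] == ("col", 1, 16)
```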
### Cache synonym and coherence in a multi-core processor
- ==Cache synonym is always solved first, then cache coherence protocols are applied==
- Example
- Whenever a **write** operation happens, the **crossed cache blocks are updated** accordingly (cache synonym solved)
- After that, cache coherence protocols start to work to keep data consistent across multiple cores and memory levels
- Note that the ==cache coherence operations only involve the cache blocks in the same address space== (either row-oriented or column-oriented)
- In the proposed cache-synonym solving protocol, a cache read incurs no extra overhead; **the overhead of a write operation is moderate**
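The write path described above can be sketched as follows. This is an assumed model of the ordering (synonym first, then coherence); the cache representation is a plain dict for illustration:

```python
# Sketch of a write in the multi-core case: the data synonym is resolved
# first by updating any cached crossed copy, and only then does the normal
# coherence protocol propagate the write to other cores.
def write(cache: dict, addr, crossed_addr, value):
    cache[addr] = value
    if crossed_addr in cache:        # crossing bit set: duplicate is cached
        cache[crossed_addr] = value  # cache synonym solved before coherence
    # ...then the coherence protocol (e.g. MESI) runs, involving only
    # cache blocks in this block's own address space (row- or column-oriented)

c = {("row", 0x10): 0, ("col", 0x90): 0}
write(c, ("row", 0x10), ("col", 0x90), 42)
assert c[("col", 0x90)] == 42  # crossed copy updated in the same step
```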
### Parallel check mechanism
- Note that the HighR and HighC parts are swapped in the column-oriented address mapping to keep the same order as the row-oriented address mapping
- With this address mapping scheme, **all potential blocks that may cross with a particular cache block will be cached in the same set due to the same index**