# [Paper] Memory Access Optimization of a Neural Network Accelerator Based on Memory Controller
###### tags: `research-GraphRC`
[TOC]
## What is NNAMC ?
- NNAMC (Neural Network Accelerator Memory Controller) is the **DRAM controller** proposed in this paper. It attempts to improve accelerator (specifically, **CNN accelerator**) performance by targeting the following metrics:
- ==Increase the **row buffer hit ratio**==
- ==Reduce the **access latency**==
- NNAMC achieved better performance by
- Dividing the memory **banks** into groups and applying different **address mapping schemes** to different banks
- Choosing the optimal **address mapping scheme** for each stream based on previously accessed addresses
## Why we need a memory controller for AI accelerator ?
- The utilization of DRAM is related to its **access pattern**
- CNN computation generates **complicated DRAM access patterns** (e.g. convolution, pooling, fully-connected layers)
- Previous memory controllers convert physical addresses using a **single address mapping scheme**, which is NOT flexible enough for CNN accelerator
## Previous work
- Previous memory controllers focused on:
- Memory address **mapping**
- Memory address **scheduling**
- **Rearrangement** of access data
- Memory address mapping dispatches memory accesses to physical storage units based on the arrangement of **bank ID, row ID, column ID**
- Predefined optimization techniques perform poorly because access patterns are **dynamically generated**
### Commonly used address mapping
- *(figure from the paper omitted)*
## DRAM characteristic
- Structure
- Vertically, each cell is connected to a **row buffer (pink)** through a shared **bit line (blue)**
- Horizontally, each cell is connected to a **row decoder (black)** through a shared **word line (red)**
- ==**Row buffer** is used as a cache for storing row data accessed by the previous address==
- Bank hit & conflict
- ==**Row buffer hit**==: If two consecutive accesses refer to **the same row of the same bank** in a DRAM, the access latency will be low
- ==**Row buffer conflict**==: When user accesses **different rows in the same bank** of a DRAM, the previously accessed row must be **precharged** and the target access row must be **activated**. Such operation requires many cycles
- The latency of a row conflict is about **3.7x** that of a row buffer hit (DDR3-1600)
- Row buffer hit: **13.125 ns**
- Row buffer miss: **48.125 ns**
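To see why the hit ratio matters, here is a quick latency model (a Python sketch using only the two DDR3-1600 figures above; the function name is mine, not the paper's):

```python
# Effective DRAM access latency as a function of row buffer hit ratio,
# using the DDR3-1600 timings quoted above (in ns).
T_HIT = 13.125       # row buffer hit
T_CONFLICT = 48.125  # row buffer conflict (precharge + activate + read)

def avg_latency(hit_ratio: float) -> float:
    """Expected latency per access for a given row buffer hit ratio."""
    return hit_ratio * T_HIT + (1.0 - hit_ratio) * T_CONFLICT

# A conflict costs ~3.7x a hit, so raising the hit ratio pays off quickly:
# avg_latency(0.5) -> 30.625 ns, avg_latency(0.9) -> 16.625 ns
```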
### Solving row conflict
- Solving row conflict **in the same bank**
- Current row must be **closed** by a **precharge** command
- Issue an **activate** command to read another row
### Avoid row conflict
- Avoid row conflicts by placing data **in different banks**
- Two accesses to **different rows of different banks** avoid the row conflict
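As a concrete sketch of this idea, assuming the 27-bit address layout used later in the Experiment section (row [26:13], bank [12:10], column [9:0], i.e. RBC): consecutive column-sized blocks of addresses rotate through the banks, so neighboring data can keep several row buffers open at once.

```python
def rbc_bank(addr: int) -> int:
    """Bank ID under an RBC layout: row [26:13] | bank [12:10] | column [9:0]."""
    return (addr >> 10) & 0x7

# Consecutive blocks of 2**10 addresses rotate through the 8 banks, so
# back-to-back accesses to adjacent blocks land in different banks and
# avoid row conflicts:
# rbc_bank(0x000) -> 0, rbc_bank(0x400) -> 1, rbc_bank(0x800) -> 2
```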
## Example of a BAD address mapping scheme
### Assumption
- Each pixel is **16 bits**
- Input image is of size 80x60 pixels
- The addresses shown in the picture are **bit addresses**
- Use a **BRC** address mapping scheme
### BRC + CNN example
- Access pixel (79,59) in this image
- Convert (79,59) to memory address **0x126F0**
- Apply BRC address mapping scheme
| Bank ID | Row ID  | Column ID |
| ------- | ------- | --------- |
| [27:25] | [24:10] | [9:0]     |
| 0x0     | 0x49    | 0x2F0     |
- This address is mapped to **0x0 bank, 0x49 row, 0x2F0 column**
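The bit slicing above is easy to check in code (a sketch assuming the 28-bit BRC layout from the table: bank [27:25], row [24:10], column [9:0]):

```python
def brc_decompose(addr: int):
    """Split a bit address into (bank, row, column) under BRC:
    bank [27:25] | row [24:10] | column [9:0]."""
    col = addr & 0x3FF           # low 10 bits
    row = (addr >> 10) & 0x7FFF  # next 15 bits
    bank = (addr >> 25) & 0x7    # top 3 bits
    return bank, row, col

# Pixel (79, 59) -> address 0x126F0 -> bank 0x0, row 0x49, column 0x2F0
assert brc_decompose(0x126F0) == (0x0, 0x49, 0x2F0)
```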
### Why BRC is bad for CNN ?
| Input idx | Address | Bank ID | Row ID | Column ID | Conflict |
| --------- | ------- | ------- | ------ | --------- | -------- |
| (6,5)     | 0x01450 | 0x0     | 0x5    | 0x050     |          |
| (5,3)     | 0x00A40 | 0x0     | 0x2    | 0x240     | ✓        |
| (6,3)     | 0x00A50 | 0x0     | 0x2    | 0x250     |          |
| (7,3)     | 0x00A60 | 0x0     | 0x2    | 0x260     |          |
| (5,4)     | 0x00F40 | 0x0     | 0x3    | 0x340     | ✓        |
| (6,4)     | 0x00F50 | 0x0     | 0x3    | 0x350     |          |
| (7,4)     | 0x00F60 | 0x0     | 0x3    | 0x360     |          |
| (5,5)     | 0x01440 | 0x0     | 0x5    | 0x040     | ✓        |
| (6,5)     | 0x01450 | 0x0     | 0x5    | 0x050     |          |
| (7,5)     | 0x01460 | 0x0     | 0x5    | 0x060     |          |
| (6,3)     | 0x00A50 | 0x0     | 0x2    | 0x250     | ✓        |
- Note: **if the accessed data is NOT in the row buffer of the current bank, there is a row conflict**
- 4 out of 11 (≈36%) memory accesses result in row conflicts
- **A naive address mapping scheme like BRC is NOT good enough for CNN**
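The 36% figure can be reproduced with a tiny row-buffer model (a sketch; it assumes the table's conventions: 1-indexed (x, y) pixels, an 80-pixel-wide image, 16 bits per pixel, and BRC row ID = bit address >> 10):

```python
def brc_row(x: int, y: int, width: int = 80, pixel_bits: int = 16) -> int:
    """Bit address of 1-indexed pixel (x, y), sliced to its BRC row ID."""
    addr = ((y - 1) * width + (x - 1)) * pixel_bits
    return addr >> 10  # row bits sit above the 10 column bits

# The 11 accesses from the table (all of them land in bank 0)
trace = [(6, 5), (5, 3), (6, 3), (7, 3), (5, 4), (6, 4),
         (7, 4), (5, 5), (6, 5), (7, 5), (6, 3)]

open_row = None  # row currently held in bank 0's row buffer
conflicts = 0
for x, y in trace:
    row = brc_row(x, y)
    if open_row is not None and open_row != row:
        conflicts += 1  # precharge + activate needed
    open_row = row

print(conflicts, "/", len(trace))  # 4 / 11, i.e. ~36% row conflicts
```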
## Architecture

### AABM (Address Access Behavior Monitoring)
- Log memory address stream
### SAPU (Stream Access Prediction Unit)
- Function
- Analyzes the input address stream
- Analyzes the neural network pattern (e.g. convolution kernel size)
- Divides memory requests into four subtypes: **sequential access, stride access, 2D access, random access**
- Components
- **PRT (Parameter Reference Table)**
- Stores **address streams** and **network parameters** (e.g. image length, window size)
- Network parameters are predefined by the AI accelerator
- **PCU (Parameter Computing Unit)**
- Address difference storage unit
- Calculates and stores `diff_addr = prev_addr - curr_addr` in `Dif_ram`
- Parameter threshold calculation unit
- Computes thresholds from the network parameters (`Pixel` is the bit width of one pixel):
- `Image_width = Image_width + 2 x padding` (image width after padding)
- `Kernel = Kernel_size x Kernel_size` (number of pixels in one kernel window)
- `M = (Image_width - Kernel_size + 1) x Pixel` (the address jump when the kernel window moves to its next row)
- `N = (Image_height - 1) x Image_width x Pixel` (the bit address where the last row of the input starts)
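Plugging in the running example's parameters (assumed here: an 80x60 image, 16-bit pixels, a 3x3 kernel, no padding; the function name is illustrative) makes the thresholds concrete:

```python
def pcu_thresholds(image_width, image_height, kernel_size, pixel_bits, padding=0):
    """Thresholds the PCU derives from the network parameters (sketch)."""
    width = image_width + 2 * padding            # width after padding
    kernel = kernel_size * kernel_size           # pixels per kernel window
    M = (width - kernel_size + 1) * pixel_bits   # address jump to the kernel's next row
    N = (image_height - 1) * width * pixel_bits  # bit address of the input's last row
    return kernel, M, N

kernel, M, N = pcu_thresholds(80, 60, 3, 16)
# kernel = 9, M = (80 - 3 + 1) * 16 = 1248, N = 59 * 80 * 16 = 75520
```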
- **CDL (Comparative Decision Logic)**
- Reads values from `Dif_ram` and generates outputs for the AL
- Steps:
- **Combinational Logic A**: switches to the next row of a kernel
- **Combinational Logic B**: detects sequential access within a row of a kernel
- **Combinational Logic C**: outputs 1 when the memory access is sequential
- **Shift register**: outputs 1 when a kernel is done
- **AL (Arbitration Logic)**
- S0, S1, and S2 provide binary combinations that select among the **four output results (Y0–Y3)**, corresponding to **100, 010, and 001**
- `100`: 2D mem access
- `010`: ?
- `001`: ?
- A better solution: an FSM designed to catch the 2D memory access pattern of a 3x3 convolution kernel
```graphviz
digraph finite_state_machine {
rankdir=LR;
size="8,5"
node [shape = doublecircle]; Seq Two_D Random Stride;
node [shape = circle];
INIT -> Row0;
Row0 -> Row1 [ label = <<font color="blue">diff % M == 0</font>> ];
Row0 -> Seq [ label = <<font color="purple">diff_addr the same</font>> ];
Row0 -> Row0;
Row1 -> Row2 [ label = <<font color="blue">diff % M == 0</font>> ];
Row1 -> Row1;
Row2 -> Row0 [ label = <<font color="blue">diff % M == 0</font>> ];
Row2 -> Row2;
Row0 -> Two_D [ label = <<font color="green">done = 1</font>> ];
Row0 -> Random[ label = <<font color="red">non-seq</font>> ];
Row1 -> Random[ label = <<font color="red">non-seq</font>> ];
Row2 -> Random[ label = <<font color="red">non-seq</font>> ];
}
```
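In software the same idea fits in a few lines (a sketch, not the paper's RTL: classify an address stream from its first-order differences, using the `M` threshold computed by the PCU):

```python
def classify_stream(addrs, M):
    """Classify an address stream as 'sequential', '2D', or 'random'.
    Sketch of the FSM above: constant diffs -> sequential; constant
    in-row diffs plus row jumps that are multiples of M -> 2D."""
    diffs = [b - a for a, b in zip(addrs, addrs[1:])]
    if not diffs:
        return "random"
    if all(d == diffs[0] for d in diffs):
        return "sequential"
    if all(d == diffs[0] or d % M == 0 for d in diffs):
        return "2D"
    return "random"

M = (80 - 3 + 1) * 16  # 1248, for an 80-wide image, 3x3 kernel, 16-bit pixels

# 3x3 kernel walk over the image's top-left corner (bit addresses)
kernel_walk = [(y * 80 + x) * 16 for y in range(3) for x in range(3)]
print(classify_stream(kernel_walk, M))      # 2D
print(classify_stream([0, 16, 32, 48], M))  # sequential
print(classify_stream([0, 5000, 37], M))    # random
```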
### An example
- *(figure from the paper omitted)*
### BPM (Bank Partitioning Model)
#### Why BPM ?
- When the accelerator simultaneously runs **multiple memory access streams**, **row conflicts** and **memory access latency** can reduce the system performance
- LHS: a single address mapping scheme for **ALL** CNN memory accesses
- RHS: DRAM banks partitioned, with a different address mapping scheme per partition
- **Bank retag**: Rearranges the bank ID of an address
- Detail: *(figure from the paper omitted)*
- Advantage
- Isolate and reduce the interference between memory access streams
- Each bank has its OWN address mapping scheme
- Fully utilize memory space (?)
- Disadvantage
- Potential loss / conflict of data
- Data loss is rare, and its probability has minimal impact on neural network results
#### Result
- **PBPI** address mapping scheme for **stride** access
- **RBC** address mapping scheme for **sequential** and **2D** access
- **Random** access requires NO address mapping scheme
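A sketch of how the BPM's policy could look in software (the function names and the exact PBPI bit layout are illustrative, not taken from the paper): SAPU's stream type selects the mapping applied to that stream's bank group.

```python
def rbc_map(addr):
    """RBC: row [26:13] | bank [12:10] | column [9:0].
    Used for sequential and 2D streams."""
    return (addr >> 10) & 0x7, (addr >> 13) & 0x3FFF, addr & 0x3FF

def pbpi_map(addr):
    """PBPI-style: slice as bank [26:24] | row [23:10] | column [9:0],
    then XOR low row bits into the bank ID. Used for stride streams,
    which would otherwise hammer a single bank."""
    row = (addr >> 10) & 0x3FFF
    bank = ((addr >> 24) & 0x7) ^ (row & 0x7)
    return bank, row, addr & 0x3FF

def translate(stream_type, addr):
    # Random streams get no special scheme; the fallback to RBC here is
    # purely illustrative (an assumption, not the paper's choice).
    table = {"sequential": rbc_map, "2D": rbc_map, "stride": pbpi_map}
    return table.get(stream_type, rbc_map)(addr)

# A stride of 0x400 maps every access to the same bank under plain BRC,
# but PBPI's XOR spreads the stream across banks:
# translate("stride", 0x000)[0] -> 0, translate("stride", 0x400)[0] -> 1
```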
## Experiment
### Platform
- Xilinx VC707 FPGA
- 1 GB Micron DDR3 DRAM
- Physical address contains 27 bits
- 3 bits bank coordinates
- 14 bits row coordinates
- 10 bits column coordinates
- Based on a Xilinx **MIG IP Core** and implemented in Verilog
- The system voltage is **1.8 V** and the system frequency is **200 MHz**
### Metric
- **Row buffer hit ratio** and **access latency** were selected as performance measures
### NNAMC's perf in various network parameters
- Regardless of how the network's image size, pixel size, and convolution kernel size change, NNAMC still delivers **high memory access performance**
## Term definition
- NNAMC: Neural Network Accelerator Memory Controller
- SAPU: Stream Access Prediction Unit
- BPM: Bank Partitioning Model
- BRC: Bank-Row-Column
- RBC: Row-Bank-Column
- MIG: Xilinx Memory Interface Generator
- PBPI: Permutation-Based Page Interleaving
- APT: Address and Pixel Transaction
- AABM: Address Access Behavior Monitoring
- PRT: Parameter Reference Table
- PCU: Parameter Computing Unit
- CDL: Comparative Decision Logic
- AL: Arbitration Logic