# [Paper] Memory Access Optimization of a Neural Network Accelerator Based on Memory Controller

###### tags: `research-GraphRC`

[TOC]

## What is NNAMC ?

- NNAMC (Neural Network Accelerator Memory Controller) is the **DRAM controller** proposed in this paper. It attempts to improve accelerator (specifically, **CNN accelerator**) performance by targeting the following metrics:
    - ==Increase the **row buffer hit ratio**==
    - ==Reduce the **access latency**==
- NNAMC achieves better performance by
    - Dividing the memory **banks** and applying different **address mapping schemes** to different banks
    - Finding the optimal **address mapping scheme** based on previously accessed addresses

## Why do we need a memory controller for AI accelerators ?

- The utilization of DRAM depends on its **access pattern**
- CNN computation generates **complicated DRAM access patterns** (i.e. conv, pooling, fully-connected)
- Previous memory controllers convert physical addresses using a **single address mapping scheme**, which is NOT flexible enough for a CNN accelerator

## Previous work

- Previous memory controllers focus on:
    - Memory address **mapping**
    - Memory address **scheduling**
    - **Rearrangement** of access data
- Memory address mapping dispatches memory accesses to physical storage units based on the arrangement of **bank ID, row ID, column ID**
- Predefined optimization techniques perform poorly because access patterns are **dynamically generated**

### Commonly used address mapping

- ![](https://i.imgur.com/Nbwm2OP.png)

## DRAM characteristic

- ![](https://i.imgur.com/zWj9TUV.png)
- Structure
    - Vertically, each cell is connected to a **row buffer (pink)** through a shared **bit line (blue)**
    - Horizontally, each cell is connected to a **row decoder (black)** through a shared **word line (red)**
    - ==**Row buffer** is used as a cache for storing the row data accessed by the previous address==
- Bank hit & conflict
    - ==**Row buffer hit**==: If two consecutive accesses refer to **the same row of the same bank** in a DRAM, the access latency is low
    - ==**Row buffer conflict**==: When the user accesses **different rows in the same bank** of a DRAM, the previously accessed row must be **precharged** and the target row must be **activated**. Such an operation requires many cycles
    - The latency of a row conflict is **3.5x** that of a row buffer hit (DDR3-1600)
        - Row buffer hit: **13.125 ns**
        - Row buffer miss: **48.125 ns**

### Solving row conflict

- Resolving a row conflict **in the same bank**
    - The current row must be **closed** by a **precharge** command
    - Issue an **activate** command to read another row

### Avoid row conflict

- Avoid row conflicts by placing data **in different banks**
- Two accesses to **different rows of different banks** avoid a row conflict

## Example of a BAD address mapping scheme

- ![](https://i.imgur.com/FqsnFHU.jpg)

### Assumption

- Each pixel is **16 bits**
- Input image is of size 80x60 pixels
- The addresses shown in the picture are **bit-addresses**
- Use a **BRC** address mapping scheme

### BRC + CNN example

- ![](https://i.imgur.com/PdHi5r1.png)
- Access pixel (79,59) in this image
- Convert (79,59) to memory address **0x126F0**
- Apply the BRC address mapping scheme
    - | Bank ID | Row ID  | Column ID |
      | ------- | ------- | --------- |
      | [27:25] | [24:10] | [9:0]     |
      | 0x0     | 0x49    | 0x2F0     |
- This address is mapped to **bank 0x0, row 0x49, column 0x2F0**
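To make the BRC bit split concrete, here is a minimal Python sketch (my own, not from the paper; `brc_decode` is a hypothetical helper name) that extracts the bank/row/column fields of the example bit-address:

```python
def brc_decode(addr: int):
    """Split a bit-address into (bank, row, column) under the BRC layout
    used in this example: bank = bits [27:25], row = bits [24:10],
    column = bits [9:0]."""
    col  = addr & 0x3FF           # bits [9:0]
    row  = (addr >> 10) & 0x7FFF  # bits [24:10]
    bank = (addr >> 25) & 0x7     # bits [27:25]
    return bank, row, col

# Pixel (79,59) of the 80x60 image maps to bit-address 0x126F0
bank, row, col = brc_decode(0x126F0)
print(hex(bank), hex(row), hex(col))  # 0x0 0x49 0x2f0
```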
### Why BRC is bad for CNN ?

- ![](https://i.imgur.com/aTUp0XX.png)
- | Input idx | Address | Bank ID | Row ID | Column ID | Conflict |
  | --------- | ------- | ------- | ------ | --------- | -------- |
  | (6,5)     | 0x01450 | 0x0     | 0x5    | 0x050     | *        |
  | (5,3)     | 0x00A40 | 0x0     | 0x2    | 0x240     | *        |
  | (6,3)     | 0x00A50 | 0x0     | 0x2    | 0x250     |          |
  | (7,3)     | 0x00A60 | 0x0     | 0x2    | 0x260     | x        |
  | (5,4)     | 0x00F40 | 0x0     | 0x3    | 0x340     | x        |
  | (6,4)     | 0x00F50 | 0x0     | 0x3    | 0x350     |          |
  | (7,4)     | 0x00F60 | 0x0     | 0x3    | 0x360     | v        |
  | (5,5)     | 0x01440 | 0x0     | 0x5    | 0x040     | v        |
  | (6,5)     | 0x01450 | 0x0     | 0x5    | 0x050     |          |
  | (7,5)     | 0x01460 | 0x0     | 0x5    | 0x060     | r        |
  | (6,3)     | 0x00A50 | 0x0     | 0x2    | 0x250     | r        |
- Note: **If the accessed data are NOT in the row buffer of the current bank, there is a conflict.** Each pair of identical symbols (`*`, `x`, `v`, `r`) marks two consecutive accesses that hit different rows of the same bank
- 4 out of 11 (36%) memory accesses result in a row conflict
- **A naive address mapping scheme like BRC is NOT good enough for CNN**

## Architecture

![](https://i.imgur.com/1nVQSGg.png)

### AABM (Address Access Behavior Monitoring)

- Logs the memory address stream

### SAPU (Stream Access Prediction Unit)

- Function
    - Analyzes the input address stream
    - Analyzes the neural network pattern (i.e. convolution size)
    - Divides memory requests into four subtypes: sequential access, stride access, 2D access, random access (a software sketch of this classification follows the example below)
- Components
    - **PRT (Parameter Reference Table)**
        - Stores **address streams** and **network parameters** (i.e. image length, window size)
        - Network parameters should be predefined by the AI accelerator
    - **PCU (Parameter Computing Unit)**
        - Address difference storage unit
            - Calculates and stores `diff_addr = prev_addr - curr_addr`
        - Parameter threshold calculation unit
            - Calculates network parameters
                - `Image_width = Image_width + 2 x padding`
                    - `Image_width` is the current pixel
                - `Kernel = Kernel_size x Kernel_size`
                - `M = (Image_width − Kernel_size + 1) × Pixel`
                    - `M` is the current bit address
                - `N = (Image_height − 1) × Image_width × Pixel`
                    - `N` is the last bit address of an input
    - **CDL (Comparative Decision Logic)**
        - Reads values from `Dif_ram` and generates output to the AL
        - Steps:
            - :::spoiler Original explanation
              ![](https://i.imgur.com/jKB2Vki.png)
              :::
            - **Combinational Logic A**: switches to the next row of a kernel
            - **Combinational Logic B**: detects sequential access within a row of a kernel
            - **Combinational Logic C**: outputs 1 when the memory access is sequential
            - **Shift register**: outputs 1 when a whole kernel has been processed
    - **AL (Arbitration Logic)**
        - S0, S1, and S2 provide binary-number combinations on the **four output results (Y0-Y3)**, corresponding to **100, 010, and 001**
            - `100`: 2D mem access
            - `010`: ?
            - `001`: ?
- A better solution: an FSM designed to catch the 2D memory pattern of a 3x3 conv kernel

```graphviz
digraph finite_state_machine {
    rankdir=LR;
    size="8,5"
    node [shape = doublecircle]; Seq Two_D Random Stride;
    node [shape = circle];
    INIT -> Row0;
    Row0 -> Row1 [ label = <<font color="blue">diff % M == 0</font>> ];
    Row0 -> Seq [ label = <<font color="purple">diff_addr the same</font>> ];
    Row0 -> Row0;
    Row1 -> Row2 [ label = <<font color="blue">diff % M == 0</font>> ];
    Row1 -> Row1;
    Row2 -> Row0 [ label = <<font color="blue">diff % M == 0</font>> ];
    Row2 -> Row2;
    Row0 -> Two_D [ label = <<font color="green">done = 1</font>> ];
    Row0 -> Random [ label = <<font color="red">non-seq</font>> ];
    Row1 -> Random [ label = <<font color="red">non-seq</font>> ];
    Row2 -> Random [ label = <<font color="red">non-seq</font>> ];
}
```

### An example

- ![](https://i.imgur.com/h5Isj65.png)
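SAPU itself is hardware; the following Python sketch is my own simplification of the classification idea, not the paper's RTL: walk the `diff_addr` sequence and label the stream as sequential, stride, 2D, or random. The function name `classify` and the `row_stride` parameter (image width in bit-address units) are illustrative assumptions.

```python
def classify(addrs, pixel=16, kernel=3, row_stride=80 * 16):
    """Label an address stream from its consecutive address differences."""
    diffs = [b - a for a, b in zip(addrs, addrs[1:])]
    if not diffs:
        return "random"
    if all(d == pixel for d in diffs):
        return "sequential"                      # one-pixel (unit) stride
    if len(set(diffs)) == 1:
        return "stride"                          # constant, non-unit stride
    # 2D pattern: (kernel - 1) one-pixel steps, then a jump to the next row
    expected = [pixel] * (kernel - 1) + [row_stride - (kernel - 1) * pixel]
    repeated = expected * (len(diffs) // len(expected) + 1)
    if diffs == repeated[:len(diffs)]:
        return "2D"
    return "random"

# 3x3 window at one position of an 80-pixel-wide image with 16-bit pixels
addrs = [r * 80 * 16 + c * 16 for r in range(3) for c in range(3)]
print(classify(addrs))  # 2D
```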
### BPM (Bank Partitioning Model)

#### Why BPM ?

- When the accelerator simultaneously runs **multiple memory access streams**, **row conflicts** and **memory access latency** can reduce system performance
- ![](https://i.imgur.com/n5UBI1a.png)
    - LHS: a single address mapping for **ALL** CNN memory accesses
    - RHS: partition the DRAM banks, each with a different address mapping
- **Bank retag**: rearranges the bank ID of an address
- Detail
    - ![](https://i.imgur.com/6ir1LuG.png)
- Advantage
    - Isolates and reduces the interference between memory access streams
    - Each bank has its OWN address mapping scheme
    - Fully utilizes the memory space (?)
- Disadvantage
    - Potential loss / conflict of data
    - The (low) probability of data loss has minimal impact on neural networks

#### Result

- **PBPI** address mapping scheme for **stride** access
- **RBC** address mapping scheme for **sequential** and **2D** access
- **Random** access requires NO address mapping scheme

## Experiment

### Platform

- Xilinx VC707 FPGA
- 1 GB Micron DDR3 DRAM
- Physical address contains 27 bits
    - 3 bits bank coordinates
    - 14 bits row coordinates
    - 10 bits column coordinates
- Based on a Xilinx **MIG IP Core** and implemented in Verilog
- The system voltage is **1.8 V** and the system frequency is **200 MHz**

### Metric

- **Row buffer hit ratio** and **access latency** were selected as performance measures
- ![](https://i.imgur.com/xOdunvK.png)
- ![](https://i.imgur.com/3qVefLy.png)

### NNAMC's perf in various network parameters

- No matter how the network parameters (image size, pixel size, convolution kernel size) change, NNAMC still delivers **high memory access performance**
- ![](https://i.imgur.com/89AKh3c.png)

## Term definition

- NNAMC: Neural Network Accelerator Memory Controller
- SAPU: Stream Access Prediction Unit
- BPM: Bank Partitioning Model
- BRC: Bank-Row-Column
- RBC: Row-Bank-Column
- MIG: Xilinx Memory Interface Generator
- PBPI: Permutation-Based Page Interleaving
- APT: Address and Pixel Transaction
- AABM: Address Access Behavior Monitoring
- PRT: Parameter Reference Table
- PCU: Parameter Computing Unit
- CDL: Comparative Decision Logic
- AL: Arbitration Logic
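As a closing illustration (my own sketch, not the paper's implementation): the BPM result reads as a small dispatch table from predicted stream type to the address-mapping scheme applied to that bank partition, using the 27-bit physical address from the experiment (3 bank / 14 row / 10 column bits). The exact RBC field ordering and the XOR form of PBPI shown here are assumptions on my part.

```python
# Field widths from the experiment setup: 3 bank / 14 row / 10 column bits.
COL_BITS, BANK_BITS, ROW_BITS = 10, 3, 14

def rbc(addr):
    """Row-Bank-Column: row in the highest bits, then bank, then column
    (assumed ordering)."""
    col  = addr & ((1 << COL_BITS) - 1)
    bank = (addr >> COL_BITS) & ((1 << BANK_BITS) - 1)
    row  = (addr >> (COL_BITS + BANK_BITS)) & ((1 << ROW_BITS) - 1)
    return bank, row, col

def pbpi(addr):
    """Permutation-Based Page Interleaving, sketched as XOR-ing the bank
    index with low-order row bits (assumed form)."""
    bank, row, col = rbc(addr)
    return bank ^ (row & ((1 << BANK_BITS) - 1)), row, col

SCHEME = {            # BPM result: stream type -> address-mapping scheme
    "sequential": rbc,
    "2D":         rbc,
    "stride":     pbpi,
    "random":     None,  # random access gets no remapping
}

print(SCHEME["2D"](0x126F0))  # (bank, row, col) under RBC
```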