# Fused-Layer CNN Accelerators
###### tags: `Accelerators`
###### members: @Mickeyyayyaya
## Abstract
* Accelerators for efficiently evaluating CNNs are rapidly growing in popularity.
* The conventional approach to designing such CNN accelerators is to focus on creating accelerators that iteratively process the CNN layers.
### Issue
* By processing each layer to completion, the accelerator designs must use off-chip memory to store intermediate data between layers, because the <font color= red >intermediate data are too large to fit on chip</font>.
## Introduction
### 1. Significant interest in developing and adapting hardware accelerators for CNNs
* GPUs, FPGAs, and ASICs
### 2. Sheer volume of operations precludes a dataflow implementation
* Even for a single layer
### 3. Current approach (at the time)
* Evaluate the network by following its structure, one layer at a time.
* Intermediate data are streamed back to the same compute units, repeating the process until all layers have been evaluated.
### 4. Issue
:::danger
* Wasted data movement
:::
## Designing a CNN Accelerator

* Input: N channels of R × C values
* Convolved with M sets of N × k × k filters (see the loop-nest sketch after this list)
* Output feature maps then undergo some kind of non-linear operation
    * E.g. ReLU
* Optionally followed by a pooling/subsampling layer
* Increasing the depth (number of layers) of a network yields higher recognition accuracy
    * 2012 AlexNet: 5
    * 2014 VGG: 17
    * 2014 GoogLeNet: 20
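As a point of reference for the dimensions above, here is a minimal NumPy sketch of one convolutional layer: N input channels of R × C values, M filters of size N × k × k, followed by ReLU. This is an illustrative loop nest only, not the accelerator's datapath; the function name and argument layout are assumptions for this note.

```python
import numpy as np

def conv_layer(in_fmaps, weights, k, stride=1):
    """One layer: N channels of R x C inputs convolved with M sets of
    N x k x k filters, followed by a ReLU non-linearity.
    in_fmaps: (N, R, C), weights: (M, N, k, k) -> output: (M, R_out, C_out)."""
    N, R, C = in_fmaps.shape
    M = weights.shape[0]
    R_out = (R - k) // stride + 1
    C_out = (C - k) // stride + 1
    out = np.zeros((M, R_out, C_out))
    for m in range(M):                        # one output feature map per filter set
        for r in range(R_out):
            for c in range(C_out):
                window = in_fmaps[:, r*stride:r*stride+k, c*stride:c*stride+k]
                out[m, r, c] = np.sum(window * weights[m])
    return np.maximum(out, 0.0)               # ReLU
```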
### A. Hardware Accelerators
For each layer, input feature maps and filter weights are brought from off-chip DRAM into local buffers, the convolutions are performed, and output feature map data are written into DRAM.
::: danger
The large volume of data comprising the feature maps stresses the memory system and can become the bottleneck.
:::
This has inspired efforts to optimize the memory access patterns for this layer-by-layer approach.
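For contrast with layer fusion later on, the conventional layer-by-layer flow looks roughly like the sketch below. The `DRAM` dict and `dram_read`/`dram_write` helpers are hypothetical stand-ins for the off-chip transfers (weights are assumed to have been written into `DRAM` beforehand), and `conv_layer` is the loop-nest sketch above.

```python
# Hypothetical stand-ins for off-chip transfers: a dict plays the role of DRAM.
DRAM = {}
def dram_write(key, data):
    DRAM[key] = data
def dram_read(key):
    return DRAM[key]

def evaluate_layer_by_layer(layers, input_fmaps):
    """Conventional flow: every layer's inputs, weights, and outputs
    travel through off-chip 'DRAM'."""
    dram_write("fmaps_0", input_fmaps)
    for i, layer in enumerate(layers):
        in_fmaps = dram_read(f"fmaps_{i}")            # input feature maps from DRAM
        weights = dram_read(layer["weights_key"])     # filter weights from DRAM
        out_fmaps = conv_layer(in_fmaps, weights, layer["k"], layer["stride"])
        dram_write(f"fmaps_{i + 1}", out_fmaps)       # intermediates written back to DRAM
    return dram_read(f"fmaps_{len(layers)}")
```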
### B. Data Access Patterns

* In the first eight layers, the sum of the inputs and outputs is much larger than the weights.
* Prior work tried to treat the symptoms (by managing on/off-chip bandwidth more effectively)
    * But the networks keep growing
    * 25% of AlexNet's data usage is feature maps
    * 50% in GoogLeNet
* This work: remove these accesses completely
::: success
In this paper, we demonstrate layer fusion, our technique to minimize the off-chip data movement between layers by re-organizing the evaluation process.
:::
## Fused-layer CNN Accelerators
### Key ideas
* Fuse 2+ convolution layers
* Only the input feature maps of the first fused layer are transferred from DRAM
* Compute the intermediate values of all of the fused layers that depend on the data region being read
### A. Overview

* Tile the original feature maps into 5 × 5 × N chunks
* Layer 1 convolves with all M sets of 3 × 3 × N filters
* Layer 2 does the same, but on the 3 × 3 × M values
* Produces 1 × 1 × P results (a fused-tile sketch follows the box below)
::: success
No longer have to save the intermediate feature maps!
**Side-effects:**
1. We don't have to load much new data as the tile moves
2. Adjacent tiles overlap, so some intermediate computation is shared
    * Re-compute it?
    * Cache it?
:::
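A minimal sketch of the fused evaluation of one tile, reusing the `conv_layer` loop nest from earlier. The tile and filter sizes follow the example above; the function name is an assumption for this note.

```python
def fused_two_layer_tile(tile, w1, w2):
    """One 5 x 5 x N input tile pushed through two fused 3x3 layers.
    tile: (N, 5, 5), w1: (M, N, 3, 3), w2: (P, M, 3, 3).
    The 3 x 3 x M intermediate stays in an on-chip buffer (here, just a
    local variable) and is never written to DRAM."""
    inter = conv_layer(tile, w1, k=3)   # (M, 3, 3) intermediate values
    return conv_layer(inter, w2, k=3)   # (P, 1, 1) output values
```

Sliding this tile across the input produces the full output one position at a time; the overlap between adjacent tile positions is exactly the re-compute-vs-cache question raised above.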
### B. Exploration Framework
::: info
How do you determine how much space is required?
:::
* Work backwards from the output tile to the input region it needs
* Use the dimensions of the filters/sets/tiles/strides to quantify the required storage (a minimal sketch follows this list)
* Similarly, we can compute the cost of re-computation vs. caching
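A minimal sketch of the "work backwards" step, assuming each layer is described by its filter size k and stride s (pooling layers can be folded the same way); the helper name is an assumption for this note.

```python
def required_input_tile(out_size, layers):
    """Work backwards: given the output tile size we want, and each layer's
    (filter size k, stride s) ordered first-to-last, return the input tile
    size that must be read from DRAM."""
    size = out_size
    for k, s in reversed(layers):
        size = (size - 1) * s + k   # input region needed to produce `size` outputs
    return size

# Example matching the overview: two 3x3, stride-1 layers and a 1 x 1 output tile
# -> (1 - 1) * 1 + 3 = 3, then (3 - 1) * 1 + 3 = 5, i.e. a 5 x 5 input tile.
print(required_input_tile(1, [(3, 1), (3, 1)]))   # 5
```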
### C. Recomputing vs. Storing
* From Fig. 3:
    * 6M blue (purple) values in the overlapping region
    * Each requires 9N multiplies and 9N adds
    * Total = 6M × (9 + 9)N = 108MN extra operations (see the worked check after this list)
    * The overlap is reused multiple times
        * Left-to-right and top-to-bottom as the tile slides
* VGGNet-E
* Layer 1: M=64 N=3
* Layer 2: M=64 N=64
* AlexNet
    * Fusing the first two layers adds 768M extra multiply + add operations (re-computation) to avoid the off-chip transfers
    * Alternatively, only 55.86KB of extra storage is needed to cache the overlap and avoid this re-computation
* VGGNet-E
    * Fusing all 19 layers
    * 470B extra operations (re-compute)
    * vs. 1.4MB of extra storage (cache)
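A worked check of the per-step count above, plugging in the VGGNet-E layer-1 dimensions listed in this section (the 768M / 470B totals are, roughly, such per-step costs accumulated over all tile positions and fused layers):

$$
6M \times (\underbrace{9N}_{\text{mul}} + \underbrace{9N}_{\text{add}}) = 108MN
\quad\Longrightarrow\quad
108 \times 64 \times 3 = 20{,}736 \ \text{extra ops per tile step for } M=64,\ N=3
$$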
### D. Partitioning Networks for Layer Fusion
* Fusing all layers will increase the on-chip storage required
* This is not necessary
* Hierarchical design: partition the network into groups of fused layers instead of fusing everything

## Evaluation

* Evaluate the trade-offs for all possible fusing combinations (a Pareto-search sketch follows this list)
* Best points are along the bottom of the graph
    * Low bandwidth
    * Low storage
* Pareto-optimal points are connected
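A minimal sketch of that exploration, assuming a hypothetical `group_cost(group)` model (standing in for the exploration framework from Section B) that returns (DRAM traffic, on-chip storage) for one fused group of layers:

```python
from itertools import combinations

def partitions(num_layers):
    """All ways to split layers 0..num_layers-1 into contiguous fused groups."""
    for n_cuts in range(num_layers):
        for cuts in combinations(range(1, num_layers), n_cuts):
            bounds = (0,) + cuts + (num_layers,)
            yield [tuple(range(bounds[i], bounds[i + 1]))
                   for i in range(len(bounds) - 1)]

def pareto_designs(num_layers, group_cost):
    """Enumerate every fusing combination and keep the non-dominated
    (lowest-traffic / lowest-storage) design points."""
    designs = []
    for p in partitions(num_layers):
        traffic = sum(group_cost(g)[0] for g in p)
        storage = max(group_cost(g)[1] for g in p)  # groups evaluated one at a time
        designs.append((traffic, storage, p))
    return sorted(
        (d for d in designs
         if not any(o[0] <= d[0] and o[1] <= d[1] and
                    (o[0] < d[0] or o[1] < d[1]) for o in designs)),
        key=lambda d: d[0])
```

Whether per-group storage should be summed or maxed depends on whether the fused groups share the same on-chip buffers; the sketch assumes they are evaluated one at a time on shared buffers.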
## Results
### Setup
* Vivado HLS
* Xilinx Virtex-7 FPGA
* Focusing only on the early layers
### AlexNet (baseline)

* Used the optimized design from prior FPGA work as the baseline
* Includes pooling
* Assumes non-linear operations can be completely overlapped with existing computation
    * i.e., the non-linear ops add no new overhead
### VGGNet-E

* Same optimized design from the FPGA work
* Same conservative assumptions
* Conservative cycle count
    * No padding layers or pipeline filling each iteration
* 20% extra BRAMs for avoiding transfers of data
* Extra DSPs for control logic
## Discussion
* Long live data oriented design
* Results are transferable to CPUs (2× speedup over baseline)
* Authors claim difficult to apply to GPUs with current programming model