# Fused-Layer CNN Accelerators
###### tags: `Accelerators`
###### members: @Mickeyyayyaya

## Abstract
* Accelerators for efficiently evaluating CNNs are rapidly growing in popularity.
* The conventional approach to designing such CNN accelerators is to create accelerators that iteratively process the CNN layers.

### Issue
* By processing each layer to completion, the accelerator designs must use off-chip memory to store intermediate data between layers, because the <font color=red>intermediate data are too large to fit on chip</font>.

## Introduction
### 1. Significant interest in developing and adapting hardware accelerators for CNNs
* GPUs, FPGAs, and ASICs

### 2. Sheer volume of operations precludes a dataflow implementation
* Even for a single layer

### 3. Current approach (at the time)
* Evaluate the network by following its structure, one layer at a time.
* Intermediate data are streamed back to the same compute units, repeating the process until all layers have been evaluated.

### 4. Issue
:::danger
* Wasted data movement
:::

## Designing a CNN Accelerator
![](https://i.imgur.com/4zaWLge.jpg)
* N channels of R × C values
* Convolved with M sets of N × k × k filters
* The output feature maps then undergo a non-linear operation
    * E.g. ReLU
    * Optionally followed by pooling
* Increasing the depth (number of layers) of a network yields higher recognition accuracy
    * 2012 AlexNet: 5
    * 2014 VGG: 17
    * 2014 GoogLeNet: 20

### A. Hardware Accelerators
For each layer, input feature maps and filter weights are brought from off-chip DRAM into local buffers, the convolutions are performed, and the output feature map data are written back to DRAM.
:::danger
The large volume of data comprising the feature maps stresses the memory system and can become the bottleneck.
:::
This has inspired efforts to optimize the memory access patterns of this layer-by-layer approach.

### B. Data Access Patterns
![](https://i.imgur.com/2OXG0Di.jpg)
* In the first eight layers, the sum of the inputs and outputs is much larger than the weights.
* Prior work tried to treat the symptoms (managing the on/off-chip bandwidth more effectively)
* But the networks keep growing
    * 25% of AlexNet's data usage is feature maps
    * 50% in GoogLeNet
* Goal: remove these accesses completely
:::success
In this paper, we demonstrate layer fusion, our technique to minimize the off-chip data movement between layers by re-organizing the evaluation process.
:::

## Fused-Layer CNN Accelerators
### Key ideas
* Fuse two or more convolution layers
* Only the input feature maps of the first fused layer are transferred from DRAM
* Compute the intermediate values of all of the fused layers that depend on the data region being read

### A. Overview
![](https://i.imgur.com/NYaAXwj.jpg)
* Tile the original feature maps into 5 × 5 × N chunks
* Layer 1 convolves each chunk with all M sets of 3 × 3 × N filters
* Layer 2 does the same, but on the resulting 3 × 3 × M values
* Produces 1 × 1 × P results
:::success
We no longer have to save the intermediate feature map! (A minimal code sketch of this fused tile evaluation appears at the end of this note.)
**Side effects:**
1. Little new data has to be loaded as the tile moves
2. Computation overlaps between adjacent tiles
    * Re-compute it?
    * Cache it?
:::

### B. Exploration Framework
:::info
How do you determine how much space is required?
:::
* Work backwards from the output tile
* Use the dimensions of the filters/sets/tiles/etc. to quantify it (see the sketch below)
* Similarly, we can compute the cost of re-computation vs. caching
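Below is a minimal Python sketch of this "work backwards" calculation, assuming stride-1 "valid" convolutions as in the Fig. 3 example and 16-bit feature-map values; the `(kernel, stride, in_channels)` layer format and the helper names are illustrative, not the paper's exploration tool.

```python
# Hypothetical sketch of the "work backwards" step: given the output tile of the
# last fused layer, derive each layer's required input tile and the on-chip
# storage those tiles need. Layers are (kernel, stride, in_channels) tuples --
# an assumed format, not the paper's.

def tile_sizes_backwards(fused_layers, out_tile=1):
    """Return (input_tile_side, in_channels) per layer, first fused layer first."""
    sizes = []
    tile = out_tile
    for kernel, stride, in_channels in reversed(fused_layers):
        # A (tile x tile) output needs ((tile - 1) * stride + kernel) input rows/cols.
        tile = (tile - 1) * stride + kernel
        sizes.append((tile, in_channels))
    return list(reversed(sizes))

def tile_storage_bytes(fused_layers, out_tile=1, bytes_per_value=2):
    """Total feature-map tile storage across the fused layers (16-bit values assumed)."""
    return sum(side * side * ch * bytes_per_value
               for side, ch in tile_sizes_backwards(fused_layers, out_tile))

# Fig. 3 example: two 3x3, stride-1 layers producing a 1x1xP output tile,
# with VGGNet-E's first-layer channel counts (N = 3, M = 64).
N, M = 3, 64
layers = [(3, 1, N), (3, 1, M)]           # layer 2's input channels = layer 1's M outputs
print(tile_sizes_backwards(layers))       # [(5, 3), (3, 64)] -> the 5x5xN and 3x3xM tiles
print(tile_storage_bytes(layers))         # bytes of on-chip buffer for those two tiles
```

The same tile dimensions also give the number of values that overlap between adjacent tiles, which feeds the recomputation-vs-storage comparison in the next section.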
### C. Recomputing vs. Storing
* From Fig. 3:
    * 6M blue (purple) values
    * Each requires 9N multiplies and 9N additions
    * Total = 6M(9 + 9)N = 108MN operations
    * These values are used multiple times as the tile slides left-to-right and top-to-bottom
* VGGNet-E
    * Layer 1: M = 64, N = 3
    * Layer 2: M = 64, N = 64
* AlexNet
    * Fusing the first two layers adds 768M extra multiply and add operations if the overlapped values are recomputed, in exchange for avoiding the off-chip transfers
    * Only 55.86 KB of extra storage is needed to cache those values and avoid the recomputation
* VGGNet-E
    * Fusing all 19 layers
    * 470B extra operations if recomputing
    * 1.4 MB of storage if caching

### D. Partitioning Networks for Layer Fusion
* Fusing all layers increases the on-chip storage required
* This is not necessary
* Hierarchical design: partition the network into groups of fused layers

![](https://i.imgur.com/Mq676eF.jpg)

## Evaluation
![](https://i.imgur.com/aBVlQaV.jpg)
* Evaluate the trade-offs for all possible fusing combinations
* The best points are along the bottom of the graph
    * Low bandwidth
    * Low storage
* Pareto-optimal points are connected

## Results
### Setup
* Vivado HLS
* Xilinx Virtex-7 FPGA
* Focuses on only the early layers

### AlexNet (baseline)
![](https://i.imgur.com/xVXW7dz.jpg)
* Uses the optimized design from prior FPGA work
* Includes pooling
* Assumes non-linear operations can be completely overlapped with existing computation
* Assumes non-linear operations don't add new overhead

### VGGNet-E
![](https://i.imgur.com/rHKaUTX.jpg)
* Same optimized design from the prior FPGA work
* Same conservative assumptions
* Conservative cycle count
    * No padding layers or pipeline filling each iteration
* ~20% extra BRAMs to avoid the data transfers
* Extra DSPs for control logic

## Discussion
* Long live data-oriented design
* The results are transferable to CPUs (2× speedup over baseline)
* The authors claim it is difficult to apply the technique to GPUs with the current programming model
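## Appendix: Fused Tile Evaluation Sketch
As referenced in Section A above, this is a minimal NumPy sketch of evaluating one 5 × 5 × N tile through two fused 3 × 3 convolution layers, so that the 3 × 3 × M intermediate only ever lives in a local buffer and never goes to DRAM. It illustrates the data flow only; it is not the authors' hardware design, and the names, the ReLU placement, and the random data are assumptions.

```python
# Minimal sketch of fused evaluation for one tile: the intermediate feature-map
# tile stays in a local (on-chip) buffer instead of being written to DRAM.
import numpy as np

def conv_valid(tile, weights):
    """'Valid' stride-1 convolution: tile is (H, W, Cin), weights is (Cout, k, k, Cin)."""
    cout, k, _, _ = weights.shape
    h, w, _ = tile.shape
    out = np.zeros((h - k + 1, w - k + 1, cout))
    for y in range(h - k + 1):
        for x in range(w - k + 1):
            window = tile[y:y + k, x:x + k, :]  # k x k x Cin input window
            out[y, x, :] = np.tensordot(weights, window, axes=([1, 2, 3], [0, 1, 2]))
    return out

def fused_tile(tile, w1, w2):
    """Run both fused layers on one tile; only the final output leaves the 'chip'."""
    intermediate = np.maximum(conv_valid(tile, w1), 0)   # 3x3xM tile, ReLU assumed, kept locally
    return np.maximum(conv_valid(intermediate, w2), 0)   # 1x1xP result

N, M, P = 3, 64, 64                    # illustrative channel counts
tile = np.random.rand(5, 5, N)         # one 5x5xN chunk read from DRAM
w1 = np.random.rand(M, 3, 3, N)        # layer 1: M sets of 3x3xN filters
w2 = np.random.rand(P, 3, 3, M)        # layer 2: P sets of 3x3xM filters
print(fused_tile(tile, w1, w2).shape)  # (1, 1, 64)
```

Sliding the tile across the full feature map re-uses most of each window; whether those overlapped intermediate values are recomputed or cached is exactly the trade-off analyzed in Section C.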