# Computation Reuse in DNNs by Exploiting Input Similarity
###### tags: `Accelerators`
###### paper origin: ISCA'18
###### paper: [link](https://dl.acm.org/doi/10.1109/ISCA.2018.00016)

## Introduction
### Motivation
* Popular applications, such as speech recognition or video classification, require multiple back-to-back executions of a DNN to process a sequence of inputs.
* Consecutive inputs exhibit a high degree of similarity, so the inputs/outputs of the different layers are extremely similar for successive frames of speech or images of a video.

![](https://i.imgur.com/T4qSjd8.png)

### Problem
* Previous hardware-accelerated DNN systems provide efficient implementations of fully-connected and convolutional layers, but they are optimized for an **isolated execution**.
* Many computations and memory accesses become redundant once successive executions of a DNN are taken into account, especially for applications that process a temporal sequence of inputs (e.g., speech, video).

### Solution
* We leverage the error tolerance of DNNs to boost the potential for reuse across successive executions of the DNN.
* Linear quantization of the inputs is a very effective mechanism to increase the ability of our technique to exploit redundancy.
* We propose a mechanism that computes the outputs of each DNN layer by reusing the buffered results from the previous execution.

![](https://i.imgur.com/GJFqAzU.png)

If the first two inputs are the same in both executions, the output can be computed more efficiently as:

![](https://i.imgur.com/ZnxlnCS.png)

## Temporal Reuse
![](https://i.imgur.com/Zh98TXM.png)
![](https://i.imgur.com/x2Mpzib.png)

Clusters used for the linear quantization of the inputs:
* 16 clusters for *Kaldi* and *EESEN*
* 32 clusters for *C3D* and *AutoPilot*

## Reuse-based DNN Accelerator
![](https://i.imgur.com/HlmDNgx.png)
### DNN Accelerator Overview
#### Compute Engine (CE)
* contains the functional units that perform the FP computations, including an array of FP multipliers and adders together with specialized functional units.
* also employed to quantize the inputs in our reuse scheme.
#### Control Unit (CU)
* contains the configuration of the DNN.
* stores the centroids of the clusters employed for the quantization of the inputs of each layer.
#### I/O Buffer
* an SRAM memory with two banks used to store the intermediate inputs/outputs between two DNN layers.
#### Data Master
* fetches the corresponding weights and inputs from the on-chip memories and dispatches them to the functional units in the CE.

### Computation Reuse in FC Layers
![](https://i.imgur.com/2qgm0f4.png)

### Computation Reuse in CONV Layers
![](https://i.imgur.com/TDUUg3w.png)
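The figures above sketch how the accelerator applies reuse to FC and CONV layers. As a software-level illustration of the FC case, here is a minimal sketch (assuming NumPy; the class/function names, the ReLU activation, and the uniform centroid range are my own illustration, not the paper's exact datapath): buffer the previous execution's quantized inputs and pre-activation outputs, and apply weight multiplications only for the inputs whose quantized value changed.

```python
import numpy as np

def quantize(x, centroids):
    """Map each input value to the index of its nearest centroid.
    With evenly spaced centroids this is the linear quantization used for reuse."""
    return np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)

class ReuseFCLayer:
    """FC layer that buffers the quantized inputs and pre-activation outputs of the
    previous execution and recomputes only the terms whose quantized input changed."""

    def __init__(self, weights, centroids):
        self.W = weights            # shape: (n_inputs, n_outputs)
        self.centroids = centroids  # shape: (n_clusters,)
        self.prev_idx = None        # quantized inputs of the previous execution
        self.prev_out = None        # pre-activation outputs of the previous execution

    def forward(self, x):
        idx = quantize(x, self.centroids)
        xq = self.centroids[idx]    # quantized input values (centroid values)
        if self.prev_idx is None:
            out = xq @ self.W       # first execution: full computation
        else:
            changed = np.flatnonzero(idx != self.prev_idx)
            # Reuse the buffered outputs; only changed inputs contribute a delta
            # of (new quantized value - old quantized value) times their weights.
            delta = xq[changed] - self.centroids[self.prev_idx[changed]]
            out = self.prev_out + delta @ self.W[changed]
        self.prev_idx, self.prev_out = idx, out
        return np.maximum(out, 0.0)  # ReLU activation

# Example: two consecutive, highly similar inputs (e.g., successive audio frames).
rng = np.random.default_rng(0)
layer = ReuseFCLayer(weights=rng.normal(size=(1024, 256)),
                     centroids=np.linspace(-1.0, 1.0, 16))  # 16 clusters
x0 = rng.uniform(-1.0, 1.0, size=1024)
x1 = x0 + rng.normal(0.0, 0.005, size=1024)  # next frame: small changes only
y0 = layer.forward(x0)                       # full computation
y1 = layer.forward(x1)                       # mostly reused, few deltas applied
```

With only 16 or 32 centroids per layer, most inputs of consecutive executions map to the same cluster, so `changed` is small and most multiply-accumulates and weight fetches are skipped.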
## Evaluation Methodology
* Computation reuse scheme
    * simulator
![](https://i.imgur.com/8WKYK5l.png)
* Area and energy consumption
    * Verilog implementation
    * 28/32 nm technology
    * CACTI-P for the memories
* Workloads
![](https://i.imgur.com/6FETai2.png)
* Baselines
    * CPU: Intel i7 7700K (energy consumption collected with the RAPL library)
    * GPU: NVIDIA GeForce GTX 1080 (power dissipation measured with nvidia-smi)

## Experimental Results
### Speedups achieved by the computation reuse scheme
![](https://i.imgur.com/FxoKmAS.png)
* The reduction in execution time comes from the inputs that remain unmodified with respect to the previous DNN execution, since they do not require any computation or memory access.
* The overhead of performing the quantization and comparing the current input with the previous one is fairly small.

### Normalized energy
![](https://i.imgur.com/5ZAELEr.png)
* The energy savings are well correlated with the degree of input similarity and computation reuse.

### Energy breakdown
![](https://i.imgur.com/PCYdN1r.png)
* The energy savings achieved by the reuse scheme are significant in all components, and are especially large in the on-chip eDRAM memory.

### Overhead
![](https://i.imgur.com/XDOuuhw.png)
* The area overhead of the accelerator is small, as the total area increases from 52mm^2^ to 53mm^2^ (about 2%).

### Comparison with a modern CPU and GPU
![](https://i.imgur.com/gGhzr40.png)
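All of the savings above hinge on how similar consecutive inputs are once quantized. As a rough, illustrative check (not part of the paper's methodology; the bin range and the synthetic frames are assumptions), one can measure the fraction of inputs whose cluster is unchanged between consecutive executions:

```python
import numpy as np

def cluster_indices(x, n_clusters=16, lo=-1.0, hi=1.0):
    """Linear quantization: assign each value to one of n_clusters equal-width bins."""
    step = (hi - lo) / n_clusters
    return np.clip(((x - lo) / step).astype(np.int64), 0, n_clusters - 1)

def reusable_fraction(prev_x, curr_x, n_clusters=16):
    """Fraction of inputs whose cluster is unchanged between two consecutive
    executions, i.e., the work the reuse scheme could skip for this layer."""
    same = cluster_indices(prev_x, n_clusters) == cluster_indices(curr_x, n_clusters)
    return float(np.mean(same))

# Synthetic consecutive frames: the second differs from the first by small noise.
rng = np.random.default_rng(1)
frame_a = rng.uniform(-1.0, 1.0, size=4096)
frame_b = frame_a + rng.normal(0.0, 0.01, size=4096)
print(f"reusable inputs: {reusable_fraction(frame_a, frame_b):.1%}")
```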