# Computation Reuse in DNNs by Exploiting Input Similarity
###### tags: `Accelerators`
###### paper origin: ISCA'18
###### paper: [link](https://dl.acm.org/doi/10.1109/ISCA.2018.00016)
## Introduction
### Motivation
* Popular applications, such as speech recognition or video classification, require multiple back-to-back executions of a DNN to process a sequence of inputs.
* Consecutive inputs exhibit a high degree of similarity, causing the inputs/outputs of the different layers to be extremely similar for successive frames of speech or images of a video.

### Problem
* Previous hardware-accelerated DNN systems provide efficient implementations of fully-connected and convolutional layers, optimized for an **isolated execution**.
* Many computations and memory accesses are redundant if we take into account successive executions of a DNN, especially for applications that process a temporal sequence of inputs (e.g., speech, video).
### Solution
* We leverage the error tolerance of DNNs to boost the potential for reuse across successive executions of the DNN.
* Linear quantization of the inputs is a very effective mechanism to increase the ability of our technique to exploit redundancy.
* We propose a mechanism that computes the outputs of each DNN layer by reusing the buffered results from the previous execution.

For a neuron whose output is a weighted sum of its inputs, if the first two inputs are the same in both executions, the output can be computed more efficiently as shown below:
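A sketch of the idea, assuming a three-input neuron with weights $w_1, w_2, w_3$ and the buffered output $y^{t-1}$ of the previous execution:

$$y^{t} = w_1 x_1^{t} + w_2 x_2^{t} + w_3 x_3^{t}$$

If $x_1^{t} = x_1^{t-1}$ and $x_2^{t} = x_2^{t-1}$, only the contribution of the changed input has to be recomputed:

$$y^{t} = y^{t-1} + w_3\,(x_3^{t} - x_3^{t-1})$$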

## Temporal Reuse


* The inputs of each layer are quantized (linear quantization) and compared with the quantized inputs of the previous execution; only inputs whose quantized value changed need to be recomputed (see the sketch after this list).
* Number of quantization clusters used per network:
    * 16 clusters for *Kaldi* and *EESEN*
    * 32 clusters for *C3D* and *AutoPilot*
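A minimal sketch (not the paper's implementation) of linear quantization into a fixed number of clusters followed by the input comparison; the `quantize` helper, the assumed input range and the synthetic data are illustrative only.

```python
import numpy as np

def quantize(x, n_clusters=16, x_min=-1.0, x_max=1.0):
    """Linearly quantize each input element into one of n_clusters levels.
    The [x_min, x_max] range is an assumption for illustration."""
    step = (x_max - x_min) / n_clusters
    idx = np.floor((x - x_min) / step).astype(np.int32)
    return np.clip(idx, 0, n_clusters - 1)   # cluster index per input element

# Reuse check between two consecutive executions of the same layer:
x_prev = np.random.uniform(-1, 1, 256)             # inputs of execution t-1
x_curr = x_prev + np.random.normal(0, 0.01, 256)   # similar inputs at execution t
changed = quantize(x_curr) != quantize(x_prev)     # only these need new work
print(f"reused inputs: {np.count_nonzero(~changed)} / {changed.size}")
```

With highly similar consecutive inputs (e.g., overlapping speech frames or successive video frames), most elements fall into the same cluster, so most of the layer's work can be skipped.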
## Reuse-based DNN Accelerator

### DNN Accelerator Overview
#### Compute Engine (CE)
* contains the functional units that perform the FP computations, including an array of FP multipliers and adders, as well as specialized functional units employed to quantize the inputs in the reuse scheme.
#### Control Unit (CU)
* contains the configuration of the DNN.
* stores the centroids of the clusters employed for the quantization of the inputs of each layer.
#### I/O Buffer
* an SRAM memory with two banks used to store the intermediate inputs/outputs between two DNN layers.
#### Data Master
* fetches the corresponding weights and inputs from the on-chip memories and dispatches them to the functional units in the CE.
### Computation Reuse in FC Layers
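A minimal NumPy sketch of how the delta update above extends to a whole FC layer; the `fc_with_reuse` helper and the dense-weight layout are illustrative assumptions rather than the paper's implementation, which performs the equivalent multiply-accumulates in the accelerator only for inputs whose quantized value changed.

```python
import numpy as np

def fc_with_reuse(W, x, prev_x, prev_y):
    """Compute the FC-layer output of execution t by updating the buffered
    output of execution t-1; only inputs that changed contribute new work.
    W: (out_features, in_features) weights, x / prev_x: current and previous
    (already quantized) inputs, prev_y: buffered outputs of execution t-1."""
    changed = x != prev_x
    if not np.any(changed):
        return prev_y.copy()                 # full reuse: no computation at all
    delta = x[changed] - prev_x[changed]
    # y_t = y_{t-1} + W[:, changed] @ (x_t - x_{t-1}), restricted to changed inputs
    return prev_y + W[:, changed] @ delta

# Quick consistency check against a full recomputation:
W = np.random.rand(4, 8)
x0 = np.random.rand(8); y0 = W @ x0          # execution t-1 (buffered)
x1 = x0.copy(); x1[6:] += 0.5                # only two inputs change
assert np.allclose(fc_with_reuse(W, x1, x0, y0), W @ x1)
```

Unchanged inputs require neither a weight fetch nor a multiply-accumulate, which is where the execution-time and energy savings reported below come from.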

### Computation Reuse in CONV Layers
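A 1-D sketch of the same idea for convolutional layers; the `conv_with_reuse` helper, the 1-D shape, stride 1 and the absence of padding are simplifying assumptions (the evaluated networks use multi-dimensional convolutions). The principle is that each changed input element adds its weighted delta to every output position it feeds.

```python
import numpy as np

def conv_with_reuse(K, x, prev_x, prev_y):
    """Update a 1-D sliding-window output (stride 1, no padding) from the
    buffered output of the previous execution: every changed input element
    adds K[j] * (x[i] - prev_x[i]) to each output position it contributes to."""
    y = prev_y.copy()
    for i in np.flatnonzero(x != prev_x):
        d = x[i] - prev_x[i]
        for j in range(len(K)):
            o = i - j                        # output position fed by x[i] via K[j]
            if 0 <= o < len(y):
                y[o] += K[j] * d
    return y

# Consistency check against a direct computation of execution t:
K, x0 = np.random.rand(3), np.random.rand(32)
y0 = np.array([K @ x0[o:o + 3] for o in range(30)])   # execution t-1 (buffered)
x1 = x0.copy(); x1[5] += 0.1                           # a single input changes
assert np.allclose(conv_with_reuse(K, x1, x0, y0), [K @ x1[o:o + 3] for o in range(30)])
```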

## Evaluation Methodology
* Computation reuse scheme: evaluated with a simulator of the accelerator.
* Area and energy consumption:
    * Verilog implementation synthesized in a 28/32nm technology
    * CACTI-P for the memories
* Workloads: *Kaldi*, *EESEN*, *C3D* and *AutoPilot*
* Baselines (see the measurement sketch below):
    * CPU: Intel i7 7700K (energy consumption collected with the RAPL library)
    * GPU: NVIDIA GeForce GTX 1080 (power dissipation measured with nvidia-smi)
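A sketch of how the baseline measurements could be reproduced on a Linux host; it reads the CPU package energy through the RAPL sysfs interface (rather than through the RAPL library the notes mention) and queries the GPU power with a standard `nvidia-smi` query. The paths, permissions and the workload hook are assumptions.

```python
import subprocess

# Package-level energy counter exposed by RAPL via sysfs (may require elevated permissions).
RAPL_COUNTER = "/sys/class/powercap/intel-rapl:0/energy_uj"

def cpu_energy_joules():
    with open(RAPL_COUNTER) as f:
        return int(f.read()) / 1e6           # microjoules -> joules

def gpu_power_watts():
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"])
    return float(out.decode().split()[0])

# Example: energy of a batch of back-to-back DNN executions on the CPU baseline.
e_start = cpu_energy_joules()
# ... run the DNN workload here ...
print(f"CPU package energy: {cpu_energy_joules() - e_start:.2f} J")
print(f"GPU power draw:     {gpu_power_watts():.1f} W")
```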
## Experimental Results
### Speedups achieved by the computation reuse scheme

* The reduction in execution time comes from the fact that inputs that remain unmodified with respect to the previous DNN execution do not require any computation or memory access.
* The overhead of performing the quantization and comparing the current input with the previous one is fairly small.
### Normalized energy

* The energy savings are well correlated with the degree of input similarity and computation reuse.
### Energy breakdown

* The energy savings achieved by the reuse scheme are significant in all components, and are especially large in the on-chip eDRAM memory.
### Overhead

* The overall area overhead of the accelerator is less than 1%, as it increases from 52mm^2^ to 53mm^2^.
### Comparison with a modern CPU and GPU
