# DUET: Boosting Deep Neural Network Efficiency on Dual-Module Architecture
###### tags: `Accelerators`
###### paper origin: MICRO 2020
###### papers: [link](https://ieeexplore.ieee.org/document/9251991)
###### video: `none`
###### slide: `none`
# 1. Introduction
### Motivation
* Deploying DNNs on modern hardware with stringent latency requirements and energy constraints is challenging.
* Not all activations in DNNs need accurately computed results.
For example, ReLU in CNNs and sigmoid and tanh in RNNs are **noise-resilient** in particular regions (e.g., the negative region of ReLU and the saturation regions of sigmoid/tanh), where small input errors barely change the output.

* As shown in Fig. 2, a large portion of activations are in the **insensitive** regions

# 2. Overview
### At algorithm level

* We propose the **dual-module processing method** and a **learning algorithm** to distill a lightweight approximate module from the original accurate module (i.e., the targeted DNN layer).
* approximate module : computes insensitive activations
* accurate module : computes sensitive activations
* y : the final output vector assembled from the two modules' results

* Design targets :
1. Far fewer computations and memory accesses than the original module.
2. Accurate approximation of the original module's outputs.
* Design method (a sketch of these steps follows this list) :
* Use random projection to reduce the input dimension.
> QDR = quantization and dimension reduction step
* Use the knowledge distillation method to train the approximate module.
> knowledge distillation learns the approximate module's weights by matching its outputs to the accurate module's outputs
* Design a threshold-based neuron-wise dynamic switching method to generate the switching map m.
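Below is a minimal NumPy sketch of the dual-module idea as described above, not the paper's exact implementation: a QDR step (quantization plus random projection), a cheap approximate module, and a threshold-based switching map that decides which neurons the accurate module recomputes. The names (`qdr`, `dual_module_layer`) and the simple symmetric quantizer and ReLU-style threshold are illustrative assumptions.

```python
import numpy as np

def qdr(x, P, scale, bits=4):
    """QDR step: quantize the input to low precision, then reduce its
    dimension with a random projection matrix P of shape (d_low, d_in)."""
    qmax = 2 ** (bits - 1) - 1
    xq = np.clip(np.round(x / scale), -qmax - 1, qmax)  # symmetric low-bit quantization
    return P @ xq                                       # low-dimensional, low-precision input

def dual_module_layer(x, W_acc, W_appr, P, scale, threshold=0.0):
    """Dual-module processing sketch: the approximate module predicts every
    neuron cheaply; the switching map m marks which neurons the accurate
    module recomputes; y assembles both results."""
    # Approximate module: low-dimensional, low-precision matmul + ReLU.
    y_appr = np.maximum(W_appr @ qdr(x, P, scale), 0.0)

    # Threshold-based neuron-wise dynamic switching map:
    # 1 -> sensitive neuron (recompute accurately), 0 -> keep the cheap result.
    m = (y_appr > threshold).astype(np.int8)

    # Accurate module: only the rows (neurons) flagged by m are computed.
    y = y_appr.copy()
    sensitive = np.flatnonzero(m)
    y[sensitive] = np.maximum(W_acc[sensitive] @ x, 0.0)
    return y, m
```

Training the approximate weights `W_appr` would use knowledge distillation, i.e., fitting `y_appr` to the accurate module's outputs on training data; only the inference-time routing is shown here.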
### At hardware level
* The proposed architecture is a **DU**al-modul**E** archi**T**ecture.
* **Speculator** : running approximate modules
* **Executor** : running accurate modules
* The decoupled executor-speculator design features fine-grained pipeline and balanced execution in the Executor.
* challenges :
* the Speculator could become the **new bottleneck** or increase the critical-path latency.
* **imbalanced workloads** caused by neuron-wise dynamic switching leave computing resources in the Executor underutilized.
### Contribution
* At algorithm level
* dual-module processing method, learning algorithm, and threshold-based neuron-wise dynamic switching method
* At hardware level
* a specialized dual-module architecture for general DNN layers,
i.e., a Speculator design with a fine-grained pipeline
* online adaptive mapping
* DUET achieves **2.24x speedup and 1.97x energy reduction** on average with balanced execution and reduced memory access. Moreover, the lightweight Speculator only consumes **6.6%** of the total area and **less than 7%** of the total energy consumption.
# 3. Architecture
### Overall

* on-chip global buffer (GLB)
* stores data for the Executor and Speculator :
input, output, weight, **switching maps, mapping configuration**, approximate speculation results
* Network-on-Chip
* 1 Y-bus and 17 X-buses
* The MC compares the row ID first and then the column ID.
* Unmatched X-buses and PEs are deactivated to save energy.
* MFU
* Performs non-linear activation (e.g., ReLU, tanh, and sigmoid) and generates the final approximate results.
* Speculator
* Generates approximate results (using outputs from the Executor to perform speculation) and dynamic switching maps.
* Executor
* Leverages the switching maps to reduce computations and memory access.
### Speculator
> design target : generate approximate results and switching maps that allow the Executor to reduce computations with negligible loss of accuracy.

1. quantization (a requantization sketch follows this list) :
INT16 (from the Executor's high-dimensional execution) -> INT4
2. Dimension reduction :
Multiplies the quantized input with the projection matrix P.
3. Speculation :
Use a 16×32 Systolic Array to conduct **INT4 inner-product operations**.
The MFU then calculates the final activated output and generates the switching map.


4. **adaptive mapping** :
* essential for CNN execution to help balance the PEs' workload
* Reorders the filter IDs that the Executor follows when computing the output feature map.
5. When processing RNNs, the approximate activations are also stored in the GLB.
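A minimal sketch of step 1's INT16 -> INT4 requantization, assuming a simple per-tensor arithmetic right shift with saturation; the actual scaling scheme is not given in these notes, so the shift amount is an assumed parameter.

```python
import numpy as np

def requantize_int16_to_int4(x_int16, shift=12):
    """Rescale INT16 activations to INT4 by an arithmetic right shift and a clamp.
    The shift amount (effectively the quantization scale) is an assumed parameter."""
    x = np.asarray(x_int16, dtype=np.int32) >> shift   # coarse rescale
    return np.clip(x, -8, 7).astype(np.int8)           # saturate to the INT4 range [-8, 7]

# Example: a few INT16 activations mapped into INT4.
print(requantize_int16_to_int4(np.array([32767, -32768, 4096, -100], dtype=np.int16)))
```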
### Executor
> design target : leverage the dynamic switching maps to reduce computation and memory access

* OMap = switching map from the Speculator
IMap = the OMap of the previous layer
* Ex : If IMap is 3x5x1 and the filter is 3x3x1, OMap is 1x3x1, and computing it densely takes 27 MAC operations
(3x3 kernel = 9 MACs per output, 9x3 = 27).
If **2/3** of the entries in both IMap and OMap are zeros, only about **27x1/3x1/3 = 3** MAC operations are needed in total (skip outputs with OMap = 0, and within each remaining output skip MACs on inputs with IMap = 0).
* correction step
* if a predicted-effectual neuron turns out to be ineffectual after ReLU, the switching index of that neuron is updated from 1 to 0 before being sent to the GLB.
* As a result, when the OMap is loaded as the IMap of the next layer, it has even **higher sparsity**, saving more computations (see the sketch below).
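The following sketch illustrates how the Executor could use IMap and OMap to skip MACs, and how the correction step tightens the OMap after ReLU. It is a plain-Python illustration of the idea (single channel, unit stride), not the PE-level dataflow; the function and argument names are mine.

```python
import numpy as np

def executor_conv(ifmap, imap, weights, omap, bias=0.0):
    """Sliding-window conv that skips MACs using the input/output switching maps.

    ifmap : (H, W) input feature map, imap : (H, W) 0/1 switching map of the inputs
    weights : (K, K) kernel, omap : (H-K+1, W-K+1) 0/1 map of speculated-effectual outputs
    Returns the output feature map and the corrected output switching map.
    """
    H, W = ifmap.shape
    K = weights.shape[0]
    out = np.zeros((H - K + 1, W - K + 1))
    omap_corrected = omap.copy()
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            if omap[i, j] == 0:
                continue                      # speculated ineffectual (ReLU output 0): skip all K*K MACs
            acc = bias
            for ki in range(K):
                for kj in range(K):
                    if imap[i + ki, j + kj]:  # skip MACs on ineffectual (zero) inputs
                        acc += ifmap[i + ki, j + kj] * weights[ki, kj]
            out[i, j] = max(acc, 0.0)         # ReLU
            if out[i, j] == 0.0:              # correction step: the prediction was wrong,
                omap_corrected[i, j] = 0      # mark the neuron ineffectual for the next layer
    return out, omap_corrected
```

On the 3x5 example above, with 2/3 of the entries of both maps zeroed, the inner loop body executes only about 3 multiply-accumulates on average.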
# 4. Dataflow
### Processing CNNs with Balanced Execution

* How it works
* Executor & Speculator will work in pipeline.
* Problem
* The key challenge of computation skipping in CNNs is the **workload imbalance** caused by irregular sparsity distribution of OMap.
* solved by adaptive mapping (a balancing sketch follows)
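As a sketch of what balancing by adaptive mapping can look like, the snippet below reorders filters with a greedy longest-processing-time assignment so that PEs receive similar effectual workloads. The paper's actual mapping algorithm may differ; the greedy heuristic, the function name, and the example numbers are assumptions for illustration.

```python
import heapq

def adaptive_mapping(workloads, num_pes):
    """Greedy (LPT-style) balancing sketch: assign each filter to the currently
    least-loaded PE, heaviest filters first. workloads[f] is the number of
    effectual outputs of filter f according to the switching map."""
    heap = [(0, pe) for pe in range(num_pes)]       # (accumulated load, PE id)
    heapq.heapify(heap)
    assignment = {pe: [] for pe in range(num_pes)}  # PE id -> ordered filter IDs
    for f in sorted(range(len(workloads)), key=lambda f: -workloads[f]):
        load, pe = heapq.heappop(heap)
        assignment[pe].append(f)
        heapq.heappush(heap, (load + workloads[f], pe))
    return assignment

# Example: 6 filters with skewed effectual-output counts mapped onto 2 PEs.
print(adaptive_mapping([90, 10, 80, 20, 60, 40], num_pes=2))
```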
### Reducing Memory Accesses and Computations on RNNs

* How it works
* Consider an LSTM network, whose inference is executed element by element and then layer by layer.
* Due to the **limited on-chip memory capacity**, each time we can only load the part of the weight matrix corresponding to a specific **gate**.
* Different from running CNN
* Apart from the switching maps, we also store the approximated results in the GLB.
* Problem
* **Off-chip memory access** greatly influences the overall performance and energy consumption.
* With the help of the fine-grained pipeline and the switching map (see the sketch after this list) :
1. The speculation of gates can be overlapped with (hidden behind) the Executor's computation.
2. If the switching map marks a neuron as ineffectual, both the **related computations** and the corresponding **weight-matrix rows loaded from DRAM** can be skipped.
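A sketch of point 2 for an LSTM gate: only the weight-matrix rows of neurons marked effectual by the switching map are fetched from DRAM and computed, while ineffectual neurons keep the Speculator's approximate results from the GLB. `load_gate_rows` is a hypothetical stand-in for the partial DRAM load; all names and the sigmoid gate are illustrative.

```python
import numpy as np

def lstm_gate_with_skipping(x, h, switch_map, approx_gate, load_gate_rows):
    """Compute one LSTM gate, skipping MACs and weight loads for ineffectual neurons.

    switch_map      : 0/1 vector over the gate's neurons (1 = compute accurately)
    approx_gate     : the Speculator's approximate gate values stored in the GLB
    load_gate_rows  : hypothetical stand-in that fetches only the requested
                      weight rows (Wx, Wh) and biases from DRAM
    """
    rows = np.flatnonzero(switch_map)          # effectual neurons only
    Wx, Wh, b = load_gate_rows(rows)           # skipped rows never leave DRAM
    gate = approx_gate.astype(float).copy()    # ineffectual neurons keep the approximate result
    gate[rows] = 1.0 / (1.0 + np.exp(-(Wx @ x + Wh @ h + b)))   # sigmoid on effectual neurons
    return gate
```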

# 5. Evaluation
### At the algorithm level

* Negligible quality degradation (an increase of one in perplexity) and a **1.89x** reduction of **off-chip weight data access**.
* 1% accuracy loss and a **3.33x/5.15x** reduction of **operations** (i.e., FLOPs) on AlexNet/ResNet18.
### Performance Speedup Analysis

> OS : output switching only
> BOS : balanced output switching
> IOS : integrated input and output switching maps
* Speedup : 1.2x (OS) -> 3.05x (DUET)
* BOS enhances speedup through adaptive mapping.
* IOS enhances it further because more computations can be skipped.
* MAC utilization : 30% (OS) -> 39% (DUET)
* BOS improves MAC utilization significantly thanks to the balanced mapping.
* IOS slightly decreases utilization because even more computations are removed.
### speedup and energy efficiency

> EDP = energy-delay-product
* DUET achieves **2.24x** average speedup on typical CNN and RNN models.
* thanks to the reduction of on-chip computations and on-chip/off-chip data access
* the QDR computations are low-dimensional, low-precision, and therefore low-cost.
### area consumption

* results from RTL synthesis with Design Compiler under 45nm technology
* The primary area consumption comes from the **on-chip memory buffers**, while the Speculator only accounts for **6.6%** of the area.
### Conclusion
1. **Total computations are reduced** with dual-module processing.
2. For CNNs : Hardware-efficient adaptive mapping ensures **high PE utilization** in the Executor.
3. For RNNs : Advanced switching-map generation greatly **reduces off-chip memory access** for memory-bound workloads.