# DUET: Boosting Deep Neural Network Efficiency on Dual-Module Architecture
###### tags: `Accelerators`
###### paper origin: MICRO 2020
###### papers: [link](https://ieeexplore.ieee.org/document/9251991)
###### video: `none`
###### slide: `none`
# 1. Introduction
### Motivation
* Deploying DNNs on modern hardware with stringent latency requirements and energy constraints is challenging.
* Not all activations in DNNs need accurately computed results.
For example, ReLU in CNNs and sigmoid and tanh in RNNs are **noise-resilient** in particular regions (e.g., the negative region of ReLU and the saturation regions of sigmoid/tanh), where small input errors barely change the output.

* As shown in Fig. 2, a large portion of activations are in the **insensitive** regions

# 2. Overview
### At algorithm level

* We propose the **dual-module processing method** and a **learning algorithm** to distill a lightweight approximate module from the original accurate module (i.e., the targeted DNN layer).
* approximate module : computes insensitive activations
* accurate module : computes sensitive activations
* y : the final output vector assembled from the two modules' results

* Design targets :
1. Far fewer computations and memory accesses than the original module.
2. Accurate approximation of the original module's outputs.
* Design method (a sketch of these steps follows this list) :
* Use random projection to reduce the input dimension.
> QDR = quantization and dimension reduction step
* Use the knowledge distillation method to train the approximate module.
> knowledge distillation learns the approximate module's weights by matching its outputs to the accurate module's outputs
* Design a threshold-based neuron-wise dynamic switching method to generate the switching map m.
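Below is a minimal NumPy sketch of the dual-module idea as described above, not the paper's exact implementation: a QDR step (quantization plus random projection), a cheap approximate module, and a threshold-based switching map that decides which neurons the accurate module recomputes. The names (`qdr`, `dual_module_layer`) and the simple symmetric quantizer and ReLU-style threshold are illustrative assumptions.

```python
import numpy as np

def qdr(x, P, scale, bits=4):
    """QDR step: quantize the input to low precision, then reduce its
    dimension with a random projection matrix P of shape (d_low, d_in)."""
    qmax = 2 ** (bits - 1) - 1
    xq = np.clip(np.round(x / scale), -qmax - 1, qmax)  # symmetric low-bit quantization
    return P @ xq                                       # low-dimensional, low-precision input

def dual_module_layer(x, W_acc, W_appr, P, scale, threshold=0.0):
    """Dual-module processing sketch: the approximate module predicts every
    neuron cheaply; the switching map m marks which neurons the accurate
    module recomputes; y assembles both results."""
    # Approximate module: low-dimensional, low-precision matmul + ReLU.
    y_appr = np.maximum(W_appr @ qdr(x, P, scale), 0.0)

    # Threshold-based neuron-wise dynamic switching map:
    # 1 -> sensitive neuron (recompute accurately), 0 -> keep the cheap result.
    m = (y_appr > threshold).astype(np.int8)

    # Accurate module: only the rows (neurons) flagged by m are computed.
    y = y_appr.copy()
    sensitive = np.flatnonzero(m)
    y[sensitive] = np.maximum(W_acc[sensitive] @ x, 0.0)
    return y, m
```

Training the approximate weights `W_appr` would use knowledge distillation, i.e., fitting `y_appr` to the accurate module's outputs on training data; only the inference-time routing is shown here.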
### At hardware level
* The proposed architecture is a **DU**al-modul**E** archi**T**ecture.
* **Speculator** : running approximate modules
* **Executor** : running accurate modules
* The decoupled executor-speculator design features fine-grained pipeline and balanced execution in the Executor.
* challenges :
* the Speculator could become the **new bottleneck** or increase the critical-path latency.
* **imbalanced workloads** caused by neuron-wise dynamic switching leave computing resources in the Executor underutilized.
### Contribution
* At algorithm level
* dual-module processing method, learning algorithm, and threshold-based neuron-wise dynamic switching method
* At hardware level
* a specialized dual-module architecture for general DNN layers,
i.e., a Speculator design with a fine-grained pipeline
* online adaptive mapping
* DUET achieves **2.24x speedup and 1.97x energy reduction** on average with balanced execution and reduced memory access. Moreover, the lightweight Speculator only consumes **6.6%** of the total area and **less than 7%** of the total energy consumption.
# 3. Architecture
### Overall

* on-chip global buffer (GLB)
* stores data for the Executor and Speculator :
input, output, weight, **switching maps, mapping configuration**, approximate speculation results
* Network-on-Chip
* 1 Y-bus and 17 X-buses
* The MC compares the row ID first and then the column ID.
* Unmatched X-buses and PEs are deactivated to save energy.
* MFU
* Performs non-linear activation (e.g., ReLU, tanh, and sigmoid) and generates the final approximate results.
* Speculator
* Generates approximate results (using outputs from the Executor to perform speculation) and dynamic switching maps.
* Executor
* Leverages the switching maps to reduce computations and memory access.
### Speculator
> design target : generate approximate results and switching maps that allow the Executor to reduce computations with negligible loss of accuracy.

1. quantization (a requantization sketch follows this list) :
INT16 (from the Executor's high-dimensional execution) -> INT4
2. Dimension reduction :
Multiplies the quantized input with the projection matrix P.
3. Speculation :
Use a 16×32 Systolic Array to conduct **INT4 inner-product operations**.
The MFU then calculates the final activated output and generates the switching map.


4. **adaptive mapping** :
* essential for CNN execution to help balance the PEs' workload
* Reorders the filter IDs that the Executor follows when computing the output feature map.
5. When processing RNNs, the approximate activations are also stored in the GLB.
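A minimal sketch of step 1's INT16 -> INT4 requantization, assuming a simple per-tensor arithmetic right shift with saturation; the actual scaling scheme is not given in these notes, so the shift amount is an assumed parameter.

```python
import numpy as np

def requantize_int16_to_int4(x_int16, shift=12):
    """Rescale INT16 activations to INT4 by an arithmetic right shift and a clamp.
    The shift amount (effectively the quantization scale) is an assumed parameter."""
    x = np.asarray(x_int16, dtype=np.int32) >> shift   # coarse rescale
    return np.clip(x, -8, 7).astype(np.int8)           # saturate to the INT4 range [-8, 7]

# Example: a few INT16 activations mapped into INT4.
print(requantize_int16_to_int4(np.array([32767, -32768, 4096, -100], dtype=np.int16)))
```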
### Executor
> design target : leverage the dynamic switching maps to reduce computation and memory access

* OMap = switching map from the Speculator
IMap = the OMap of the previous layer
* Ex : If IMap is 3x5x1 and the filter is 3x3x1, OMap is 1x3x1, and computing it densely takes 27 MAC operations
(3x3 kernel = 9 MACs per output, 9x3 = 27).
If **2/3** of the entries in both IMap and OMap are zeros, only about **27x1/3x1/3 = 3** MAC operations are needed in total (skip outputs with OMap = 0, and within each remaining output skip MACs on inputs with IMap = 0).
* correction step
* if a predicted-effectual neuron turns out to be ineffectual after ReLU, the switching index of that neuron is updated from 1 to 0 before being sent to the GLB.
* As a result, when the OMap is loaded as the IMap of the next layer, it has even **higher sparsity**, saving more computations (see the sketch below).
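The following sketch illustrates how the Executor could use IMap and OMap to skip MACs, and how the correction step tightens the OMap after ReLU. It is a plain-Python illustration of the idea (single channel, unit stride), not the PE-level dataflow; the function and argument names are mine.

```python
import numpy as np

def executor_conv(ifmap, imap, weights, omap, bias=0.0):
    """Sliding-window conv that skips MACs using the input/output switching maps.

    ifmap : (H, W) input feature map, imap : (H, W) 0/1 switching map of the inputs
    weights : (K, K) kernel, omap : (H-K+1, W-K+1) 0/1 map of speculated-effectual outputs
    Returns the output feature map and the corrected output switching map.
    """
    H, W = ifmap.shape
    K = weights.shape[0]
    out = np.zeros((H - K + 1, W - K + 1))
    omap_corrected = omap.copy()
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            if omap[i, j] == 0:
                continue                      # speculated ineffectual (ReLU output 0): skip all K*K MACs
            acc = bias
            for ki in range(K):
                for kj in range(K):
                    if imap[i + ki, j + kj]:  # skip MACs on ineffectual (zero) inputs
                        acc += ifmap[i + ki, j + kj] * weights[ki, kj]
            out[i, j] = max(acc, 0.0)         # ReLU
            if out[i, j] == 0.0:              # correction step: the prediction was wrong,
                omap_corrected[i, j] = 0      # mark the neuron ineffectual for the next layer
    return out, omap_corrected
```

On the 3x5 example above, with 2/3 of the entries of both maps zeroed, the inner loop body executes only about 3 multiply-accumulates on average.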
# 4. Dataflow
### Processing CNNs with Balanced Execution

* How it works
* Executor & Speculator will work in pipeline.
* Problem
* The key challenge of computation skipping in CNNs is the **workload imbalance** caused by irregular sparsity distribution of OMap.
* solved by adaptive mapping (a balancing sketch follows)
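As a sketch of what balancing by adaptive mapping can look like, the snippet below reorders filters with a greedy longest-processing-time assignment so that PEs receive similar effectual workloads. The paper's actual mapping algorithm may differ; the greedy heuristic, the function name, and the example numbers are assumptions for illustration.

```python
import heapq

def adaptive_mapping(workloads, num_pes):
    """Greedy (LPT-style) balancing sketch: assign each filter to the currently
    least-loaded PE, heaviest filters first. workloads[f] is the number of
    effectual outputs of filter f according to the switching map."""
    heap = [(0, pe) for pe in range(num_pes)]       # (accumulated load, PE id)
    heapq.heapify(heap)
    assignment = {pe: [] for pe in range(num_pes)}  # PE id -> ordered filter IDs
    for f in sorted(range(len(workloads)), key=lambda f: -workloads[f]):
        load, pe = heapq.heappop(heap)
        assignment[pe].append(f)
        heapq.heappush(heap, (load + workloads[f], pe))
    return assignment

# Example: 6 filters with skewed effectual-output counts mapped onto 2 PEs.
print(adaptive_mapping([90, 10, 80, 20, 60, 40], num_pes=2))
```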
### Reducing Memory Accesses and Computations on RNNs

* How it works
* Consider an LSTM network, whose inference is executed element by element and then layer by layer.
* Due to the **limited on-chip memory capacity**, each time we can only load the part of the weight matrix corresponding to a specific **gate**.
* Different from running CNN
* Apart from the switching maps, we also store the approximated results in the GLB.
* Problem
* **Off-chip memory access** greatly influences the overall performance and energy consumption.
* With the help of the fine-grained pipeline and the switching map (see the sketch after this list) :
1. The speculation of gates can be overlapped with (hidden behind) the Executor's computation.
2. If the switching map marks a neuron as ineffectual, both the **related computations** and the corresponding **weight-matrix rows loaded from DRAM** can be skipped.
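A sketch of point 2 for an LSTM gate: only the weight-matrix rows of neurons marked effectual by the switching map are fetched from DRAM and computed, while ineffectual neurons keep the Speculator's approximate results from the GLB. `load_gate_rows` is a hypothetical stand-in for the partial DRAM load; all names and the sigmoid gate are illustrative.

```python
import numpy as np

def lstm_gate_with_skipping(x, h, switch_map, approx_gate, load_gate_rows):
    """Compute one LSTM gate, skipping MACs and weight loads for ineffectual neurons.

    switch_map      : 0/1 vector over the gate's neurons (1 = compute accurately)
    approx_gate     : the Speculator's approximate gate values stored in the GLB
    load_gate_rows  : hypothetical stand-in that fetches only the requested
                      weight rows (Wx, Wh) and biases from DRAM
    """
    rows = np.flatnonzero(switch_map)          # effectual neurons only
    Wx, Wh, b = load_gate_rows(rows)           # skipped rows never leave DRAM
    gate = approx_gate.astype(float).copy()    # ineffectual neurons keep the approximate result
    gate[rows] = 1.0 / (1.0 + np.exp(-(Wx @ x + Wh @ h + b)))   # sigmoid on effectual neurons
    return gate
```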

# 5. Evaluation
### At the algorithm level

* Negligible quality degradation (an increase of one in perplexity) and a **1.89x** reduction of **off-chip weight data access**.
* 1% accuracy loss and a **3.33x/5.15x** reduction of **operations** (i.e., FLOPs) on AlexNet/ResNet18.
### Performance Speedup Analysis

> OS : output switching only
> BOS : balanced output switching
> IOS : integrated input and output switching maps
* Speedup : 1.2x (OS) -> 3.05x (DUET)
* BOS enhances speedup through adaptive mapping.
* IOS enhances it further because more computations can be skipped.
* MAC utilization : 30% (OS) -> 39% (DUET)
* BOS improves MAC utilization significantly thanks to the balanced mapping.
* IOS slightly decreases utilization because even more computations are removed.
### speedup and energy efficiency

> EDP = energy-delay-product
* DUET achieves **2.24x** average speedup on typical CNN and RNN models.
* thanks to the reduction of on-chip computations and on-chip/off-chip data access
* the QDR computations are low-dimensional, low-precision, and therefore low-cost.
### area consumption

* results from RTL synthesis with Design Compiler under 45nm technology
* The primary area consumption comes from the **on-chip memory buffers**, while the Speculator only accounts for **6.6%** of the area.
### Conclusion
1. **Total computations are reduced** with dual-module processing.
2. For CNNs : Hardware-efficient adaptive mapping ensures **high PE utilization** in the Executor.
3. For RNNs : Advanced switching-map generation greatly **reduces off-chip memory access** for memory-bound workloads.