# Ptolemy: Architecture Support for Robust Deep Learning
###### tags: `Defense` `Architecture`
###### paper origin: MICRO 2020
###### papers: [link](https://arxiv.org/pdf/2008.09954.pdf)
###### video: [link](https://www.youtube.com/watch?v=IGWTg7QxMkM&ab_channel=HorizonResearch)
###### slide: [link](https://speakerdeck.com/horizonlab/ptolemy-architecture-support-for-robust-deep-learning-7c87e1b7-e956-43c5-9465-b9c0ae5c715d)
###### github: [link](https://github.com/Ptolemy-DL/Ptolemy)
# 1. INTRODUCTION
## Research Problems
* Today’s countermeasures to adversarial attacks **either cannot detect adversarial samples at inference time, or introduce prohibitively high overhead to be practical at inference time** (e.g., adversarial retraining, redundancy-based defenses => high overhead!).
## Proposed Solutions
* They aim to enable **fast and accurate systems** that can detect adversarial examples at inference time so that proper measures can be taken.
* This paper proposes **Ptolemy**, **an algorithm-architecture co-design system that detects adversarial attacks at inference time with low overhead and high accuracy**. This enables applications to reject incorrect results produced by adversarial attacks during inference.
## Contributions
* We **propose a novel static-dynamic collaborative approach for adversarial detection** by exploiting unique program execution characteristics of DNN inference that have largely been ignored before.
* We **present a general algorithmic framework**, along with **a high-level programming interface**, that allows programmers to explore key algorithm design knobs to navigate the accuracy-efficiency trade-off space.
* We demonstrate that, with a **carefully-designed ISA, compiler optimizations** can enable efficient detection by exploiting the unique parallelisms and redundancies exposed by our detection algorithm framework.
* We present **programmable hardware** to achieve low-latency online adversarial defense with principled extensions to existing DNN accelerators.
# 2. Implementation
## Flow Chart

## Ptolemy Compiler and ISA
* **Hides and reduces the detection overhead** by exploiting the unique parallelisms and redundancies exposed by the detection algorithms.
* With the aggressive **compile-time optimizations** and **a well-defined ISA**, detection algorithms can be implemented on top of existing DNN accelerators with a set of basic, yet principled, **hardware extensions**, further widening the applicability of Ptolemy.
## Ptolemy Algorithmic
### Intuition and Key Concepts
* Inputs that are correctly predicted as the same class tend to activate a unique set of neurons distinctive from that of other inputs.
* A sequence of activated neurons is analogous to the sequence of basic blocks exercised by an input in a conventional program, where frequently exercised basic block sequences form “hot paths”.
* Treat a DNN as an imperative program, and leverage its runtime paths (sequence of neurons) to guide adversarial sample detection.
### Important Neurons
* Denote a set of neurons that contribute significantly to the inference output.
* Important neurons are extracted in a backward fashion. The last layer has only one important neuron, which is the neuron n that corresponds to the predicted class.
### Activation Paths

* Extracting important neurons from a fully connected layer (left) and a convolution layer (middle), and constructing the activation path from important neurons across layers (right). Activation paths are input-specific. This figure illustrates backward extraction using a cumulative threshold. Forward extraction would start from the first layer rather than from the last layer. Absolute thresholding would select important neurons based on absolute partial sums rather than cumulative partial sums.
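A minimal sketch of cumulative-threshold backward extraction for one fully connected layer, assuming NumPy arrays for the layer's weights and input activations (function and variable names are illustrative, not Ptolemy's API, and corner cases such as negative partial sums are ignored):

```python
import numpy as np

def backward_extract_fc(x, W, important_out, theta=0.5):
    """Backward cumulative-threshold extraction for one FC layer (sketch).

    x: input activations of the layer, shape (N,)
    W: weight matrix, shape (N, M)
    important_out: indices of important neurons in the layer's output
    Returns the set of important input-neuron indices.
    """
    important_in = set()
    for j in important_out:
        partial_sums = x * W[:, j]              # partial sums feeding output neuron j
        order = np.argsort(partial_sums)[::-1]  # largest contributions first
        total = partial_sums.sum()
        acc = 0.0
        for i in order:
            important_in.add(int(i))
            acc += partial_sums[i]
            if total > 0 and acc >= theta * total:  # cumulative threshold theta reached
                break
    return important_in
```

Starting from the single important neuron of the last layer (the predicted class), this routine would be applied layer by layer toward the input, and the per-layer important-neuron sets concatenated to form the activation path.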
### Class Path Similarity

* With θ = 0.5, class paths are significantly different from each other.
* The average inter-class path similarity is only 36.2% (max 38.2%, 90th-percentile 36.6%) for AlexNet on ImageNet and 61.2% (max 65.1%, 90th-percentile 63.4%) for ResNet18 on CIFAR-10, suggesting that class paths are distinctive (one possible similarity computation is sketched after this list).
* The class path similarity is much higher on CIFAR-10 than on ImageNet. This is because ImageNet has 1,000 classes that cover a wide range of objects, while CIFAR-10 has only 10 classes that are similar to each other (e.g., cat vs. dog). Ten randomly picked ImageNet classes are therefore more likely to differ from each other than the 10 classes in CIFAR-10. Across all 1,000 classes in ImageNet, the maximum inter-class path similarity is still only 0.44, suggesting that our random sampling of ImageNet is representative.
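The exact similarity metric is not reproduced in this note; a simple sketch that encodes each class path as a boolean mask over neurons and computes a Jaccard-style overlap might look like this (illustrative only):

```python
import numpy as np

def path_similarity(path_a: np.ndarray, path_b: np.ndarray) -> float:
    """Jaccard-style similarity between two paths encoded as boolean neuron masks."""
    intersection = np.logical_and(path_a, path_b).sum()
    union = np.logical_or(path_a, path_b).sum()
    return float(intersection) / union if union else 0.0
```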
### Algorithm Framework

* Adversarial detection algorithm framework. It provides a range of knobs for path extraction, which dominates the runtime overhead. Note that the path extraction methods in the offline and online phases must match.
* Their profiling method can easily integrate new training samples, whose activation paths would simply be aggregated (OR-ed) with the existing class paths without having to re-generate the entire class paths from scratch (see the sketch after this list).
* Activation paths are extracted only after the entire DNN inference finishes, because the identification of important neurons starts from the predicted class in the last layer and propagates backward.
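A high-level sketch of the offline/online split, assuming paths are boolean masks and `extract_path` implements one of the extraction variants discussed below (names and the single-feature classifier input are illustrative assumptions):

```python
import numpy as np

def build_class_paths(training_samples, extract_path, num_classes, num_neurons):
    """Offline phase: OR-aggregate per-input activation paths into class paths."""
    class_paths = np.zeros((num_classes, num_neurons), dtype=bool)
    for x, label in training_samples:
        class_paths[label] |= extract_path(x)   # new samples are simply OR-ed in
    return class_paths

def detect(x, predicted_class, extract_path, class_paths, classifier):
    """Online phase: compare the input's activation path with its class path."""
    path = extract_path(x)                      # must match the offline extraction method
    overlap = np.logical_and(path, class_paths[predicted_class]).sum() / max(path.sum(), 1)
    return classifier.predict([[overlap]])[0]   # e.g., a random forest; 1 = adversarial
```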
### Cost Analysis
* A pure software implementation of the detection algorithm introduces 15.4x and 50.7x overhead over inference on AlexNet and ResNet50, respectively.
## Algorithmic Knobs and Variants (reduce the detection cost)
### Hiding Detection Cost: Extraction Direction
* The key to the new algorithm is to extract important neurons in a forward rather than a backward manner.
* In the original backward extraction process, we use the important neurons in layer Li’s output (which is equivalent to layer Li+1’s input) to identify the important neurons in layer Li’s input.
* In our new forward extraction process, as soon as layer Li finishes inference, we determine the important neurons in its output by simply ranking output neurons according to their numerical values and selecting the largest ones, instead of waiting until after the extraction of layer Li+1 (see the sketch after this list).
* In this way, the extraction of important neurons at layer Li and the inference of layer Li+1 can be overlapped.
* Hiding the extraction latency cannot reduce the energy overhead.
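A sketch of the forward selection step at one layer: once the layer's output is available, the important output neurons are picked by ranking output values against the cumulative threshold, so the next layer's inference can proceed concurrently (illustrative only):

```python
import numpy as np

def forward_extract_important(output_acts, theta=0.5):
    """Pick the smallest set of largest output activations whose sum reaches theta of the total."""
    flat = output_acts.ravel()
    order = np.argsort(flat)[::-1]          # rank output neurons by value, largest first
    total = flat.sum()
    acc, important = 0.0, []
    for idx in order:
        important.append(int(idx))
        acc += flat[idx]
        if total > 0 and acc >= theta * total:
            break
    return important
```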
### Reducing Detection Cost: Thresholding Mechanism
* To reduce the detection cost, we propose to extract important neurons using absolute thresholds rather than cumulative thresholds. Whenever a partial sum is generated during inference, it is compared against an absolute threshold.
* Write single-bit masks rather than partial sums to memory (sketched after this list).
* Using absolute thresholds with single-bit masks significantly reduces both the compute and memory costs.
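A sketch of the absolute-threshold variant: each partial sum is compared against a fixed threshold as it is produced, and only a one-bit mask is kept per partial sum instead of the value itself (illustrative):

```python
import numpy as np

def absolute_threshold_masks(partial_sums, abs_threshold):
    """Return a boolean importance mask: True where a partial sum exceeds the absolute threshold."""
    return partial_sums > abs_threshold   # one comparison per partial sum; 1 bit stored each
```

Compared with storing 16-bit partial sums for later sorting, only one bit per partial sum is written back, which is where the memory saving comes from.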
### Reducing Detection Cost: Selective Extraction
* Extract important neurons from just the last few layers.
* When combined with forward extraction, this is equivalent to starting extraction later (“late-start”); when combined with backward extraction, this is equivalent to terminating extraction earlier (“early-termination”).
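A minimal sketch of the selective-extraction knob, assuming a simple per-layer extraction loop (names are illustrative):

```python
def extract_selected_layers(layers, extract_layer, num_extracted_layers):
    """Extract important neurons only from the last `num_extracted_layers` layers.

    With forward extraction this is a "late start"; with backward extraction
    it is an "early termination" after the same number of layers.
    """
    start = max(len(layers) - num_extracted_layers, 0)
    return [extract_layer(layer) for layer in layers[start:]]
```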
## Programming Interface
* ISA

* Python-based programming interface

## ISA and Compiler Optimizations
* Ptolemy provides a custom CISC-like ISA to allow efficient mapping from high-level detection algorithms to the hardware architecture.
* A 24-bit fixed-length encoding with 16 general-purpose registers (a hypothetical decode sketch follows).
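A hypothetical decoder for the fixed-length encoding, assuming an 8-bit opcode followed by three 4-bit register fields (matching the 16 general-purpose registers) and a 4-bit immediate; the real Ptolemy field layout is not specified in this note:

```python
def decode_instruction(word: int):
    """Decode a 24-bit Ptolemy-style instruction word (hypothetical field layout)."""
    assert 0 <= word < (1 << 24), "instructions are 24-bit fixed length"
    opcode = (word >> 16) & 0xFF   # bits [23:16]
    rd     = (word >> 12) & 0xF    # bits [15:12], one of 16 registers
    rs1    = (word >> 8)  & 0xF    # bits [11:8]
    rs2    = (word >> 4)  & 0xF    # bits [7:4]
    imm    = word & 0xF            # bits [3:0]
    return opcode, rd, rs1, rs2, imm
```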

* Example

## Code Generation and Optimization
* Instruction scheduling examples. (The code in (b) is the extraction block simplified in (a))
### Layer-Level Pipelining (a)
* A key characteristic of algorithms that use the forward extraction method is that inference and extraction of different layers can be overlapped.
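A functional sketch of layer-level pipelining in software, overlapping layer i's extraction with layer i+1's inference on a worker thread (the hardware achieves this with pipelined instructions; `run_layer` and `extract_layer` are assumed callables):

```python
from concurrent.futures import ThreadPoolExecutor

def pipelined_inference_and_extraction(layers, run_layer, extract_layer, x):
    """Overlap extraction of layer i with inference of layer i+1 (forward extraction only)."""
    paths, pending, activations = [], None, x
    with ThreadPoolExecutor(max_workers=1) as extractor:
        for layer in layers:
            activations = run_layer(layer, activations)              # inference of layer i
            if pending is not None:
                paths.append(pending.result())                       # collect layer i-1's extraction
            pending = extractor.submit(extract_layer, activations)   # extract layer i in background
        if pending is not None:
            paths.append(pending.result())
    return activations, paths
```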
### Neuron-Level Pipelining (b)
* Similar to layer-level pipelining, our compiler will also automatically pipeline the extraction of different important neurons within a layer.
### Trading-off Compute for Memory
* Algorithms that use cumulative thresholds have a high memory cost because all the partial sums must be stored to memory (the green table).
* Propose to use redundant computation to reduce memory overhead. Instead of storing all the partial sums during inference, we re-compute the partial sums during the extraction process only for the receptive fields that are known to correspond to important neurons in the output feature map. The compiler implements this by generating csps instructions to re-compute partial sums.
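A sketch of the recompute optimization for a convolution layer: instead of saving every partial sum during inference, partial sums are re-derived only for the receptive fields of output neurons already known to be important (illustrative NumPy, not the csps hardware path):

```python
import numpy as np

def recompute_partial_sums(input_fmap, weights, important_outputs, stride=1):
    """Re-compute partial sums only for important output neurons of a conv layer.

    input_fmap: (C, H, W) input feature map
    weights:    (K, C, R, S) convolution kernels
    important_outputs: iterable of (k, y, x) coordinates of important output neurons
    Returns {(k, y, x): partial sums over the receptive field, shape (C, R, S)}.
    """
    _, _, R, S = weights.shape
    psums = {}
    for k, y, x in important_outputs:
        receptive = input_fmap[:, y * stride:y * stride + R, x * stride:x * stride + S]
        psums[(k, y, x)] = receptive * weights[k]   # elementwise products = partial sums
    return psums
```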
## Ptolemy Architecture

### Enhanced DNN Accelerator
* Ptolemy can be integrated into general DNN accelerator designs. Without loss of generality, we assume **a TPU-like systolic array** design. Each PE consists of **two 16-bit input** registers, **a 16-bit fixed-point MAC unit** with **a 32-bit accumulator register**, and simple control logic.

### Path Constructor Accelerator
* The goal of the path constructor is to extract important neurons and to construct activation paths. Algorithms that use cumulative thresholds require sorting partial sums in receptive fields.

### Controller
* **A micro-controller unit** (MCU).
### Dispatching Instructions
* The compiled programs can be interpreted on the MCU (i.e., software decoding) efficiently while avoiding extra hardware cost.
* The largest one, which uses cumulative thresholds and backward extraction, is **about 30 static instructions (below 100 bytes)**.
### Classification
* Our particular **Random Forest** implementation uses 100 decision trees, each of which has an average depth of 12. In total, RF consumes **about 2,000 operations on AlexNet** (five orders of magnitude lower than inference), and could execute on an MCU in microseconds.
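A sketch of this classification step with scikit-learn's RandomForestClassifier using the quoted hyper-parameters; the feature layout (one path-similarity value per layer) and the placeholder data are assumptions, not the paper's exact feature set:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Assumed feature layout: one path-similarity value per layer, per input (placeholder data).
X_train = np.random.rand(1000, 8)
y_train = np.random.randint(0, 2, 1000)     # 1 = adversarial, 0 = benign

# 100 trees as reported; max_depth=12 approximates the average tree depth of 12.
rf = RandomForestClassifier(n_estimators=100, max_depth=12)
rf.fit(X_train, y_train)

x_new = np.random.rand(1, 8)                # similarity vector of a new input
print(rf.predict(x_new))                    # 1 => flag as adversarial
```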
# Evaluation Methodology
## ~~Experimental Setup~~ (skip)
## Evaluation Plan

## Baselines: EP, CDRP, DeepFense
* Both EP and CDRP leverage class-level sparsity. CDRP requires retraining and thus cannot detect adversaries at inference time.
* DeepFense represents a class of detection mechanisms that use modular redundancy. DeepFense employs multiple latent models as redundancies. We directly use the accuracy results reported in their papers.
# 3. Result
## Area Overhead
* On top of the baseline DNN accelerator, Ptolemy introduces a **total area overhead of 5.2%** (0.08 mm²), of which 3.9% is contributed by the additional SRAM. The rest of the area overhead is attributed to the MAC unit augmentation (0.4%) and other logic (0.9%).
## DRAM Space
* With the recompute optimization, AlexNet, ResNet18, and VGG19 **require only an extra 12.8 MB, 17.6 MB, and 148.0 MB** in DRAM.
* The additional DRAM traffic is **less than 0.1%**.
## Accuracy


## Latency and Energy

## Theta Compare

## Compare With DeepFense


## BwCu vs. FwAb
* BwCu

* FwAb

## Performance Variation with Hardware Resources

## Large Model Evaluation
* On **VGG16 and Inception-V4**, the average inter-class path similarity **on ImageNet** is only **41.5% and 28.8%**, respectively, indicating that important neurons exist and class paths are unique in these models.
* We also applied our detection scheme to DenseNet and achieved 100% detection accuracy with a 0% false positive rate (FPR), higher than the previous best accuracy of 96% with 3.8% FPR. We use detection accuracy and false positive rate instead of AUC in order to directly compare with the referenced method. We also evaluated ResNet50 on ImageNet using BwCu; the accuracy is 0.900, which is more accurate than EP (0.898).
# 4. Report