# Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism
###### paper origin: ISCA 2017
###### Link: [Paper](https://cgi.luddy.indiana.edu/~lukefahr/papers/jiecaoyu_isca17.pdf)
## Motivation:
As DNNs continue to grow in size and complexity, their energy footprint grows with them. Weight pruning is a common technique for reducing model size and computation, but the resulting sparsity is irregular and can hurt execution performance on real hardware.
## Main Challenge:
How can we customize DNN pruning to underlying hardware parallelism to improve performance and reduce energy consumption?
## Solution:
The Scalpel approach proposes two techniques for customizing DNN pruning to different hardware platforms based on their parallelism: SIMD-aware weight pruning and node pruning.
* **SIMD-aware weight pruning** is applied to low-parallelism hardware to fully utilize SIMD units.
* **Node pruning** removes redundant nodes on high-parallelism hardware. This approach improves overall performance and reduces energy consumption.
### DNN Structure

In FC layers, every input value is connected to every neuron. CONV layers, as shown in Figure 1, consist of a stack of 2D matrices called feature maps.
Computationally, FC layers perform matrix-vector multiplication, while CONV layers perform matrix-matrix multiplication.
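As a concrete illustration, here is a minimal NumPy sketch (not from the paper; all shapes are made-up examples) of the two computations: an FC layer reduces to a matrix-vector product, and a CONV layer lowered via im2col becomes a matrix-matrix product.

```python
import numpy as np

# FC layer: weight matrix W (out_features x in_features) times input vector x
W = np.random.randn(128, 256)
x = np.random.randn(256)
fc_out = W @ x                                  # matrix-vector multiplication

# CONV layer lowered via im2col: each column of X_cols is one unrolled input patch,
# so the convolution becomes a matrix-matrix multiplication
K = np.random.randn(64, 3 * 3 * 32)             # 64 filters, 3x3 kernels, 32 input maps
X_cols = np.random.randn(3 * 3 * 32, 28 * 28)   # unrolled input patches
conv_out = K @ X_cols                           # matrix-matrix multiplication
```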
## Challenge
### 1. Sparse weight matrices require substantial extra data to record the sparse matrix format

* A holds all the nonzero values.
* IA records the index into A of the first nonzero element in each row of W.
* JA stores the column indexes of the nonzero elements.
Since the index array JA has the same size as the data array A, more than half of the data stored in the CSR format goes to recording the matrix structure rather than the weights themselves (see the sketch below).
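To make this overhead concrete, here is a small sketch using SciPy's CSR implementation (an illustration only; the matrix values are made up and the paper does not use SciPy):

```python
import numpy as np
from scipy.sparse import csr_matrix

W = np.array([[0.0, 1.5, 0.0, 2.0],
              [0.0, 0.0, 0.0, 0.0],
              [3.0, 0.0, 0.0, 4.5]])

csr = csr_matrix(W)
A  = csr.data      # nonzero values:            [1.5, 2.0, 3.0, 4.5]
IA = csr.indptr    # row pointers into A:       [0, 2, 2, 4]
JA = csr.indices   # column index per nonzero:  [1, 3, 0, 3]

# JA has one entry per nonzero, i.e. the same length as A, so the index
# arrays roughly double the memory footprint of the pruned weights.
print(len(A), len(JA))   # 4 4
```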
### 2. Weight pruning can hurt DNN computation performance

### A breakdown of the execution time for the dense and the sparse DNN models on the CPU

### The numbers of cache accesses and cache misses for this layer broken down by loads and stores

## Approach
### Overview of Scalpel

* All general-purpose hardware platforms are divided into three categories based on their internal parallelism: low parallelism, moderate parallelism, and high parallelism.
* For low-parallelism hardware, SIMD-aware weight pruning is applied.
* For high-parallelism hardware, node pruning is applied.
* For hardware with moderate parallelism, Scalpel applies a combination of SIMD-aware weight pruning and node pruning (see the dispatch sketch after this list).
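A minimal sketch of this platform-based dispatch; the three category names come from the paper, but the function and the strategy labels are purely illustrative:

```python
def choose_pruning(parallelism: str):
    """Pick the Scalpel pruning technique(s) for a hardware class (illustrative)."""
    if parallelism == "low":        # e.g., microcontrollers
        return ["simd_aware_weight_pruning"]
    if parallelism == "high":       # e.g., GPUs
        return ["node_pruning"]
    if parallelism == "moderate":   # e.g., desktop CPUs: combine both techniques
        return ["node_pruning", "simd_aware_weight_pruning"]
    raise ValueError(f"unknown parallelism class: {parallelism}")
```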
### SIMD-Aware Weight Pruning


* A' stores all the nonzero weight groups in their original order.
* IA' records the index into A' of the first nonzero element in each row of W.
* JA' stores the starting column index of each group (a small sketch follows this list).
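A rough sketch of this grouped CSR encoding, assuming a SIMD width of 2 (matching the 2-way SIMD units of the Cortex-M4 used later). The array names mirror A'/IA'/JA' above, but the code itself is an illustration rather than the paper's implementation:

```python
import numpy as np

SIMD_WIDTH = 2   # weights are kept or pruned in aligned groups of this size

def simd_aware_csr(W, simd_width=SIMD_WIDTH):
    """Encode W keeping only aligned weight groups that contain a nonzero.

    Returns (A', IA', JA'): one column index is stored per group instead of
    one per nonzero, cutting the index overhead by the SIMD width.
    """
    A, IA, JA = [], [0], []
    for row in W:
        for col in range(0, len(row), simd_width):
            group = row[col:col + simd_width]
            if np.any(group != 0):            # keep the whole group, padding zeros included
                A.extend(group.tolist())
                JA.append(col)
        IA.append(len(A))                     # index into A' of the next row's first element
    return np.array(A), np.array(IA), np.array(JA)

W = np.array([[0.0, 0.0, 1.2, 0.7],
              [0.5, 0.0, 0.0, 0.0]])
A, IA, JA = simd_aware_csr(W)
# A  = [1.2, 0.7, 0.5, 0.0]   two kept groups of width 2
# IA = [0, 2, 4]              row pointers into A'
# JA = [2, 0]                 one starting column index per group
```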
### The peak performance benefit from SIMD-aware weight pruning.

### Node Pruning
### The relative execution time of sparse matrix-matrix multiplication on GPU against the pruning rate.

To avoid this performance decrease, node pruning removes DNN redundancy by removing entire nodes instead of individual weights. It uses mask layers to dynamically identify unimportant nodes and block their outputs. The blocked nodes are removed after the mask layers are trained. Once all redundant nodes have been removed, the mask layers themselves are removed, and the network is retrained to obtain the pruned DNN model (a simplified mask-layer sketch follows).
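A simplified PyTorch-style sketch of a mask layer. This is illustrative only: in the paper the 0/1 gate is derived from a trainable parameter (with two thresholds) during training, a detail this sketch omits.

```python
import torch
import torch.nn as nn

class MaskLayer(nn.Module):
    """Per-node gate placed after an FC/CONV layer (simplified sketch).

    Each output node i gets a score beta_i; nodes whose score drops below a
    threshold are blocked (output forced to zero) and can later be removed
    from the network together with their weights.
    """
    def __init__(self, num_nodes, threshold=0.5):
        super().__init__()
        self.beta = nn.Parameter(torch.ones(num_nodes))
        self.threshold = threshold

    def forward(self, x):
        # Hard 0/1 gate per node; blocked nodes contribute nothing downstream.
        # (Training beta through this hard gate would need a straight-through
        #  estimator, which this sketch omits.)
        alpha = (self.beta > self.threshold).float()
        return x * alpha

fc = nn.Linear(256, 128)
mask = MaskLayer(128)
y = mask(fc(torch.randn(4, 256)))   # outputs of blocked nodes are zeroed out
```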
### Main steps of node pruning

### Mask layers

### Combined Pruning
### The relative execution times of sparse matrix-vector and sparse matrix-matrix multiplication on the Intel Core i7-6700 CPU

## Experiment Methodology
### Hardware platforms
1. Microcontroller - low parallelism. We use an ARM Cortex-M4 microcontroller, which has a 3-stage in-order pipeline and 2-way SIMD units for 16-bit fixed-point numbers.
2. CPU - moderate parallelism. We use an Intel Core i7-6700 CPU, a Skylake-class core that supports 8-way SIMD instructions for 32-bit floating-point numbers.
3. GPU - high parallelism. We use an NVIDIA GTX Titan X, a state-of-the-art GPU for deep learning that is included in the NVIDIA DIGITS Deep Learning DevBox machine.
### Benchmarks

### Experiment baselines
We use the original dense DNN models as the baseline and compare three schemes:
1. Traditional pruning
2. Optimized pruning
3. Scalpel
## Evaluation Result
### Overview

### Microcontroller - Low Parallelism
#### The relative performance speedups and relative model sizes of the original models, traditional pruning, optimized pruning and Scalpel.

#### The curves of relative accuracy against pruning rate for the three metrics

### CPU - Moderate Parallelism
#### The relative performance speedups and relative model sizes of the original models, traditional pruning and Scalpel.
 
### GPU - High Parallelism
#### The relative performance speedups of the original models, traditional pruning, optimized pruning and Scalpel.

For all networks except ConvNet and NIN, the models generated by Scalpel are larger than those generated by traditional and optimized pruning.
#### The percentage of nodes we can remove from each layer in DNNs

## Related Work
In the Scalpel paper, the authors provided a survey of existing DNN pruning methods. They pointed out that existing pruning methods often suffer from the following issues: 1) they may destroy the structure of DNNs, leading to performance degradation; 2) they may introduce sparsity, increasing storage and computation overheads; 3) they may not fully exploit the parallelism and computational power of hardware platforms. To address these issues, the authors proposed the Scalpel method.
## Conclusion:
In the Scalpel paper, the authors proposed a new DNN pruning method, Scalpel, which customizes pruning to the underlying hardware through two techniques (SIMD-aware weight pruning and node pruning) to improve performance and reduce energy consumption. The authors compared Scalpel with traditional pruning and optimized pruning on three hardware platforms with different degrees of parallelism. The experimental results showed that Scalpel achieved better performance and efficiency on all three platforms. Scalpel is therefore a promising DNN pruning method that deserves further exploration and application in future research.