# ZCOMP: Reducing DNN Cross-Layer Memory Footprint Using Vector Extensions

###### tags: `Accelerators`
###### paper origin: MICRO 52
###### papers: [link](https://dl.acm.org/doi/pdf/10.1145/3352460.3358305)
###### slides and video: `none`

# 1. INTRODUCTION

## Problem

* Although DNNs are perceived as compute-intensive tasks, they also apply **intense pressure on the capacity and bandwidth of the memory hierarchy**, primarily due to the large intermediate data communicated across network layers.
* They observe that the **DNN intermediate data is either sequentially streamed or reshaped with a regular transformation between layers**. Hence, accesses to this data can tolerate a sequential or block-sequential **compression/expansion without requiring random element retrieval**.
* There are two types of sparsity in DNNs: model sparsity and feature map (or activation map) sparsity. **Model sparsity arises from redundancy, zero values, and over-precision of weights.** **Feature map compression cannot be achieved with offline processing**: the activation data needs to be dynamically compressed as it is being generated, which is more challenging than model compression.
* Zero value ratio and memory footprint for VGG-16 (batch size 64); ReLU and drop-out produce lots of zeros.
![](https://i.imgur.com/yOcE1y4.png)
* CPU cycle breakdown for DNN benchmarks and memory footprint of key data structures for DNN benchmarks.
![](https://i.imgur.com/8pur0c4.png)

## Solution

* They propose **ZCOMP, a CPU vector ISA extension tailored for DNN cross-layer communication.**
* ZCOMP compactly encodes **zero values in feature maps (compression/expansion)** and **fully automates the metadata generation**, storage, and retrieval, which **eliminates the need for several extra instruction executions and register usage**.
* ZCOMP can be targeted at both **inference and training** to dynamically compress/expand cross-layer data before it is written to memory.
* ZCOMP offers **substantial data traffic reduction**, both on-chip across the cache hierarchy and off-chip to DRAM, **and outperforms existing AVX512-based compression approaches**.

## Contributions

* They evaluate feature map memory footprints, their sparsity, and their impact on the memory subsystem for DNN workloads running on CPUs.
* They analyze existing AVX512-ISA approaches for feature map compression and demonstrate their limitations.
* They propose ZCOMP, an efficient vector ISA extension tailored for feature map compression, and explain its microarchitecture and software interface.
* They evaluate ZCOMP through detailed experimental studies using the Intel Caffe ReLU activation layer for several DeepBench configurations and full networks from Google TensorFlow.
* Their results show that ZCOMP offers an average 31% reduction in memory traffic and an average 11% performance improvement for training popular DNNs.

# 2. Implementation

## Interleaved Header `zcomps` and `zcompl`

* Micro-architecture for the `zcomps` and `zcompl` (SIMD vector) instructions with an interleaved header (512-bit vector of fp32 elements).
![](https://i.imgur.com/MZdB5rB.png)
* CCF = comparison condition flag.
* Note that while the vector size after expansion is known, the source data can have an arbitrary size depending on its compression ratio. However, the information regarding the source data size is already contained in the header.
![](https://i.imgur.com/LBdFpCq.png)
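To make the interleaved-header format more concrete, here is a minimal scalar C sketch of the compression step that a `zcomps`-style instruction performs in hardware. It assumes one 16-bit non-zero bitmask per 512-bit vector (16 fp32 elements) as the header, with the surviving elements packed immediately after it; the paper's exact header encoding may differ.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of one interleaved-header compression step (hypothetical layout):
 * 16 fp32 inputs -> 16-bit non-zero mask (header) + packed non-zero values.
 * Returns the number of bytes written to dst. */
static size_t zcomps_model(const float src[16], uint8_t *dst) {
    uint16_t mask = 0;
    float packed[16];
    int n = 0;
    for (int i = 0; i < 16; i++) {
        if (src[i] != 0.0f) {               /* the CCF comparison against zero */
            mask |= (uint16_t)(1u << i);    /* record which lanes are kept     */
            packed[n++] = src[i];           /* pack only the non-zero values   */
        }
    }
    memcpy(dst, &mask, sizeof mask);                       /* interleaved header */
    memcpy(dst + sizeof mask, packed, n * sizeof(float));  /* compressed payload */
    return sizeof mask + n * sizeof(float);
}
```

Expansion (`zcompl`) would read the header first, so the compressed payload size is known before the non-zero values are scattered back to their original lanes; this is why no extra registers or instructions are needed for metadata handling.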
## Separate Header `zcomps` and `zcompl`

* The interleaved-header approach aims to fit the headers and the compressed data into the original memory space. However, **if the compressibility of the data is completely unknown, separately and explicitly allocating memory for the metadata is desirable**.
* To achieve this goal, the second variation of the ZCOMP instructions, separate-header, decouples the metadata (header) from the compressed data storage and retrieval.
* The additional register reg3 holds the pointer to the separate header store.
    * `zcomps [reg2], reg1, [reg3] #CCF`
    * `zcompl reg1, [reg2], [reg3]`

## ZCOMP USAGE IN DNN PROCESSING

* **ZCOMP provides a software-friendly interface** for drop-in replacement of store and load instructions with zcomps and zcompl for DNN cross-layer data transfers. With ZCOMP, the software does not need to manage compression metadata or check whether the data is actually compressible.
* The feature map compression ratios are dynamic parameters that depend on the inputs, so the memory space cannot be sized before the inputs pass through the layers, unless the intermediate buffers are allocated/deallocated for every input batch, which can lead to performance issues.
* **If the data is not compressible enough, a memory violation can happen**, since the header plus poorly compressible data may exceed the originally allocated buffer.
* **If the compressibility is completely unknown, there are two options for memory allocation.** The first option is to keep using the interleaved header, but to enlarge the memory allocations for the intermediate data buffers to account for the full uncompressed data plus the metadata size. The second option is to use the separate-header variant, where the metadata is completely decoupled from the data.

## Software API via Intrinsic Functions

![](https://i.imgur.com/QODutBi.png)

## ZCOMP parallelization approaches

![](https://i.imgur.com/ZTfOKmc.png)

## AVX512 and ZCOMP Usage Comparison

In each of the examples, parallelization is expressed via OpenMP pragma directives.

### ZCOMP zcomps and zcompl

* Storing ReLU-generated feature maps by using the proposed zcomps instruction.
![](https://i.imgur.com/pHv7Vhs.png)
* Retrieving cross-layer feature maps by using the proposed zcompl instruction.
![](https://i.imgur.com/1RmTZ11.png)

### AVX512 vcompress and vexpand

* Storing ReLU-generated feature maps by using the AVX512 compress instruction.
![](https://i.imgur.com/NYOZyLx.png)
* Retrieving cross-layer feature maps by using the AVX512 expand instruction.
![](https://i.imgur.com/p34lPBA.png)

### Comparison

* They observe that **AVX512 vcompress and vexpand require 5-6 extra static scalar/vector instructions inside the loop body** and **use 4-5 additional registers** compared to ZCOMP. Hence, over many iterations, the AVX512 implementations incur larger dynamic instruction overheads.
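For contrast, the sketch below shows roughly what the AVX512-only compress-store path looks like when rebuilt with standard intrinsics (a reconstruction, not the paper's exact listing): the mask generation, the explicit metadata store, the population count, and the output-pointer bookkeeping are all extra instructions and registers that `zcomps` folds into a single operation.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* ReLU plus compressed store of one 16-element fp32 block using plain AVX512.
 * 'masks' is the separately managed metadata array and 'out_pos' tracks the
 * compressed output offset; both are bookkeeping that ZCOMP automates.
 * Returns the updated output offset. */
static size_t relu_compress_block(const float *in, float *out,
                                  uint16_t *masks, size_t block_idx,
                                  size_t out_pos) {
    __m512 zero = _mm512_setzero_ps();
    __m512 v    = _mm512_max_ps(_mm512_loadu_ps(in), zero);   /* ReLU              */
    __mmask16 m = _mm512_cmp_ps_mask(v, zero, _CMP_NEQ_OQ);   /* non-zero lanes    */
    masks[block_idx] = (uint16_t)m;                           /* explicit metadata */
    _mm512_mask_compressstoreu_ps(out + out_pos, m, v);       /* packed store      */
    return out_pos + (size_t)_mm_popcnt_u32((unsigned)m);     /* advance pointer   */
}
```

The load path similarly needs `_mm512_maskz_expandloadu_ps` plus the same mask and offset bookkeeping, which is where the 5-6 extra instructions and 4-5 extra registers per iteration come from.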
## Simulation Methodology

![](https://i.imgur.com/Pbn4xWJ.png)
* **Sniper** multi-core simulator.
* Since existing compilers cannot generate ZCOMP instructions, their methodology is to **use AVX512 store/load instructions in place of zcomps/zcompl but label them differently**. When a labelled AVX512 store/load is encountered, the default behavior is overridden and the instruction is treated as zcomps/zcompl.

# 3. Result

* ZCOMP benefits over existing AVX512 instructions for ReLU activation layers with DeepBench input shapes.
![](https://i.imgur.com/smdz6fX.png)
* ZCOMP data traffic reduction for full networks.
![](https://i.imgur.com/22Pm4Kf.png)
* ZCOMP speed-up for full networks.
![](https://i.imgur.com/yA9iqf9.png)
* ZCOMP vs. cache compression.
![](https://i.imgur.com/iF2K4xo.png)
    * Cache compression is a promising technique to increase on-chip cache capacity and to decrease on-chip and off-chip bandwidth usage. The cache compression options here use the Frequent Pattern Compression with Limited Dictionary (FPC-D) algorithm.
* Explicit management of the masks, **storage/retrieval from separate memory locations, and executing more dynamic instructions lead to higher data traffic** for avx512-comp compared to ZCOMP.

# 4. Conclusion

* In this paper, the authors introduce the ZCOMP vector ISA extension to mitigate the cross-layer memory footprint on CPUs. Targeted at bulk sequential feature map compression, ZCOMP instructions provide high-throughput parallel execution, transparent interaction with the cache hierarchy and virtual memory, and a flexible software interface. On full networks, compared to an uncompressed baseline, ZCOMP reduces average memory traffic by 31% and 23% for training and inference, respectively, leading to average performance improvements of 11% and 3%. These gains are much higher than the 4% (training) and -2% (inference) achieved with AVX512 vector extensions.

# 5. Discussion

* The Intel team's effort to promote running deep learning workloads on CPUs is helpful for us (for our topic).
* I think the improved cache compression ratio is a side effect of implementing this method, but the benefit from the cache seems negligible in deep learning applications, so we probably do not need to explore it in depth.
* A difference from our work is that they target sparse matrices, whereas we have not considered matrix-sparsity issues; they also focus on the cross-layer part. I think that with a weight-stationary approach, it is hard to overcome the sparse matrix problem within a single layer's computation. For now, we could add a simple hardware check to skip computations whose inputs are zero, and in the future we might add extra encoding in our memory mapping to handle zeros, which would save SRAM buffer space; however, the vector registers would still contain zeros.
* This paper demonstrates the feasibility of CPU vector ISA extensions for deep learning and provides a good method for cross-layer compression.