# Duplo: Lifting Redundant Memory Accesses of Deep Neural Networks for GPU Tensor Cores
###### tags: `Accelerators`
###### paper origin: 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
###### papers: [link](https://ieeexplore.ieee.org/document/9251984)
###### slides and video: `none`
###### reference: [Tensor-core](https://hackmd.io/D0dKHBkbR2C8XKr63zRH8w?view)
# 1. INTRODUCTION
## Why use tensor core?
* Many networks spend **more than 99% of execution time on processing the convolution** operations (e.g., ResNet, GAN, YOLO).
* **Tensor cores process GEMM-based convolutions very fast**.
## Convolution in tensor core
### Convolution

### Lowering convolution

### GEMM Convolution

### GEMM in tensor core

* Accelerating a convolution using tensor cores follows the methodology of GEMM-based convolution, where multi-dimensional image data (i.e., width, height, channels, batches) are expanded into a large workspace matrix. A pair of tensor cores works on a 16×16 tile of GEMM (i.e., D = A×B+C), and each of the four 8-thread octets in a warp produces a quarter of the output matrix. A pair of threadgroups in an octet works in tandem using FEDPs (four-element dot products) to generate an 8×8 tile.
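To make the lowering step concrete, below is a minimal NumPy sketch of GEMM-based convolution: an `im2col`-style lowering followed by a single matrix multiply. The function names, the single-image/no-padding setup, and the shapes are illustrative assumptions, not the cuDNN or tensor-core code path.

```python
import numpy as np

def im2col(x, kh, kw, stride=1):
    """Lower an input of shape (C, H, W) into a workspace matrix.

    Each row of the workspace is one flattened (C*kh*kw) patch, so
    overlapping patches copy the same input values to different
    workspace addresses; this is the redundancy Duplo targets.
    """
    c, h, w = x.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    rows = []
    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, i*stride:i*stride+kh, j*stride:j*stride+kw]
            rows.append(patch.reshape(-1))
    return np.stack(rows)                      # shape: (out_h*out_w, C*kh*kw)

def gemm_conv(x, weights, stride=1):
    """GEMM-based convolution: one matmul over the lowered workspace."""
    k, c, kh, kw = weights.shape               # K filters of shape (C, kh, kw)
    workspace = im2col(x, kh, kw, stride)
    wmat = weights.reshape(k, -1)              # (K, C*kh*kw)
    out = workspace @ wmat.T                   # (out_h*out_w, K)
    out_h = (x.shape[1] - kh) // stride + 1
    return out.T.reshape(k, out_h, -1)         # (K, out_h, out_w)
```

With a 3×3 filter and stride 1, each interior input element appears in up to nine different workspace rows, which illustrates where the increase in memory transactions mentioned below comes from.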
## Research Problems
* The underlying operations of tensor cores represent **GEMM calculations**, and **lowering a convolution can effectively exploit the tensor cores** by transforming deeply nested convolution loops into matrix multiplication. However, lowering the convolution has a critical drawback since **it requires a larger memory space** (or workspace) to compute the matrix multiplication, **where the expanded workspace inevitably creates multiple duplicates of the same data stored at different memory addresses**.
* **Lowering convolutions** (see the figure above) **provides about a 14.5x performance speedup** for ResNet over the simplest direct convolution method. However, the speedup comes **at the cost of an 8.4x increase in the number of memory transactions** needed to load the replicates of the same data created during the lowering process.


## Proposed Solutions
* The proposed **Duplo architecture** tackles this challenge by leveraging compile-time information and microarchitectural supports to **detect and eliminate redundant memory accesses that repeatedly load the duplicates of data in the workspace matrix**.
* **Duplo identifies data duplication based on memory addresses and convolution information generated by a compiler. It uses a load history buffer (LHB) to trace the recent load history of workspace data and their presence in the register file. Every load instruction of workspace data consults the LHB to find whether copies of the same data potentially exist in the register file**.
* **If data duplicates are found, Duplo simply renames registers and makes them point to the ones containing the same values instead of issuing memory requests to load the same data**.
# 2. Implementation
## Identify Data

* The proposed method uses a patch ID and an element ID to identify data duplication created by the vertical and horizontal striding of a filter. Labeling patches and workspace elements in this way enables Duplo to distinguish them, since duplicate patches or elements are assigned identical IDs.
* A patch ID is given as **patch_id = patch_row_idx * stride_dist + patch_col_idx**
* The row index of a patch is **patch_row_idx = worksp_row_idx / output_height**
* The column index is **patch_col_idx = worksp_col_idx / filter_width**
* **worksp_row_idx = array_idx / (filter_width * filter_height)**
* **worksp_col_idx = array_idx % (filter_width * filter_height)**
* The patch ID is then used as an offset: **offset = patch_id * input_width * num_channels**
* **element_id = worksp_row_idx % output_width * num_channels * stride_dist + worksp_col_idx % (filter_width * num_channels) + offset**
* An element ID must be paired with a batch ID: **batch_id = worksp_row_idx / (output_width * output_height)**
* The proposed method generates and **uses the pair of batch_id and element_id to determine the uniqueness** of workspace data.
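A literal Python transcription of the formulas above, shown only to make the arithmetic explicit. Integer division and left-to-right operator grouping are assumed, and `conv` is a hypothetical parameter dictionary rather than the paper's hardware interface.

```python
def workspace_ids(array_idx, conv):
    """Map a workspace element index to (batch_id, element_id).

    `conv` holds the convolution parameters named in the formulas above:
    filter_width, filter_height, input_width, output_width, output_height,
    num_channels, stride_dist.
    """
    fw, fh = conv["filter_width"], conv["filter_height"]

    worksp_row_idx = array_idx // (fw * fh)
    worksp_col_idx = array_idx % (fw * fh)

    patch_row_idx = worksp_row_idx // conv["output_height"]
    patch_col_idx = worksp_col_idx // fw
    patch_id = patch_row_idx * conv["stride_dist"] + patch_col_idx

    offset = patch_id * conv["input_width"] * conv["num_channels"]
    element_id = (worksp_row_idx % conv["output_width"]
                  * conv["num_channels"] * conv["stride_dist"]
                  + worksp_col_idx % (fw * conv["num_channels"])
                  + offset)

    batch_id = worksp_row_idx // (conv["output_width"] * conv["output_height"])
    # Two workspace elements with the same (batch_id, element_id) are duplicates.
    return batch_id, element_id
```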
## DUPLO

* Duplo is a GPU architecture that resolves the data duplication problem of GEMM-based convolution using tensor cores.
* Duplo leverages compiler support to generate a set of convolution information encompassing the dimensions of the input data and filters, the striding distance, the batch size, and the starting address of the workspace. The information is initially stored in the global memory of the GPU and retrieved when a convolution kernel is launched. The convolution information generated by the compiler totals only 32 bytes per kernel.
* If a tensor-core-load instruction finds duplicate data present by referring to the LHB, Duplo replaces its destination register with the one indicated by the LHB entry and updates the register renaming table accordingly. The register-renaming scheme enables Duplo to bypass the redundant memory request upon the detection of data duplication.

* A detection unit consists of an ID generator and a load history buffer (LHB). The ID generator calculates a pair of batch and element IDs from the memory address of a tensor-core-load instruction. The generated IDs are used as tags and also for indexing the LHB, whose entries record the history of preceding tensor-core-load instructions and the locations of the loaded values in the register file (see the sketch after this list).
* An Example of Duplo Workflow Using Load History Buffer (LHB) to Detect Data Duplication:

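Below is a minimal software model of this detection flow, assuming a direct-mapped LHB indexed by a hash of the (batch_id, element_id) pair and a simple renaming table; the class and function names are illustrative, not the paper's hardware design.

```python
class LoadHistoryBuffer:
    """Direct-mapped LHB: each entry keeps an ID tag and a destination register."""

    def __init__(self, num_entries=1024):
        self.num_entries = num_entries
        self.entries = [None] * num_entries     # each entry: (tag, dest_reg)

    def lookup_or_insert(self, batch_id, element_id, dest_reg):
        """Return the register already holding this datum, or None on a miss."""
        key = (batch_id, element_id)
        idx = hash(key) % self.num_entries
        entry = self.entries[idx]
        if entry is not None and entry[0] == key:
            return entry[1]                     # hit: the value is already in a register
        self.entries[idx] = (key, dest_reg)     # miss (or conflict): record this load
        return None

def issue_tensor_core_load(lhb, rename_table, addr_to_ids, addr, dest_reg):
    """On an LHB hit, rename dest_reg instead of issuing a memory request."""
    batch_id, element_id = addr_to_ids(addr)    # ID generator (formulas in Sec. 2)
    existing_reg = lhb.lookup_or_insert(batch_id, element_id, dest_reg)
    if existing_reg is not None:
        rename_table[dest_reg] = existing_reg   # bypass the redundant memory request
        return "renamed"
    return "load"                               # fall through to the memory hierarchy
```

In the paper's design the generated IDs directly serve as the tag and index bits; the `hash` call here only stands in for that mapping.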
# 3. Result
* **Performance improvements** of Duplo with variable-sized LHBs for ResNet [6], GAN [31], and YOLO [33]. The performance improvements are **dependent on the LHB size**, and an infinite-sized LHB (i.e., oracle) delivers a **25.9%** performance **speedup** on average.

* Duplo **eliminates about 76% of tensor-core-load instructions** on average and **decreases the number of LDST stall cycles by 24.0%**.
* **Hit rate** of the direct-mapped LHB with its size varied between 256 and 2,048 entries. The **LHB hit rate is governed by the buffer size and the convolution parameters**: the LHB **favors larger buffer sizes and shorter striding distances** of convolution filters, while **increasing the number of channels or the batch size has a negative impact on the LHB hit rate**.

* **Breakdown** of data services along the memory hierarchy of the baseline (B) and Duplo (D) **with a 1,024-entry**, direct-mapped LHB. The graph shows which component in the memory hierarchy supplies data. The results show that the **LHB saves a good portion of the DRAM traffic and turns the savings into performance gains and energy reductions**.

* **Performance impacts of set-associative LHBs**. The results show that the 8-way LHB provides only 3.6% better performance than the direct-mapped LHB. **Since a series of tensor-core-load instructions tends to access sequentially aligned data, neighboring tensor-core-load instructions are well dispersed across disjoint LHB entries.**

* **Performance implications of variable-sized batches**. Using a larger batch proportionally increases the size of the workspace. The results show that increasing the batch size from 8 to 32 images decreases the overall performance by 8.2%: since **there is no data duplication created across different batch images**, the larger workspace only adds pressure on a fixed-size LHB. This can be mitigated by employing a larger LHB.

* Comparison of network-level execution time between the baseline (B) and Duplo (D). **On average, Duplo reduces the execution time for inference and training by 22.7% and 8.3%**, respectively.

# 4. Report
* **Duplo saves a good portion of off-chip DRAM transactions and L2 cache accesses and turns them into in-SM register renaming**, which helps improve performance and reduce energy consumption in GPUs.
* **A simple direct-mapped buffer is sufficient to filter redundant tensor-core-load instructions because of their sequential access patterns**. The buffer size presents a tradeoff between performance and overhead, which the paper leaves as a design option.
* Increasing the number of channels or batch images does not necessarily hurt the performance of Duplo. **The experimental results show that Duplo can effectively handle large workspace data**.
# Supplement
## Winograd Convolution
* Direct computation of Winograd's classic F(2,3) case (two outputs of a 3-tap filter) takes **6 multiplications and 4 additions**; the Winograd transform produces the same outputs with **4 multiplications and 8 additions**, trading expensive multiplications for cheaper additions: (6*,4+) => (4*,8+).
#### Algo:


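A minimal NumPy sketch of 1-D Winograd F(2,3) using the standard transform matrices; it reproduces the 4-multiplication version of the 2-output, 3-tap convolution discussed above.

```python
import numpy as np

# Standard F(2,3) transform matrices: 2 outputs of a 3-tap filter
# computed with only 4 element-wise multiplications.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """y = A^T [ (G g) * (B^T d) ], with an element-wise product in the middle."""
    m = (G @ g) * (BT @ d)       # the 4 multiplications
    return AT @ m                # output transform: additions only

d = np.array([1.0, 2.0, 3.0, 4.0])            # 4 input samples
g = np.array([1.0, 0.5, -1.0])                # 3-tap filter
print(winograd_f23(d, g))                     # [-1.  -0.5]
print(np.convolve(d, g[::-1], mode='valid'))  # same result via direct correlation
```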
---
## FFT
#### DFT

#### Time-domain convolution = frequency-domain multiplication

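A quick NumPy check of the convolution theorem: circular convolution in the time domain equals element-wise multiplication of the DFTs. The length-8 signals are arbitrary.

```python
import numpy as np

n = 8
x = np.random.rand(n)
h = np.random.rand(n)

# Time domain: circular convolution computed directly from its definition.
direct = np.array([sum(x[m] * h[(k - m) % n] for m in range(n)) for k in range(n)])

# Frequency domain: multiply the DFTs element-wise, then inverse-transform.
via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real

print(np.allclose(direct, via_fft))   # True
```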
---
## NCHW & NHWC
* Organizing convolution data in the NCHW order is known to be better exploited by conventional CUDA cores, but the cuDNN library mandates the NHWC order for the tensor cores to work.
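For reference, converting between the two layouts is just an axis permutation; a minimal NumPy sketch with illustrative shapes:

```python
import numpy as np

nchw = np.random.rand(8, 3, 224, 224)                    # (batch, channels, height, width)
nhwc = np.ascontiguousarray(nchw.transpose(0, 2, 3, 1))  # physically reorder to (batch, height, width, channels)
assert nhwc.shape == (8, 224, 224, 3)

back = np.ascontiguousarray(nhwc.transpose(0, 3, 1, 2))  # back to NCHW
assert np.array_equal(nchw, back)                        # same values, different memory layout
```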