---
title: A Systematic Methodology for Characterizing Scalability of DNN Accelerators using SCALE-Sim
tags: Accelerators
---

##### Paper: [A Systematic Methodology for Characterizing Scalability of DNN Accelerators using SCALE-Sim](https://cpb-us-w2.wpmucdn.com/sites.gatech.edu/dist/c/332/files/2020/03/scalesim_ispass2020.pdf)

##### Github: [ARM-software/SCALE-Sim](https://github.com/ARM-software/SCALE-Sim)

## Purpose

Use SCALE-Sim to simulate **on-chip memory access**, **runtime**, and **DRAM bandwidth** for the two DNN accelerator scaling strategies, scale-up and scale-out, and analyze the resulting data to derive runtime performance, bandwidth requirements, and energy consumption, so as to guide architecture design decisions.

## Scaling Strategy

### 1. Scale-up

Add more MAC units to a single monolithic array, increasing the amount of computation performed per pass.

#### Advantages
1. More MACs enable more operand data reuse.
2. Fewer off-chip memory accesses.

#### Drawback
1. The reuse-oriented mapping can limit MAC utilization (e.g., when the input data size is smaller than the array size, the remaining MACs sit idle).

#### Example
Google TPU

### 2. Scale-out

Split the MAC units into multiple smaller arrays that compute in parallel.

#### Advantages
1. Data can be mapped onto the MACs more flexibly, improving MAC utilization.
2. Lower design and re-configuration cost than scale-up.

#### Example
Microsoft Brainwave, NVIDIA Tesla V100

:::info
:information_source: DNN accelerators typically employ a regular array of multiply-accumulate (MAC) units to compute matrix multiplications efficiently by leveraging **data reuse** within the array.
:::

## SCALE-Sim

To analyze the design trade-offs between scale-up and scale-out, the authors built the **Systolic accelerator simulator** (SCALE-Sim), which models compute, memory access, and interface bandwidth.

![](https://i.imgur.com/kbpfkNt.png)

### Implementation Elements
1. A systolic-array-based compute unit
2. Three [double buffered](https://en.wikipedia.org/wiki/Multiple_buffering) SRAMs (two for operands and one for results)

### Analytical Runtime Model

#### Motivation
To allow for **fast design space exploration and rapid identification of design insights**, we augment the simulator with an analytical model that captures the first-order execution time of a single systolic array.

#### Purpose
Determine **the most performant configuration** for both monolithic (scale-up) and partitioned (scale-out) systems for a given workload.

:::warning
:warning: The analytical model does not consider cycle by cycle accesses and bandwidth demands due to limited memory sizes.
:::

Without considering memory access and bandwidth, the runtime model first finds the most efficient hardware configuration for a given workload; the cost of that configuration is then evaluated in the later steps.

#### Runtime for scale-up

![](https://i.imgur.com/fOLP84f.png)
![](https://i.imgur.com/wmqZBeq.png)

$S_{R}$ - Spatial Rows
$S_{C}$ - Spatial Columns
$T$ - Temporal

##### - Unlimited MAC units
:::success
$T_{scaleup-min} = 2S_{R} + S_{C} + T - 2$
:::

##### - Limited MAC units (**Folding**)

![](https://i.imgur.com/nOVFgra.png =50%x)

When the workload does not fit in an $R × C$ array, folds are generated by slicing the compute along the $S_{R}$ and $S_{C}$ dimensions, and each fold is run through the array in turn.

:::success
$T_{scaleup-min} = (2R + C + T - 2) \lceil S_{R}/R\rceil\lceil S_{C}/C\rceil$
:::

#### Runtime for scale-out

![](https://i.imgur.com/gLnjlHW.png =50%x)

Unlike in scale-up, where all the MAC units are arranged in a single $R × C$ array, in the scale-out configuration the MAC PEs are grouped into $P_{R} × P_{C}$ systolic arrays, each being a PE array of size $R × C$. Each partition handles its share of the spatial dimensions, $S'_{R} = \lceil S_{R}/P_{R}\rceil$ and $S'_{C} = \lceil S_{C}/P_{C}\rceil$.

:::success
$T_{scaleout-min} = (2R + C + T - 2) \lceil S'_{R}/R\rceil\lceil S'_{C}/C\rceil$
:::
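The analytical model is simple enough to transcribe directly into code. Below is a minimal Python sketch of the two formulas above, using the symbols already defined ($S_R$, $S_C$, $T$, $R$, $C$, $P_R$, $P_C$); the function and variable names are mine, not SCALE-Sim's.

```python
import math

def t_scaleup(s_r: int, s_c: int, t: int, r: int, c: int) -> int:
    """First-order runtime (cycles) of a monolithic R x C systolic array.

    A workload with spatial dims S_R x S_C and temporal dim T that does
    not fit is folded along S_R and S_C; each fold costs 2R + C + T - 2
    cycles. With R = S_R and C = S_C this reduces to 2*S_R + S_C + T - 2,
    the unlimited-MAC case.
    """
    folds = math.ceil(s_r / r) * math.ceil(s_c / c)
    return (2 * r + c + t - 2) * folds


def t_scaleout(s_r: int, s_c: int, t: int,
               r: int, c: int, p_r: int, p_c: int) -> int:
    """First-order runtime of P_R x P_C partitions, each an R x C array.

    Each partition processes S'_R = ceil(S_R / P_R) rows and
    S'_C = ceil(S_C / P_C) columns; partitions run in parallel, so one
    partition's folded runtime is the total runtime.
    """
    return t_scaleup(math.ceil(s_r / p_r), math.ceil(s_c / p_c), t, r, c)
```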
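Picking "the most performant configuration" with this model then reduces to a brute-force sweep. The sketch below reuses `t_scaleup` from above and, for brevity, only considers power-of-two aspect ratios of a monolithic array; the workload numbers in the example call are illustrative, not from the paper.

```python
def best_scaleup_shape(s_r: int, s_c: int, t: int, num_macs: int):
    """Return (runtime, R, C) minimizing t_scaleup over power-of-two
    aspect ratios of a monolithic array with num_macs MAC units."""
    best = None
    r = 1
    while r <= num_macs:
        c = num_macs // r
        cand = (t_scaleup(s_r, s_c, t, r, c), r, c)
        if best is None or cand[0] < best[0]:
            best = cand
        r *= 2
    return best

# Illustrative layer: S_R=256, S_C=128, T=512, mapped onto 4096 MACs.
print(best_scaleup_shape(256, 128, 512, 4096))
```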
### Simulation Flow

#### 1. Generate SRAM traffic traces from the runtime model
SCALE-SIM generates cycle accurate read addresses for the elements that must be fed into the top and left edges of the array such that the _**PE array never stalls**_. It also generates an output trace for the output matrix, which essentially constitutes the SRAM write traffic.

#### 2. Determine the compute + SRAM data-transfer runtime from the SRAM traces
SCALE-SIM parses the generated traffic traces to determine the total runtime for compute and for data transfer to and from SRAM. The SRAM trace also shows the number of rows and columns that have a valid mapping in each cycle.

#### 3. Determine the time available to fill the SRAM buffers, and generate the DRAM request trace
SCALE-SIM parses the SRAM traces and determines the time available to fill the double buffers such that no SRAM request misses. From this, SCALE-SIM generates a series of prefetch requests, which we call the DRAM trace.

#### 4. Estimate the bandwidth requirement from the DRAM trace
The DRAM traces are then used to estimate the interface bandwidth requirements for the given workload and the provided architecture configuration.

#### 5. Aggregate the final memory requests, compute efficiency, and other high-level metrics
The trace data generated at the SRAM and interface levels is further parsed to determine the total number of on-chip and off-chip requests, compute efficiency, and other high level metrics.
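For reference, the repository drives these simulations with an INI-style architecture config plus a per-workload topology CSV. The snippet below is a sketch based on the SCALE-Sim v1 README; field names and defaults may differ in newer versions, so check the repo before relying on it.

```
# run: python scale.py -arch_config=configs/scale.cfg -network=topologies/conv_nets/alexnet.csv

[general]
run_name = scaleup_32x32

[architecture_presets]
# R and C of the monolithic array
ArrayHeight:    32
ArrayWidth:     32
# double buffered SRAM sizes in KB (ifmap / filter / ofmap)
IfmapSramSz:    64
FilterSramSz:   64
OfmapSramSz:    64
# address-space offsets separating the three operand traces
IfmapOffset:    0
FilterOffset:   10000000
OfmapOffset:    20000000
# dataflow: os (output), ws (weight), or is (input) stationary
DataFlow:       ws
```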
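Step 4 amounts to counting the bytes moved by the DRAM trace over the cycle span it covers. A minimal sketch, assuming a hypothetical trace layout of one row per request cycle, `cycle, addr1, addr2, ...` (verify SCALE-Sim's actual output format before using):

```python
import csv

def avg_dram_bandwidth(trace_path: str, bytes_per_word: int = 1) -> float:
    """Average DRAM bandwidth (bytes/cycle) implied by a request trace.

    Assumes each row is `cycle, addr1, addr2, ...`: a request cycle
    followed by the addresses fetched in that cycle.
    """
    total_bytes = 0
    first_cycle = last_cycle = None
    with open(trace_path) as f:
        for row in csv.reader(f):
            if not row or not row[0].strip():
                continue
            cycle = int(float(row[0]))
            addrs = [a for a in row[1:] if a.strip()]
            total_bytes += len(addrs) * bytes_per_word
            if first_cycle is None:
                first_cycle = cycle
            last_cycle = cycle
    if first_cycle is None:
        return 0.0
    return total_bytes / max(1, last_cycle - first_cycle + 1)
```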
## Analysis

### Effect of Aspect Ratio

![](https://i.imgur.com/cmQTcsf.png)
(b) 4096 MACs, (c) 16384 MACs

1. The difference in runtime between the optimum configuration and the others can vary by several orders of magnitude even when the workload is the same, depending on the size of the array. With larger arrays this difference is exacerbated.
2. For configurations with low array utilization, the runtime of the layer is high, as expected; runtime generally drops as array utilization rises.
3. Beyond a point, even though high utilization is achieved, the improvement in runtime is minimal, because the time to fill data into and drain data out of the array starts to dominate.

### Runtime Comparison

![](https://i.imgur.com/nlaFC2C.png)

1. **Partitioned configurations are faster than monolithic configurations**
   Monolithic configurations are sometimes significantly slower (25x for the CB2a 1 layer) than partitioned configurations, and never faster than the corresponding partitioned configuration.
2. **Higher folding causes higher runtime**
   The runtime per fold is directly proportional to the array dimensions, which explains why the partitioned configurations are always faster. The difference in runtime per layer is amplified when the number of folds is high.

### Cost of scaling out

1. **Loss of spatial reuse**
   Losing reuse within the array over short wires also ++leads to longer traversals over an on-chip/off-chip network++ (depending on the location of the partitions) to distribute data to the different partitions and to collect outputs, which in turn can affect overall energy.
2. **Partition number ↑ BW ↑ Runtime ↓**
   As the number of partitions increases, the runtime goes down; however, the BW requirement rises due to the loss of the reuse originally provided by the internal wires, and due to increased replication of data among the partitions, which brings down the effective memory capacity.
3. **MAC units ↑ BW ↑**
   When scaling to a higher number of MAC units, the BW requirement is often higher than traditional DRAM BW. For instance, for both Resnet and Transformer layers with $2^{18}$ MAC units, about 10 KB/cycle of DRAM bandwidth is needed for stall-free operation at the sweet spot.
4. **With more MAC units, minimum energy shifts toward more partitions**
   For a lower number of MAC units (256, 1024 and 4096), the configuration with minimum energy is the monolithic one. However, as the number of MAC units increases, the point of minimum energy moves toward a larger number of partitions (scale-out).

:::info
:information_source: The energy consumption directly depends on:
1. the number of cycles the MAC units have been active
2. the number of accesses to SRAM and DRAM
:::

:::info
:information_source: The energy saved by shaving runtime off powering the massive compute array outweighs the extra energy spent due to the loss of reuse.
:::

#### :memo: Summary of scale-out cost
:::success
The data indicates that scaling out is ++beneficial for performance++ and, at larger MAC counts, ++more energy efficient++ than scaling up. However, the cost paid is ++the extra bandwidth requirement++ to keep the compute units fed, which even at the sweet spots is significantly higher than for the best scaled-up configuration at large MAC counts.
:::

## Conclusions

1. SCALE-SIM provides ++memory accesses and bandwidth requirements++ for various layers of CNN and natural language processing model workloads, across varying ++monolithic and partitioned systolic array based configurations++.
2. It depicts the inherent ++trade-off space for performance, DRAM bandwidth, and energy++, and identifies the sweet spots within that space for different workloads and performance points.