# ZigZag: A Memory-Centric Rapid DNN Accelerator Design Space Exploration Framework

###### tags: `paper`

## [Slides](https://drive.google.com/drive/folders/1N-GQqji8LPbVCFME4HX_EMGJopfcdbgN)

## Abstract & Introduction
- Targets efficient embedded deep learning systems
    - facing tight resource and power constraints
    - needs co-design of the DNN algorithm, the memory hierarchy, and the dataflow
- A lot of research has gone into specialized hardware accelerators, but the resulting designs are ad hoc and only locally optimal because of limited design space exploration
- DSE: Design Space Exploration
- 3 main design spaces:
    - algorithm space
    - hardware space
    - algorithm-to-hardware mapping space
- ZigZag: a memory-centric DNN-accelerator-DSE framework
- Experiments: ZigZag finds solutions that are up to 33% more energy-efficient

![](https://i.imgur.com/8LL0npI.png)

#### Contributions:
- Hardware Cost Estimator: memory-centric dataflow representation
- Algorithm-to-hardware mapping space: Temporal Mapping Generator
    - supports both even and uneven mappings
    > DSE tools typically encompass a temporal mapping generator, a.k.a. auto-scheduler or mapper, to find the optimum temporal mapping scheme for mapping a certain NN layer onto a certain accelerator architecture
- Hardware space: Architecture Generator
    - fully flexible
- -> Leads to better design points than other frameworks
    > ZigZag uses smarter searching strategies to explore the enlarged design space, reducing the runtime while still locating the global optimum design point as exhaustive search does

## Hardware Cost Estimator
- memory-centric dataflow representation
    - based on an enhanced nested-for-loop format
    - integrates the information of the algorithm, the accelerator, and the algorithm-to-accelerator spatial & temporal mapping
    - captures the interaction between dataflow and memory hierarchy
- a loop relevance principle
    - extracts key information such as the number of memory accesses, the required memory bandwidth, etc.
- hardware cost integrator
    - derives energy and latency

#### Memory-Centric Dataflow Representation
- captures all memory hierarchy attributes as well as the spatial and temporal algorithm-to-hardware mapping schemes

![](https://i.imgur.com/jm3MABx.png)
![](https://i.imgur.com/kEh99b9.png)
> mapping AlexNet convolutional layer 2 onto the Eyeriss V1 architecture

- dataflow information of the three operands (Weights, Inputs, Outputs)
- architectural levels are represented from bottom to top (divided by bold lines): MAC units, potential register file and/or SRAM (Global Buffer), up to DRAM
- each alphanumeric pair indicates a for-loop, e.g. "K 32" = "for k = 0 to 32-1"
- assigning for-loops to different architectural levels is loop blocking (loop tiling); changing their order within a level is loop reordering
- a "u" suffix after a loop name indicates spatial unrolling
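To make the enhanced nested-for-loop format a bit more tangible, here is a minimal Python sketch of how such a mapping could be written down as data. This is not ZigZag's actual data structure or API; the level names (RF, GB, DRAM) and the toy loop sizes are assumptions for illustration, loosely inspired by the Eyeriss-style example above.

```python
# Minimal, hypothetical encoding of the memory-centric nested-for-loop
# representation (not ZigZag's actual data structure).
from dataclasses import dataclass

@dataclass
class Loop:
    name: str             # e.g. "K", "C", "OX", "FY"
    size: int             # loop bound: "K 32" means "for k = 0 to 32-1"
    unroll: bool = False  # True marks spatial unrolling (the "u" suffix)

@dataclass
class MemoryLevel:
    name: str             # architectural level, listed bottom-up (assumed names)
    loops: list           # loops assigned to this level, innermost first

# Toy mapping of a small conv layer, from MAC-level register file up to DRAM.
mapping = [
    MemoryLevel("RF",   [Loop("FX", 5), Loop("FY", 5, unroll=True)]),
    MemoryLevel("GB",   [Loop("C", 16), Loop("OX", 13)]),
    MemoryLevel("DRAM", [Loop("K", 32), Loop("OY", 13)]),
]

# Total loop iterations implied by the mapping; if the loops cover the
# whole layer this equals the number of MAC operations.
total_macs = 1
for level in mapping:
    for loop in level.loops:
        total_macs *= loop.size
print("MAC operations implied by this mapping:", total_macs)
```

Reading the levels bottom-up recovers one large nested loop: moving a `(loop, size)` pair between levels is loop blocking, reordering pairs within a level is loop reordering, and the `unroll` flag plays the role of the "u" suffix.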
#### Loop Information Extractor Based On the Loop Relevance Principle
- simplification and unification
- analyzes and extracts basic hardware and data attributes

##### Loop Relevance Principle

![](https://i.imgur.com/TWYjD27.png)

- looping through 'r' (relevant) loops means new data need to be fetched or generated
- looping through 'ir' (irrelevant) loops creates data reuse opportunities
- for a 'pr' (partially relevant) loop pair, data reuse opportunities arise when the sum of their indices remains constant

pr-loop-pair-triggered input data reuse

![](https://i.imgur.com/Oyr2Ph1.png)

- with spatial unrolling, inputs can be broadcast diagonally in the PE array
    - FYu and OYu
- with temporal and spatio-temporal unrolling, data reuse is possible through a FIFO buffer that shifts the input data over consecutive clock cycles
    - FX and OXu
    - keeping the sum of the FX and OX indices constant across neighboring PE locations and consecutive cycles allows Inputs to be reused in a FIFO manner

##### Loop Information Extractor

![](https://i.imgur.com/ZdhIlX6.png)

- Li is the current memory level
- L(i-1) is the level below the current memory level
- Lmin is the lowest memory level

![](https://i.imgur.com/lcTWUD4.png)

- Turnaround cycles: the number of cycles a memory can keep the computation running with the data it currently contains
- Unique unit count: how many hardware components contain distinct data
- Memory access count: "operand size" * "total data reuse factor"
    > how many times each element needs to be accessed repetitively at the current memory level
- Required memory bandwidth: the minimum bandwidth that keeps the computation running without stalls

![](https://i.imgur.com/9fyBe0a.png)

#### Hardware Cost Integrator
Integrates the extracted technology-independent loop information with technology-dependent characteristics to estimate the final hardware cost and performance (small sketches of the energy and latency estimates follow this section).

- area: the total area of all used on-chip memories
- energy: MAC computation energy plus memory access energy
    - MAC computation energy: total number of MACs * average single-MAC-operation energy
    - memory access energy: memory access count * per-data-access energy of each memory
    - a good dataflow concentrates high data access counts at low-cost memories and keeps data access counts low at high-cost memories

![](https://i.imgur.com/tjU3OKl.png)

- Latency/Throughput:
    - tightly related to PE array utilization
    - sources of PE array under-utilization:
        - Spatial stalls: mismatch between the spatial unrolling dimensions and the neural network layer dimensions
        - Temporal stalls: mainly from memory bandwidth bottlenecks during computation
            - calculated by comparing the actual memory bandwidth with the required memory bandwidth

![](https://i.imgur.com/yJKyhOL.png)
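A minimal sketch of the energy part of the integrator, directly following the breakdown above (total MACs x per-MAC energy, plus per memory level access count x per-access energy). The level names, per-access energies, and access counts below are invented toy numbers, not values from the paper.

```python
# Minimal sketch of the energy estimate in the hardware cost integrator.
# All level names and per-access/per-MAC energies are illustrative assumptions.

# technology-dependent characteristics (pJ)
E_MAC = 0.5                                           # average single-MAC-operation energy
E_ACCESS = {"RF": 0.1, "GB": 2.0, "DRAM": 100.0}      # per-data-access energy

# technology-independent loop information from the extractor:
# memory access counts per operand and per memory level (toy values)
ACCESS_COUNT = {
    "W": {"RF": 4_000_000, "GB": 80_000,  "DRAM": 10_000},
    "I": {"RF": 4_000_000, "GB": 150_000, "DRAM": 20_000},
    "O": {"RF": 4_000_000, "GB": 60_000,  "DRAM": 15_000},
}
TOTAL_MACS = 4_000_000

def total_energy(total_macs, access_count, e_mac, e_access):
    """Energy = MAC energy + sum over operands and levels of (accesses * per-access cost)."""
    mac_energy = total_macs * e_mac
    mem_energy = sum(
        count * e_access[level]
        for per_level in access_count.values()
        for level, count in per_level.items()
    )
    return mac_energy + mem_energy

# energies are in pJ, so divide by 1e6 to report uJ
print(f"estimated energy: {total_energy(TOTAL_MACS, ACCESS_COUNT, E_MAC, E_ACCESS) / 1e6:.2f} uJ")
```

A good dataflow shows up directly in this sum: it pushes the large access counts onto the cheap RF level and keeps the DRAM counts small.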
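For the latency side, the sketch below is one possible reading of the temporal-stall bullet: compare each level's required bandwidth with its actual bandwidth and stretch the ideal cycle count by the worst ratio. The proportional model and all numbers are assumptions, not ZigZag's actual latency formula.

```python
# Toy sketch of temporal-stall estimation: if a memory level cannot deliver
# the required words/cycle, computation is slowed down proportionally.
# Level names and numbers are illustrative assumptions.

ideal_cycles = 250_000          # cycles assuming a fully utilized PE array

# (required bandwidth, actual bandwidth) in words/cycle per memory level
bandwidth = {
    "GB":   (4.0, 8.0),         # enough bandwidth -> no stall from this level
    "DRAM": (1.5, 1.0),         # bandwidth bottleneck -> temporal stalls
}

# the most constrained level dictates the stall factor
stall_factor = max(max(req / act, 1.0) for req, act in bandwidth.values())
actual_cycles = ideal_cycles * stall_factor
print(f"stall factor {stall_factor:.2f} -> {int(actual_cycles)} cycles")
```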
## Temporal Mapping Generator
- generates valid temporal mapping points for any type of memory hierarchy
- searches for loop blocking and reordering schemes
- each memory level of each operand can be shared or separate, allowing maximum flexibility in the hardware design space
- three fast search strategies on top of the original exhaustive search:
    - data-stationarity-based heuristic search
    - data-reuse-based heuristic search
    - early-cost-evaluation-based iterative search
- works on even and uneven memory hierarchies, i.e. hierarchies with shared and/or non-shared levels between different operands
    - even: every for-loop belongs to the same memory level for inputs, weights, and outputs
    - uneven mapping relaxes this constraint and increases the design space

#### Enhanced Loop Blocking Representation
- Loop prime factors (LPFs): the smallest sizes into which a loop can be split, i.e., the atomic blocking sizes that cannot be further divided
    - the basic blocks of the search algorithm (a small LPF/roof sketch follows the search subsections below)
- **roofs** and **virtual levels**
- roof variable: guides the assignment process
    - a tuple defined for each operand:
        - the memory level index where LPFs are currently being assigned
        - the maximum number of relevant prime factors that can still be assigned at that memory level

![](https://i.imgur.com/GxoflRn.png)

- the second level of the input memory hierarchy (at index 1) has a storage capacity of 884736 bits, or 884736/16 = 55296 blocks
- the prime factors this level already holds are those assigned and relevant in the levels below ((FX,5),(C,2)) plus those coming from the relevant spatial unrolling below ((FYu,5),(OYu,13),(OYu,2)), which combined correspond to 5 * 2 * (5 + 26 - 1) = 300 blocks
- therefore, the second level can still store 55296/300 = 184 blocks
- when no LPF combination fits within the roofs of all operands, the smallest roof jumps to the next level available in the hierarchy
- virtual memory level separator: a point at which all operands hold equal sets of LPFs

#### Exhaustive Search
- generates all possible schemes through loop blocking and loop reordering
- takes multiple hours to run the search and evaluate millions of valid mappings for a single layer

#### Heuristic Search Based On Data Reuse and Stationarity
- extracts the data reuse for each operand and level
- if a particular combination of loop prime factors makes the data reuse equal to 1 at a specific level, that level is unnecessary, since it only causes useless memory accesses
- discards all mappings with data reuse equal to 1 at intermediate levels (a schematic sketch follows, after the iterative-search subsection)

#### Iterative Search Based On Early-Stage Cost Evaluation
- starting from the innermost level of the hierarchy, each iteration step finds the set of LPFs that yields the largest energy savings
- it might reach a sub-optimal point since it ignores the influence of the upper levels
- remarkable speed-up and a smaller memory footprint
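As referenced above, a small sketch of the two mechanical steps behind this representation: decomposing loop bounds into loop prime factors, and redoing the roof arithmetic of the worked example (884736-bit level, 16-bit data). The helper function is hypothetical; only the numbers come from the text.

```python
# Minimal sketch: (1) split loop bounds into loop prime factors (LPFs),
# (2) redo the roof capacity check of the worked example above.

def loop_prime_factors(size):
    """Decompose a loop bound into its prime factors (the atomic LPFs)."""
    factors, d = [], 2
    while d * d <= size:
        while size % d == 0:
            factors.append(d)
            size //= d
        d += 1
    if size > 1:
        factors.append(size)
    return factors

# e.g. an output-row loop of size 26 can only be blocked as 2 x 13
print("LPFs of OY=26:", loop_prime_factors(26))       # [2, 13]
print("LPFs of K=32:", loop_prime_factors(32))        # [2, 2, 2, 2, 2]

# Roof check for the Input operand at memory level index 1 (worked example):
capacity_blocks = 884736 // 16                        # 884736 bits / 16-bit words = 55296
FX, C = 5, 2                                          # relevant LPFs already assigned below
FY, OY = 5, 13 * 2                                    # relevant spatial unrolling below
occupied_blocks = FX * C * (FY + OY - 1)              # 5 * 2 * (5 + 26 - 1) = 300
roof = capacity_blocks // occupied_blocks             # 55296 / 300 = 184 blocks left
print("blocks still assignable under this roof:", roof)
```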
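And a schematic sketch of the data-reuse pruning heuristic, as announced in the heuristic-search bullets above. The reuse model (product of the loop sizes irrelevant to an operand at a level) is a simplification of the loop relevance principle, and the loop role table and toy mappings are illustrative, not ZigZag's implementation.

```python
# Schematic sketch of data-reuse pruning: the reuse an operand gets from a
# memory level is approximated by the product of the 'ir' loop sizes at that
# level; mappings with reuse == 1 at an intermediate level are discarded.

# which loop dimensions are irrelevant ('ir') to each operand of a conv layer
IRRELEVANT = {
    "W": {"OX", "OY", "B"},    # weights are reused across output pixels / batch
    "O": {"C", "FX", "FY"},    # partial sums are reused across input channels / kernel
    "I": {"K"},                # inputs are reused across output channels
}

def level_reuse(operand, loops_at_level):
    """Product of the irrelevant loop sizes for this operand at one memory level."""
    reuse = 1
    for name, size in loops_at_level:
        if name in IRRELEVANT[operand]:
            reuse *= size
    return reuse

def keep_mapping(mapping):
    """Discard mappings whose reuse is 1 for some operand at an intermediate level."""
    intermediate = mapping[1:-1]                      # skip the lowest and top levels
    return all(level_reuse(op, lvl) > 1
               for op in IRRELEVANT for lvl in intermediate)

# toy 3-level temporal mappings: list of levels, each a list of (loop, size)
good = [[("FX", 5)], [("K", 8), ("C", 4), ("OX", 13)], [("OY", 13), ("B", 1)]]
bad  = [[("FX", 5)], [("C", 4), ("FY", 5)],            [("K", 8), ("OY", 13)]]

print(keep_mapping(good))   # True  - every operand reuses data at the middle level
print(keep_mapping(bad))    # False - weights get no reuse at the middle level
```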
## Architecture Generator
- focuses on the auto-generation of all valid memory hierarchies under a given area budget
- autonomously generates all valid memory hierarchies in a search space constrained by area, PE array dimensions, and the spatial unrolling scheme
- Comprehensive Memory Pool Description
    - storage size: bits
    - cost of access: pJ
    - area: um^2^
    - several distinct bandwidths
    - characterized with CACTI 7.0

![](https://i.imgur.com/I9MIvmC.png)
![](https://i.imgur.com/fD5ZtpD.png)

## Validation
- up to 20%/30% energy savings

## Refs
[AI chips (AI晶片)](https://kknews.cc/zh-tw/science/4enqykg.html)
[Zhihu: DNN acceleration](https://zhuanlan.zhihu.com/p/68042190)
[ITRI: DNN accelerator exploration](https://ictjournal.itri.org.tw/content/Messagess/contents.aspx?PView=1&MmmID=654304432061644411&MSID=1001517114142321327)
[Zhihu: accelerator-related](https://zhuanlan.zhihu.com/p/88927564)
[Pareto optimal](https://zh.wikipedia.org/wiki/%E5%B8%95%E7%B4%AF%E6%89%98%E6%9C%80%E4%BC%98)
[loop tiling](https://en.wikipedia.org/wiki/Loop_nest_optimization)
[loop optimization](https://engineering.purdue.edu/~milind/ece468/2015fall/lecture-12.pdf)
[CNN intro 1](https://medium.com/jameslearningnote/%E8%B3%87%E6%96%99%E5%88%86%E6%9E%90-%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-%E7%AC%AC5-1%E8%AC%9B-%E5%8D%B7%E7%A9%8D%E7%A5%9E%E7%B6%93%E7%B6%B2%E7%B5%A1%E4%BB%8B%E7%B4%B9-convolutional-neural-network-4f8249d65d4f)
[CNN intro 2 (Hung-yi Lee)](http://violin-tao.blogspot.com/2017/07/ml-convolutional-neural-network-cnn.html)
[Conv2D](https://ithelp.ithome.com.tw/articles/10187424)
[Conv2D operation flow](https://medium.com/@chih.sheng.huang821/%E5%8D%B7%E7%A9%8D%E7%A5%9E%E7%B6%93%E7%B6%B2%E8%B7%AF-convolutional-neural-network-cnn-cnn%E9%81%8B%E7%AE%97%E6%B5%81%E7%A8%8B-ecaec240a631)