# Capstan: A Vector RDA for Sparsity
###### tags: `Accelerators`
###### paper link: [link](https://arxiv.org/pdf/2104.12760.pdf)
###### slides and video: `none`
# 1. Introduction
Sparse linear algebra is used for efficient circuit simulation, finite element analysis, machine learning (ML), and more.
Similarly, graph analytics provides insights into large, unstructured datasets like web graphs and social networks.
With data sizes increasing and CPU performance stalled, new architectures are necessary for sparse computation.
Reconfigurable dataflow accelerators (RDAs) provide a sea of flexible compute and memory resources in a programmable interconnect, enabling dataflow pipelines that exploit parallelism within and between iterations.
Most proposed RDAs have focused on primitives that benefit dense workloads: a vectorized datapath to exploit SIMD parallelism, a simple memory system, and short buffers at each node's input to avoid global pipeline interlocks.
However, sparse applications present several challenges to efficient use of SIMD datapaths and on-chip memory bandwidth due to their irregularity and dynamic, data-dependent communication patterns. Ideally, a unified sparse-dense RDA should retain its efficiency on dense workloads while adding native support for sparse ones.
## Problem
Mapping sparsity to dense RDAs is complicated by structural hazards (when accessing memory) and control hazards (when iterating).
First, sparse applications frequently use pointer-indexed accesses. Multiple concurrent accesses may point to the same memory bank, even though each bank can serve only one access per cycle.
* Current solutions either tolerate low throughput or overprovision banks and crossbars to minimize conflicts; the sketch below shows how rarely random accesses are conflict-free.
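To quantify this hazard, here is a toy Python estimate (ours, not from the paper) of how rarely a full vector of random accesses is conflict-free with Capstan-like parameters (16 lanes, 16 banks):

```python
import random

LANES, BANKS = 16, 16  # Capstan-like parameters (l = b = 16)

def conflict_free_fraction(trials=100_000):
    """Fraction of random 16-lane access vectors with zero bank conflicts."""
    ok = 0
    for _ in range(trials):
        banks = [random.randrange(BANKS) for _ in range(LANES)]
        if len(set(banks)) == LANES:  # every lane hit a distinct bank
            ok += 1
    return ok / trials

# Analytically 16!/16^16 ~ 1e-6: a random access vector essentially
# always has at least one conflict, so a stall-on-conflict design
# would stall on nearly every vector.
print(conflict_free_fraction())
```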
Second, sparse iteration may have loop-carried control dependences when iterating over two sparse dimensions. Under a scalar programming model, each pair of non-zero inputs must be compared before deciding which list to dequeue from, because each comparison's result changes the next comparison's inputs.
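As a concrete illustration (a minimal scalar sketch, not the paper's code), intersecting two sorted index lists carries a control dependence across iterations: each comparison decides which pointer advances, so the next comparison cannot begin until the current one resolves, defeating naive SIMD vectorization:

```python
def intersect(a, b):
    """Intersect two sorted coordinate (index) lists, scalar two-pointer style."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:       # match: emit and advance both lists
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:      # this comparison's outcome decides...
            i += 1             # ...which pointer moves next, so
        else:                  # iterations cannot be issued in parallel
            j += 1
    return out

print(intersect([0, 2, 5, 7], [2, 3, 5, 8]))  # [2, 5]
```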
Scheduling accesses to avoid structural hazards is complicated by positional dataflow, the typical paradigm for RDAs. In positional dataflow, senders and receivers are synchronized, and loop indices are communicated implicitly via the sequence of data elements.
A compute graph can thus be mapped to parallel, pipelined execution units without reordering data elements or sending control information across the network. However, sparsity requires moving a memory request from the originating lane to the correct bank; integration with positional dense computation requires that this shuffling is precisely undone.
## Solution

The authors propose Capstan, a positional-dataflow, sparse-dense hybrid RDA that is programmable while approaching the performance of bespoke ASICs.
Capstan supports all sparse-iteration tensor applications with a low-level map-reduce programming model based on Spatial and a clear path to high-level programming.
Capstan lessens the impact of memory-bank structural hazards by scheduling on-chip memory accesses over multiple cycles to increase request-level parallelism without increasing crossbar size.
A flexible memory-shuffle network then increases memory parallelism beyond the number of SIMD lanes while respecting the positional constraints needed for sparse-dense kernel fusion.
Finally, to enable vectorized sparse iteration in the presence of control hazards, Capstan uses a scanner to make multiple control-flow decisions per cycle while still supporting a wide range of applications.
## Contribution
* An evaluation of a programmable RDA that supports efficient sparse and dense computation, including a comparison against multiple state-of-the-art baselines.
* A memory reordering pipeline that exploits the timing flexibility of a loosely-coupled RDA to increase random-access throughput from 32% to 80%.
* Specialized sparse iteration hardware that exploits declarative sparsity’s flexibility to run multiple loop iterations in a single cycle.
# 2. Implementation
### The Capstan Architecture

1. Dynamically scheduled on-chip sparse accesses
2. Shuffle networks for outer-loop parallelization
3. Scanners for sparse iteration
* Like Plasticine, Capstan is built as a checkerboard grid of compute units (CUs) and memory units (MUs) surrounded by address generators (AGs).
* lane count (l = 16)
* banks per SpMU (b = 16)
* local buffer depth (d = 16)
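For later reference, these design-point parameters can be collected in a small config (hypothetical naming, values from the paper; the later sketches hard-code the same values):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CapstanConfig:
    lanes: int = 16         # SIMD lane count (l)
    banks: int = 16         # SRAM banks per SpMU (b)
    buffer_depth: int = 16  # local buffer depth (d)

CFG = CapstanConfig()
```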
#### 1. Sparse Memory Unit (SpMU)

On-chip sparse accesses are handled by sparse memory units (SpMUs), which dynamically schedule sparse requests to banks.
* The SpMU’s main architectural component is a reordering pipeline added to Plasticine’s MU.
Capstan introduces a scheduled pipeline where d vectors are buffered to stop a single bank conflict from creating a multi-cycle stall.
* Dense programs have a fixed non-conflicting lane-bank mapping.
* Sparse programs have a random mapping where multiple lanes may request the same bank; without scheduling, this would force a multi-cycle stall while the conflicting accesses are resolved.
1. Every memory access pending in the issue queue first bids for access to its bank.
2. An allocator computes a valid crossbar configuration. Following allocation, the SpMU configures its crossbars to route requests from the input-queue lanes to banks.
3. Each request then enters an independent read-modify-write (RMW) execution pipeline with one SRAM bank and an FPU (Floating Point Unit), which is capable of integer and floating point addition and subtraction along with several bitwise operations
4. Finally, another crossbar inversely permutes the data based on the bank allocation and writes to the output queue. When all requests in a vector complete, the vector is ready to dequeue.
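The following toy simulation (our model, not the paper's RTL) shows why buffering d vectors helps: with a one-deep window, a single conflict idles most banks, while a 16-deep window almost always finds a conflict-free request for each bank. The absolute numbers are illustrative, but the gap mirrors the paper's 32%-to-80% improvement:

```python
import random
from collections import deque

LANES, BANKS = 16, 16  # l = b = 16

def utilization(depth, vectors=2000, seed=0):
    """Bank utilization with a `depth`-deep per-lane issue queue.

    Each cycle, every lane may issue at most one request and every bank
    may serve at most one; each lane issues its oldest pending request
    whose bank is still free (a greedy stand-in for the allocator).
    """
    rng = random.Random(seed)
    lanes = [deque() for _ in range(LANES)]
    issued = served = cycles = 0
    total = vectors * LANES
    while served < total:
        # Refill: enqueue one request per lane per vector while room remains.
        while issued < vectors and all(len(q) < depth for q in lanes):
            for q in lanes:
                q.append(rng.randrange(BANKS))
            issued += 1
        busy = set()
        for q in lanes:
            for idx, bank in enumerate(q):  # oldest first within the lane
                if bank not in busy:
                    busy.add(bank)
                    del q[idx]
                    served += 1
                    break  # one grant per lane per cycle
        cycles += 1
    return served / (cycles * BANKS)

print(f"depth=1 : {utilization(1):.2f}")   # a conflict idles most banks
print(f"depth=16: {utilization(16):.2f}")  # deep window keeps banks busy
```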
#### Allocation
1. The first stage of allocation is hashing input addresses into bank requests and converting them to one-hot vectors (length b). The d one-hot vectors in each lane are bitwise-ORed into a vector with all requested banks.
2. Concatenating the lanes' requests forms an l×b request matrix whose rows correspond to lanes and whose columns correspond to banks: if R(l, b) = 1, then there is at least one request from lane l to bank b. A three-iteration, input-first l×b separable allocator identifies a set of non-conflicting accesses (at most one per lane and one per bank).
3. If a lane in the issue queue contains multiple requests to the same bank (e.g., lane 1 has two requests for bank 3), a per-lane priority encoder grants the oldest one.
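A compact Python reconstruction of these steps (ours; bitmasks stand in for one-hot vectors, and lowest-index tie-breaking stands in for the hardware arbiters):

```python
LANES = 16  # l

def request_matrix(issue_queue):
    """Step 1: OR each lane's one-hot bank requests into a row bitmask.

    issue_queue[lane] lists that lane's pending bank ids, oldest first
    (at most d entries). Bit `bank` of row `lane` is set iff the lane
    has at least one request for that bank.
    """
    return [sum(1 << bank for bank in set(q)) for q in issue_queue]

def allocate(R, iters=3, lanes=LANES):
    """Steps 2-3: input-first separable allocation over the request matrix.

    Per iteration: every ungranted lane picks one requested, still-free
    bank (input arbitration), then every bank accepts one picking lane
    (output arbitration). Losers retry next iteration. The result has
    at most one grant per lane and per bank; in hardware, a per-lane
    priority encoder then issues the oldest request to the granted bank.
    """
    grants, taken = {}, 0
    for _ in range(iters):
        picks = {}
        for lane in range(lanes):
            if lane in grants:
                continue
            free = R[lane] & ~taken
            if free:
                picks[lane] = (free & -free).bit_length() - 1  # lowest free bank
        winners = {}
        for lane, bank in picks.items():
            winners.setdefault(bank, lane)  # lowest-numbered lane wins
        for bank, lane in winners.items():
            grants[lane] = bank
            taken |= 1 << bank
    return grants  # lane -> granted bank

queues = [[3, 5], [3], [3, 0], [7]]  # 4 lanes shown for brevity, oldest first
print(allocate(request_matrix(queues), lanes=4))  # {0: 3, 2: 0, 3: 7}
```

A lane whose only requested banks are already taken (lane 1 above) simply waits for a later cycle, which is exactly the case the d-deep issue queue hides.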
#### 2. Shuffle Network

Shuffle networks combine requests between parallel outer-loop iterations while respecting structural hazards and ordering constraints.
Capstan has three vertical networks and three horizontal networks: the outer networks connect accesses to DRAM, and the inner networks are used for SRAM accesses.

Each merge unit:
1. takes two vectors of incoming requests, and
2. tests a single address bit that determines whether each request is forwarded to the unit's half or dropped (i.e., intended for the other half).
Each CU generates its own vector of addresses, which have superscripts showing the originating CU’s letter (A–D) so they can be tracked through the process.
The first merge step does a top-half/bottom-half split, partitioning on Addr[3], and the second merge step partitions on Addr[2].
Each stage tracks the shuffle decisions used to merge vectors and inverts these permutations when replies are returned.
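A minimal reconstruction (ours) of one merge step, tagging each request with its origin so the reply path can invert the shuffle:

```python
def merge_step(vec_a, vec_b, bit):
    """One shuffle-network merge step over two incoming request vectors.

    Requests whose address has `bit` clear go to the low half; the rest
    go to the high half. Each request keeps an (origin, slot) tag so
    replies can later be permuted back to their originating lane.
    """
    low, high = [], []
    for origin, vec in (("A", vec_a), ("B", vec_b)):
        for slot, addr in enumerate(vec):
            side = high if (addr >> bit) & 1 else low
            side.append((addr, (origin, slot)))
    return low, high

# Two CUs each issue a vector of addresses; the first step tests Addr[3].
low, high = merge_step([0x2, 0xA, 0x5], [0x9, 0x1, 0xC], bit=3)
print(low)   # requests with Addr[3] == 0, tagged with (origin CU, slot)
print(high)  # requests with Addr[3] == 1; the next step would test Addr[2]
```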
#### 3. Sparse Loop Headers: Scanner
The scanner, which implements sparse loop headers, is a relatively simple block: the key insight is that it requires only O(log n) levels of logic, far fewer than the O(n) levels a chain of arbitrary, mutually dependent decisions would require.


1. The bit-vector scanner, which is used for vectorized sparse iteration, starts by computing either the intersection or union of its inputs.
2. In the pipeline, the 16 selected bits are passed into encoders as a 256×16 array to produce the dense indices (j).
3. Finally, the scanner uses these indices to look up prefix sums over the inputs, yielding indices into the compressed input vectors (jA/jB).
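A bit-level sketch (ours; Python integers stand in for the 256-bit occupancy vectors, and the loop stands in for the parallel priority encoders):

```python
def scan(bits_a, bits_b, mode="intersect", lanes=16):
    """One scanner step: select up to `lanes` set bits and index both inputs.

    bits_a / bits_b are occupancy bitmasks of two compressed sparse
    vectors. For each selected dense index j, the prefix popcounts jA
    and jB give the positions of that coordinate in A's and B's
    compressed value arrays.
    """
    sel = bits_a & bits_b if mode == "intersect" else bits_a | bits_b
    out = []
    while sel and len(out) < lanes:
        j = (sel & -sel).bit_length() - 1    # next set bit (priority encode)
        below = (1 << j) - 1                 # mask of bit positions below j
        jA = bin(bits_a & below).count("1")  # prefix sum into A's values
        jB = bin(bits_b & below).count("1")  # prefix sum into B's values
        out.append((j, jA, jB))
        sel &= sel - 1                       # clear the handled bit
    return out

# Nonzeros at {0, 2, 5, 7} in A and {2, 3, 5} in B:
print(scan(0b10100101, 0b00101100))  # [(2, 1, 0), (5, 2, 2)]
```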
# 3. Results

The authors performed several sensitivity studies using synthesis and microbenchmark simulation to quantify the trade-offs in individual units.


# 4. Conclusion
1. Capstan is a reconfigurable, vectorized, streaming, scalable sparse accelerator that also supports dense data analytics.
2. Capstan unifies applications and hardware with sparse iteration modes: these permit algorithmic optimizations and hardware optimizations like vectorization.
3. Capstan’s novel bank allocation scheme also provides high random throughput (80%) to distributed sparse memories—more than double that of an arbitrated baseline.
4. With only 16% more area and 12% more power, Capstan brings new applications to Plasticine and runs existing ones 7.6× to 365× faster.
5. Capstan is also significantly faster than CPU, GPU, and several accelerator baselines; the only baseline with better performance is EIE because it is able to store models entirely on-chip, which is impractical for a general-purpose accelerator.
# 5. Our work



## Process
1. Images in the frequency domain exhibit sparsity, but zeroing out an image's very small frequency components performs poorly.
2. Zeroing out the weights' very small frequency components works reasonably well.
3. Therefore, the weights' FFT can be preprocessed into CSB format while recording the indexes of the non-zero entries; each image in the frequency domain is then treated as a vector and multiplied via SpMV, reducing both computation and storage.
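A minimal sketch of this pipeline (ours), using SciPy's CSR format as a stand-in for the CSB layout and random data for illustration:

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparsify_filters_fft(filters, threshold):
    """FFT each filter, zero small frequency components, pack sparse rows.

    `filters` has shape (num_filters, H, W); the result is a sparse
    matrix (CSR here, standing in for CSB) whose rows are the pruned
    filter spectra, storing only non-zero values and their indexes.
    """
    spectra = np.fft.fft2(filters, axes=(-2, -1)).reshape(len(filters), -1)
    spectra[np.abs(spectra) < threshold] = 0   # prune small frequencies
    return csr_matrix(spectra)

rng = np.random.default_rng(0)
Wf = sparsify_filters_fft(rng.standard_normal((32, 16, 16)), threshold=8.0)
print(f"kept {Wf.nnz} of {Wf.shape[0] * Wf.shape[1]} coefficients")

# Each image's spectrum becomes a vector; the layer is then an SpMV
# that skips every pruned coefficient.
img_f = np.fft.fft2(rng.standard_normal((16, 16))).reshape(-1)
out = Wf @ img_f
```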
## Drawbacks
1. Converting to the FFT domain yields both a real part and an imaginary part, so storage increases.
2. The space saved by sparsity must be weighed against the space added by the FFT (see the estimate below).
3. Over a full model inference, the fraction of coefficients set to zero affects accuracy: skipping zeros reduces computation and speeds things up, but this trades off against accuracy.
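A back-of-the-envelope check of drawback 2 (our arithmetic, assuming 4-byte real weights, 8-byte complex values, 4-byte indexes, and ignoring row-pointer overhead):

```python
def sparse_fft_bytes(n, density, value_bytes=8, index_bytes=4):
    """Bytes to store a pruned complex spectrum in a CSR-like format."""
    return n * density * (value_bytes + index_bytes)

def dense_bytes(n, value_bytes=4):
    """Bytes to store the original real-valued weights."""
    return n * value_bytes

n = 1 << 20
for d in (0.5, 1 / 3, 0.2):
    ratio = sparse_fft_bytes(n, d) / dense_bytes(n)
    print(f"density {d:.2f}: {ratio:.2f}x the dense size")

# Break-even sits at density 1/3: the spectrum must be at least ~67%
# zero before the sparse FFT-domain copy beats the original weights.
```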