# Sparse-TPU: Adapting Systolic Arrays for Sparse Matrices
###### tags: `Accelerators`
###### paper origin: `ICS '20`
###### paper: [Link](https://tnm.engin.umich.edu/wp-content/uploads/sites/353/2020/08/2020.6.sparse-tpu_ics2020.pdf)
## Introduction
Two-dimensional (2D) systolic arrays have been proposed for energy-efficient execution of dense matrix operations [18]. One state-of-the-art systolic array solution is Google's Tensor Processing Unit (TPU) [18, 33]. Its core is a 2D systolic array of 256×256 identical Processing Elements (PEs) that perform 8-bit Multiply-and-Accumulate (MAC) arithmetic. The latest version (v3) has 128×128 elements and supports floating-point arithmetic.
#### Problem
Sparse algorithms have some disadvantages: they typically use data structures (e.g. compressed sparse row) that are more complex than simple vectors or matrices, which results in <font color = red>**a higher number of memory accesses**</font> per useful computation. Also, because the associated data structures are not easily vectorized, they are not compatible with conventional systolic paradigms.
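To make the memory-access overhead concrete, here is a minimal sketch of the compressed sparse row (CSR) format mentioned above, written for illustration (it is not code from the paper). Note the indirect load through `col_idx` on every multiply: that indirection is the extra memory traffic per useful computation.

```python
# A small 4x4 matrix with 5 non-zeros, stored in CSR as three arrays.
dense = [
    [5, 0, 0, 1],
    [0, 0, 2, 0],
    [0, 3, 0, 0],
    [0, 0, 0, 4],
]

values, col_idx, row_ptr = [], [], [0]
for row in dense:
    for j, v in enumerate(row):
        if v != 0:
            values.append(v)    # non-zero values, row by row
            col_idx.append(j)   # column index of each non-zero
    row_ptr.append(len(values)) # where each row starts in `values`

def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x in CSR form. Each multiply needs an indirect
    access x[col_idx[k]] -- the overhead discussed above."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]  # indirect load
        y.append(acc)
    return y

print(spmv_csr(values, col_idx, row_ptr, [1, 1, 1, 1]))  # [6, 2, 3, 4]
```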
## Target
1. We propose an algorithm to efficiently pack sparse matrices by merging columns; it allows collisions and significantly reduces the number of zero-valued entries mapped to the systolic array.
2. We present a 2D systolic array design that employs conditional execution to efficiently handle sparse matrices with a high degree of sparsity, while achieving TPU-like performance on dense matrices.
3. We evaluate our design on a suite of synthetic sparse matrices, as well as real-world matrices spanning a range of application domains.
## Basic Column Packing Algorithm

The algorithm selects a candidate column and packs it into a group as long as there is no collision between this column and any column already in the group. A collision between two columns occurs when they have a non-zero entry at the same row index. When the candidate column collides with every existing multi-column group, a new group is created to hold it. This process stops once every column has been processed.
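The greedy procedure above can be sketched as follows. This is our own minimal illustration, not the paper's implementation; each column is represented simply as the set of row indices holding non-zeros.

```python
def pack_columns(columns):
    """Greedy collision-free column packing (sketch of the basic
    algorithm described above).
    `columns` maps column id -> set of row indices with non-zeros.
    Two columns collide if their row-index sets intersect."""
    groups = []  # each group: (merged row-index set, list of column ids)
    for cid, rows in columns.items():
        for rowset, members in groups:
            if not (rowset & rows):   # no collision with this group
                rowset |= rows
                members.append(cid)
                break
        else:                          # collided with every group
            groups.append((set(rows), [cid]))
    return [members for _, members in groups]

# Columns 0 and 1 collide at row 2; column 2 fits alongside column 0.
print(pack_columns({0: {0, 2}, 1: {1, 2}, 2: {3}}))  # [[0, 2], [1]]
```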
## Optimizations For More Efficient Packing
#### Partition-wise Packing

The sparse matrix is first partitioned vertically by the length of the systolic array, then column packing is performed on each partition. Finally, the packed partitions are further divided horizontally into blocks that are mapped onto the systolic array.
#### Collision-aware Packing

In the proposed collision-aware scheme, we use a two-step packing algorithm (Fig. 3a). In the first step, relaxed-collision packing is performed iteratively: `MAX_COLLISION_ROW` sets the maximum number of allowed collisions per row of a multi-column group, and `N_ITER` is the number of iterations of the relaxed-collision packing method. Based on empirical observations in our experiments, the two parameters are set to one and two, respectively.
Successive iterations use progressively tighter requirements: `MAX_COLLISION_GROUP`, the maximum number of collisions per group, is set to 15 for the first iteration and 4 for the second. Within the first step, the entries that collided in the first iteration become additional packing candidates for the second iteration. Even after two iterations, some candidate entries may remain; the second step packs these residuals with conventional collision-free packing. Since the number of residual entries left by the first step is small, they can be packed into a limited number of groups. Fig. 3b shows a simplified illustration of the relaxed-collision method iterated in step 1. The original matrix cannot be packed under the collision-free constraint. With the relaxed constraint, the column pairs C0, C1, C2, C4 and C3, C5 can first be packed into three intermediate groups M0, M1, M2 plus three collided columns R0, R1, R2. These can then be packed into the final groups G0, G1, G2.
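One relaxed-collision pass of step 1 might look like the sketch below. This is our simplified model, not the paper's code: the parameters mirror the per-row and per-group collision caps described above, and colliding entries are split off into a residual set to be re-packed in a later iteration or in the collision-free step 2.

```python
def relaxed_pack(columns, max_coll_row=1, max_coll_group=15):
    """One relaxed-collision packing pass (illustrative sketch).
    `columns` maps column id -> set of row indices with non-zeros.
    Returns (groups, residual): `residual` maps column id -> the row
    entries that collided and must be packed again later."""
    groups = []   # each: {"count": {row: #cols}, "members": [...], "ncoll": int}
    residual = {}
    for cid, rows in columns.items():
        for g in groups:
            new_coll = sum(1 for r in rows if r in g["count"])
            # adding this column keeps <= max_coll_row collisions per row?
            per_row_ok = all(g["count"].get(r, 0) <= max_coll_row for r in rows)
            if per_row_ok and g["ncoll"] + new_coll <= max_coll_group:
                collided = {r for r in rows if r in g["count"]}
                if collided:                  # split off colliding entries
                    residual[cid] = collided
                for r in rows:
                    g["count"][r] = g["count"].get(r, 0) + 1
                g["ncoll"] += new_coll
                g["members"].append(cid)
                break
        else:                                  # collided with every group
            groups.append({"count": {r: 1 for r in rows},
                           "members": [cid], "ncoll": 0})
    return [g["members"] for g in groups], residual

# All three columns merge into one group; the overlapping entries
# (row 1 of column 1, row 0 of column 2) become residual candidates.
print(relaxed_pack({0: {0, 1}, 1: {1, 2}, 2: {0, 3}}))
```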
Fig. 3c shows that using collision-aware packing on top of partition-wise packing achieves 1.86× higher packing efficiency than solely adopting partition-wise packing for matrices with moderate (0.05) to high (0.45) densities.
## STPU OPERATION

The input vector, x, is streamed into the network from the top. The partial results then propagate through the network cycle-by-cycle, starting from the top left corner and proceeding downward and to the right in a diagonal wave, as represented by the different colors. Unlike the one-to-one mapping between vector elements and PE columns employed in the TPU, in STPU the input vector elements need to be grouped and streamed into their corresponding PE columns, since the sparse matrix is in a packed format.

If matrix block i is already loaded in the array, the elements of the vector are input into the array from cycle 0 to cycle SAH − 1, where SAH is the systolic array height. Meanwhile, each row of block i+1 is loaded one after another from the left edge of the systolic array. For block i+1, the vector can be input starting from cycle SAH − 1, the time when the first row of block i+1 is fully loaded and has propagated to its destination.
Without loss of generality, we assume the sparse matrix is packed into a 512×256 compacted matrix. On a 128×128 TPU, performing SpMV with the original matrix takes 16 iterations, each of which has a 128-cycle latency. As shown, STPU outperforms the TPU in terms of latency (since the matrix is packed) by performing SpMV in fewer iterations (8 in this example) while incurring minimal overhead from overlapping; see Fig. 5b.
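The iteration counts in this example follow from simple block tiling. As a back-of-the-envelope check (our model, assuming the 16 iterations imply a 512×512 original matrix), the number of SpMV iterations is just the number of 128×128 blocks needed to tile the matrix:

```python
import math

def num_blocks(rows, cols, dim=128):
    """Number of dim x dim blocks tiling a rows x cols matrix,
    i.e. the number of SpMV iterations on a dim x dim array."""
    return math.ceil(rows / dim) * math.ceil(cols / dim)

print(num_blocks(512, 512))  # 16 iterations for the original matrix (TPU)
print(num_blocks(512, 256))  # 8 iterations for the packed matrix (STPU)
```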
## SpMV with Large Matrices on STPU

Double buffering improves throughput by loading and computing simultaneously. For the TPU with double buffering, two diagonal streams of input elements come with a one-cycle gap, and the next two streams are scheduled SAH cycles later. This pattern then repeats, as shown in Fig. 6b.
Without proper input scheduling, a PE may not be able to fetch the correct partial-sum from its left PE, because the partial-sum may be overwritten.
1. to avoid datapath conflicts, an input group IGj for block Bi+1 has to follow the previous group IGj for block Bi;
2. to maintain correctness, the last element of IGj for block Bi+1 should be input after the last element of IGj+1 for block Bi streams into the neighboring PE on the right.
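The two ordering constraints above can be captured in a small checker. This is a deliberately simplified model we wrote for illustration (the paper's scheduling is cycle-accurate and more detailed): we assume each input group streams one element per cycle and takes one extra cycle to hop into the right-neighbor PE.

```python
def schedule_ok(start, length):
    """Validate the two input-scheduling constraints for a pair of
    consecutive blocks 0 and 1 (standing in for Bi and Bi+1).
    `start[(block, j)]` is the cycle when input group j of that block
    begins streaming; `length[j]` is the group's element count."""
    groups = sorted(j for (b, j) in start if b == 0)
    for j in groups:
        # (1) IGj of block i+1 must follow IGj of block i
        if start[(1, j)] < start[(0, j)] + length[j]:
            return False
        # (2) last element of IGj of block i+1 enters only after the
        # last element of IGj+1 of block i reaches the right-hand PE
        if j + 1 in length:
            last_in = start[(1, j)] + length[j]
            needed = start[(0, j + 1)] + length[j + 1] + 1  # +1 hop right
            if last_in < needed:
                return False
    return True

print(schedule_ok({(0, 0): 0, (0, 1): 3, (1, 0): 7, (1, 1): 10},
                  {0: 3, 1: 3}))  # True
```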
## STPU Architecture


* Hold: The PE selects the matching vector element from a group of streaming input vector elements. If the input index Xidx does not match the held index Widx, and Wval is non-zero, the PE retains the value of Ain.
* Latch: When the PE encounters an element that matches the index it holds, i.e. when the input index Xidx matches Widx and Wval is non-zero, the input value Xval is copied and stored until the end signal arrives.
* Accumulate: The PE starts performing MAC operations after all input elements have streamed through it. When a delayed end signal XEND corresponding to the end of the input vector group arrives, the stored value Xval is multiplied with Wval, summed with Ain, and the partial-sum is stored into Ain of the PE to the right.
* Bypass: Generally, a sparse matrix cannot be packed into a completely dense matrix, so the compressed matrix is still sparse and contains zero values. To conserve the power consumed by a PE in this scenario (Wval is zero), Ain is bypassed from the PE into the Ain of its right neighbor, while all the compute elements are powered off.
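The four modes above amount to a small per-PE state machine, which can be sketched behaviorally as follows. This is our own model written from the description (signal names mirror the text; the cycle-level hardware behavior is of course richer than this):

```python
class PE:
    """Behavioral sketch of one STPU PE's conditional execution:
    hold / latch / accumulate / bypass."""
    def __init__(self, widx, wval):
        self.widx, self.wval = widx, wval  # held matrix entry
        self.xval = None                   # latched input value

    def step(self, xidx, xval, a_in, end):
        """Process one streamed element. Returns the value passed to
        the right neighbor's Ain, or None while holding."""
        if self.wval == 0:                 # Bypass: zero entry, PE gated off
            return a_in if end else None
        if end:                            # Accumulate on the XEND signal
            return a_in + self.wval * self.xval
        if xidx == self.widx:              # Latch the matching element
            self.xval = xval
        return None                        # Hold: keep waiting

pe = PE(widx=2, wval=3)
pe.step(0, 10, 0, False)             # hold: index mismatch
pe.step(2, 4, 0, False)              # latch Xval = 4
print(pe.step(None, None, 5, True))  # accumulate: 5 + 3*4 = 17
```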
## EVALUATION
Our proposed partition-wise packing scheme is used in both strategies for its significant reduction in collision probability and its improved non-zero (NNZ) density. The difference between the two schemes is that:
1. STPU-C0 maintains a mandatory collision-free constraint
2. STPU-C1 relaxes the constraint by using collision-aware packing (detailed in Section 2).
### 5.1 Packing Algorithm Efficiency

Fig. 8a and Fig. 8b show the resulting packing efficiency for ultra sparse matrices (i.e. with density ≤ 0.01) and moderately sparse matrices (i.e. with density ≥ 0.05) for both STPU and KMZ.
With STPU, for the ultra sparse matrices, STPU-C0 and STPU-C1 respectively achieve 144.0× and 143.4× packing efficiency. For the moderately sparse matrices, STPU-C0 and STPU-C1 achieve 4.48× and 6.92× better density, respectively. For the most sparse case (i.e. a dimension of 10,000 and a density of 0.001), the packing efficiency is as large as 520.2× with the STPU-C0 scheme.
:::info
STPU-C0 performs slightly better than STPU-C1 when packing ultra sparse matrices because collisions are rarely seen in these matrices:
**STPU-C0 can already reach a high density level without allowing collisions,
while STPU-C1 unnecessarily relaxes the constraint and creates redundant groups.**
:::
On the other hand, when handling moderately sparse matrices, STPU-C1 outperforms STPU-C0.
:::info
The reason is that even when collisions exist, STPU-C1 can pack columns with a limited number of collisions that STPU-C0 cannot pack at all.
**This difference demonstrates that STPU-C1 is more effective for less sparse matrices.**
:::
For the packing algorithm proposed in KMZ, as shown in the figure, an average packing efficiency of 8.0× is seen for the ultra sparse matrices, whereas a 1.07× improvement is achieved for the moderately sparse matrices.
These results illustrate that our proposed packing algorithm outperforms the prior work.
### 5.2 Performance and Energy for SpMV
**Scalability Analysis**
In the first experiment, we fix the size of the sparse matrices and vary the density, showing energy comparisons between STPU and the TPU running SpMV on sparse matrices of increasing density.



