# Sparse-TPU: Adapting Systolic Arrays for Sparse Matrices
###### tags: `Accelerators`
###### paper origin: `ICS '20`
###### paper: [Link](https://tnm.engin.umich.edu/wp-content/uploads/sites/353/2020/08/2020.6.sparse-tpu_ics2020.pdf)
## Introduction
Two-dimensional (2D) systolic arrays have been proposed for energy-efficient execution of dense matrix operations [18]. One state-of-the-art systolic array solution is Google's Tensor Processing Unit (TPU) [18, 33]. Its core is a 2D systolic array of 256×256 identical Processing Elements (PEs) that perform 8-bit Multiply-and-Accumulate (MAC) arithmetic. The latest version (v3) has 128×128 elements and supports floating-point arithmetic.
#### Problem
Sparse algorithms have some disadvantages: they typically use data structures (e.g. compressed sparse row) that are more complex than simple vectors or matrices, which results in <font color = red>**a higher number of memory accesses**</font> per useful computation. Also, because the associated data structures are not easily vectorized, they are not compatible with conventional systolic paradigms.
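To make the memory-access overhead concrete, here is a minimal sketch of the compressed sparse row (CSR) format mentioned above, written for illustration (it is not code from the paper). Note the indirect load through `col_idx` on every multiply: that indirection is the extra memory traffic per useful computation.

```python
# A small 4x4 matrix with 5 non-zeros, stored in CSR as three arrays.
dense = [
    [5, 0, 0, 1],
    [0, 0, 2, 0],
    [0, 3, 0, 0],
    [0, 0, 0, 4],
]

values, col_idx, row_ptr = [], [], [0]
for row in dense:
    for j, v in enumerate(row):
        if v != 0:
            values.append(v)    # non-zero values, row by row
            col_idx.append(j)   # column index of each non-zero
    row_ptr.append(len(values)) # where each row starts in `values`

def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x in CSR form. Each multiply needs an indirect
    access x[col_idx[k]] -- the overhead discussed above."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]  # indirect load
        y.append(acc)
    return y

print(spmv_csr(values, col_idx, row_ptr, [1, 1, 1, 1]))  # [6, 2, 3, 4]
```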
## Target
1. We propose an algorithm to efficiently pack sparse matrices by merging columns; it allows collisions and significantly reduces the number of zero-valued entries mapped to the systolic array.
2. We present a 2D systolic array design that employs conditional execution to efficiently handle sparse matrices with a high degree of sparsity, while achieving TPU-like performance on dense matrices.
3. We evaluate our design on a suite of synthetic sparse matrices, as well as real-world matrices spanning a range of application domains.
## Basic Column Packing Algorithm

The algorithm selects a candidate column and packs it into a group as long as there is no collision between this column and any column already in the group. A collision between two columns occurs when they have a non-zero entry at the same row index. When the candidate column collides with every existing multi-column group, a new group is created to hold it. This process stops once every column has been processed.
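The greedy procedure above can be sketched as follows. This is our own minimal illustration, not the paper's implementation; each column is represented simply as the set of row indices holding non-zeros.

```python
def pack_columns(columns):
    """Greedy collision-free column packing (sketch of the basic
    algorithm described above).
    `columns` maps column id -> set of row indices with non-zeros.
    Two columns collide if their row-index sets intersect."""
    groups = []  # each group: (merged row-index set, list of column ids)
    for cid, rows in columns.items():
        for rowset, members in groups:
            if not (rowset & rows):   # no collision with this group
                rowset |= rows
                members.append(cid)
                break
        else:                          # collided with every group
            groups.append((set(rows), [cid]))
    return [members for _, members in groups]

# Columns 0 and 1 collide at row 2; column 2 fits alongside column 0.
print(pack_columns({0: {0, 2}, 1: {1, 2}, 2: {3}}))  # [[0, 2], [1]]
```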
## Optimizations For More Efficient Packing
#### Partition-wise Packing

The sparse matrix is first partitioned vertically by the length of the systolic array, then column packing is performed on each partition. Finally, the packed partitions are further divided horizontally into blocks that are mapped onto the systolic array.
#### Collision-aware Packing

In the proposed collision-aware scheme, we use a two-step packing algorithm (Fig. 3a). In the first step, relaxed-collision packing is performed iteratively: `MAX_COLLISION_ROW` sets the maximum number of allowed collisions per row of a multi-column group, and `N_ITER` is the number of iterations of the relaxed-collision packing method. Based on empirical observations in our experiments, the two parameters are set to one and two, respectively.
Successive iterations use progressively tighter requirements: `MAX_COLLISION_GROUP`, the maximum number of collisions per group, is set to 15 for the first iteration and 4 for the second. Within the first step, the entries that collided in the first iteration become additional packing candidates for the second iteration. Even after two iterations, some candidate entries may remain; the second step packs these residuals with conventional collision-free packing. Since the number of residual entries left by the first step is small, they can be packed into a limited number of groups. Fig. 3b shows a simplified illustration of the relaxed-collision method iterated in step 1. The original matrix cannot be packed under the collision-free constraint. With the relaxed constraint, the column pairs C0, C1, C2, C4 and C3, C5 can first be packed into three intermediate groups M0, M1, M2 plus three collided columns R0, R1, R2. These can then be packed into the final groups G0, G1, G2.
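One relaxed-collision pass of step 1 might look like the sketch below. This is our simplified model, not the paper's code: the parameters mirror the per-row and per-group collision caps described above, and colliding entries are split off into a residual set to be re-packed in a later iteration or in the collision-free step 2.

```python
def relaxed_pack(columns, max_coll_row=1, max_coll_group=15):
    """One relaxed-collision packing pass (illustrative sketch).
    `columns` maps column id -> set of row indices with non-zeros.
    Returns (groups, residual): `residual` maps column id -> the row
    entries that collided and must be packed again later."""
    groups = []   # each: {"count": {row: #cols}, "members": [...], "ncoll": int}
    residual = {}
    for cid, rows in columns.items():
        for g in groups:
            new_coll = sum(1 for r in rows if r in g["count"])
            # adding this column keeps <= max_coll_row collisions per row?
            per_row_ok = all(g["count"].get(r, 0) <= max_coll_row for r in rows)
            if per_row_ok and g["ncoll"] + new_coll <= max_coll_group:
                collided = {r for r in rows if r in g["count"]}
                if collided:                  # split off colliding entries
                    residual[cid] = collided
                for r in rows:
                    g["count"][r] = g["count"].get(r, 0) + 1
                g["ncoll"] += new_coll
                g["members"].append(cid)
                break
        else:                                  # collided with every group
            groups.append({"count": {r: 1 for r in rows},
                           "members": [cid], "ncoll": 0})
    return [g["members"] for g in groups], residual

# All three columns merge into one group; the overlapping entries
# (row 1 of column 1, row 0 of column 2) become residual candidates.
print(relaxed_pack({0: {0, 1}, 1: {1, 2}, 2: {0, 3}}))
```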
Fig. 3c shows that using collision-aware packing on top of partition-wise packing achieves 1.86× higher packing efficiency than solely adopting partition-wise packing for matrices with moderate (0.05) to high (0.45) densities.
## STPU OPERATION

The input vector, x, is streamed into the network from the top. The partial results then propagate through the network cycle-by-cycle, starting from the top left corner and proceeding downward and to the right in a diagonal wave, as represented by the different colors. Unlike the one-to-one mapping between vector elements and PE columns employed in the TPU, in STPU the input vector elements need to be grouped and streamed into their corresponding PE columns, since the sparse matrix is in a packed format.

If matrix block i is already loaded in the array, the elements of the vector are input into the array from cycle 0 to cycle SAH − 1, where SAH is the systolic array height. Meanwhile, each row of block i+1 is loaded one after another from the left edge of the systolic array. For block i+1, the vector can be input starting from cycle SAH − 1, the time when the first row of block i+1 is fully loaded and has propagated to its destination.
Without loss of generality, we assume the sparse matrix is packed into a 512×256 compacted matrix. On a 128×128 TPU, performing SpMV with the original matrix takes 16 iterations, each of which has a 128-cycle latency. As shown, STPU outperforms the TPU in terms of latency (since the matrix is packed) by performing SpMV in fewer iterations (8 in this example) while incurring minimal overhead from overlapping; see Fig. 5b.
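The iteration counts in this example follow from simple block tiling. As a back-of-the-envelope check (our model, assuming the 16 iterations imply a 512×512 original matrix), the number of SpMV iterations is just the number of 128×128 blocks needed to tile the matrix:

```python
import math

def num_blocks(rows, cols, dim=128):
    """Number of dim x dim blocks tiling a rows x cols matrix,
    i.e. the number of SpMV iterations on a dim x dim array."""
    return math.ceil(rows / dim) * math.ceil(cols / dim)

print(num_blocks(512, 512))  # 16 iterations for the original matrix (TPU)
print(num_blocks(512, 256))  # 8 iterations for the packed matrix (STPU)
```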
## SpMV with Large Matrices on STPU

Double buffering improves throughput by loading and computing simultaneously. For the TPU with double buffering, two diagonal streams of input elements come with a one-cycle gap, and the next two streams are scheduled SAH cycles later. This pattern then repeats, as shown in Fig. 6b.
Without proper input scheduling, a PE may not be able to fetch the correct partial-sum from its left PE, because the partial-sum may be overwritten.
1. to avoid datapath conflicts, an input group IGj for block Bi+1 has to follow the previous group IGj for block Bi;
2. to maintain correctness, the last element of IGj for block Bi+1 should be input after the last element of IGj+1 for block Bi streams into the neighboring PE on the right.
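The two ordering constraints above can be captured in a small checker. This is a deliberately simplified model we wrote for illustration (the paper's scheduling is cycle-accurate and more detailed): we assume each input group streams one element per cycle and takes one extra cycle to hop into the right-neighbor PE.

```python
def schedule_ok(start, length):
    """Validate the two input-scheduling constraints for a pair of
    consecutive blocks 0 and 1 (standing in for Bi and Bi+1).
    `start[(block, j)]` is the cycle when input group j of that block
    begins streaming; `length[j]` is the group's element count."""
    groups = sorted(j for (b, j) in start if b == 0)
    for j in groups:
        # (1) IGj of block i+1 must follow IGj of block i
        if start[(1, j)] < start[(0, j)] + length[j]:
            return False
        # (2) last element of IGj of block i+1 enters only after the
        # last element of IGj+1 of block i reaches the right-hand PE
        if j + 1 in length:
            last_in = start[(1, j)] + length[j]
            needed = start[(0, j + 1)] + length[j + 1] + 1  # +1 hop right
            if last_in < needed:
                return False
    return True

print(schedule_ok({(0, 0): 0, (0, 1): 3, (1, 0): 7, (1, 1): 10},
                  {0: 3, 1: 3}))  # True
```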
## STPU Architecture


* Hold: The PE selects the matching vector element from a group of streaming input vector elements. If the input index Xidx does not match the held index Widx, and Wval is non-zero, the PE retains the value of Ain.
* Latch: When the PE encounters an element that matches the index it holds, i.e. when the input index Xidx matches Widx and Wval is non-zero, the input value Xval is copied and stored until the end signal arrives.
* Accumulate: The PE starts performing MAC operations after all input elements have streamed through it. When a delayed end signal XEND corresponding to the end of the input vector group arrives, the stored value Xval is multiplied with Wval, summed with Ain, and the partial-sum is stored into Ain of the PE to the right.
* Bypass: Generally, a sparse matrix cannot be packed into a completely dense matrix, so the compressed matrix is still sparse and contains zero values. To conserve the power consumed by a PE in this scenario (Wval is zero), Ain is bypassed from the PE into the Ain of its right neighbor, while all the compute elements are powered off.
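The four modes above amount to a small per-PE state machine, which can be sketched behaviorally as follows. This is our own model written from the description (signal names mirror the text; the cycle-level hardware behavior is of course richer than this):

```python
class PE:
    """Behavioral sketch of one STPU PE's conditional execution:
    hold / latch / accumulate / bypass."""
    def __init__(self, widx, wval):
        self.widx, self.wval = widx, wval  # held matrix entry
        self.xval = None                   # latched input value

    def step(self, xidx, xval, a_in, end):
        """Process one streamed element. Returns the value passed to
        the right neighbor's Ain, or None while holding."""
        if self.wval == 0:                 # Bypass: zero entry, PE gated off
            return a_in if end else None
        if end:                            # Accumulate on the XEND signal
            return a_in + self.wval * self.xval
        if xidx == self.widx:              # Latch the matching element
            self.xval = xval
        return None                        # Hold: keep waiting

pe = PE(widx=2, wval=3)
pe.step(0, 10, 0, False)             # hold: index mismatch
pe.step(2, 4, 0, False)              # latch Xval = 4
print(pe.step(None, None, 5, True))  # accumulate: 5 + 3*4 = 17
```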
## EVALUATION
Our proposed partition-wise packing scheme is used in both strategies for its significant reduction in collision probability and its improved non-zero (NNZ) density. The difference between the two schemes is that:
1. STPU-C0 maintains a mandatory collision-free constraint
2. STPU-C1 relaxes the constraint by using collision-aware packing (detailed in Section 2).
### 5.1 Packing Algorithm Efficiency

Fig. 8a and Fig. 8b show the resulting packing efficiency for ultra sparse matrices (i.e. with density ≤ 0.01) and moderately sparse matrices (i.e. with density ≥ 0.05) for both STPU and KMZ.
With STPU, for the ultra sparse matrices, STPU-C0 and STPU-C1 respectively achieve 144.0× and 143.4× packing efficiency. For the moderately sparse matrices, STPU-C0 and STPU-C1 achieve 4.48× and 6.92× better density, respectively. For the most sparse case (i.e. a dimension of 10,000 and a density of 0.001), the packing efficiency is as large as 520.2× with the STPU-C0 scheme.
:::info
STPU-C0 performs slightly better than STPU-C1 when packing ultra sparse matrices because collisions are rarely seen in these matrices:
**STPU-C0 can already reach a high density level without allowing collisions,
while STPU-C1 unnecessarily relaxes the constraint and creates redundant groups.**
:::
On the other hand, when handling moderately sparse matrices, STPU-C1 outperforms STPU-C0.
:::info
The reason is that even when collisions exist, STPU-C1 can pack columns with a limited number of collisions that STPU-C0 cannot pack at all.
**This difference demonstrates that STPU-C1 is more effective for less sparse matrices.**
:::
For the packing algorithm proposed in KMZ, as shown in the figure, an average packing efficiency of 8.0× is seen for the ultra sparse matrices, whereas a 1.07× improvement is achieved for the moderately sparse matrices.
These results illustrate that our proposed packing algorithm outperforms the prior work.
### 5.2 Performance and Energy for SpMV
**Scalability Analysis**
In the first experiment, we fix the size of the sparse matrices and vary the density, showing energy comparisons between STPU and the TPU running SpMV on sparse matrices of increasing density.



