# Format Abstraction for Sparse Tensor Algebra Compilers
###### paper: [link](https://arxiv.org/pdf/1804.10112.pdf)
## Outline
- [Introduction](##Introduction)
- [Research Problem](###Research-Problem)
- [Contribution](###Contribution)
- [**Tensor Storage Formats**](##Tensor-Storage-Formats)
- [Survey of Tensor Formats](###Survey-of-Tensor-Formats)
- [Computing with Disparate Tensor Formats](###Computing-with-Disparate-Tensor-Formats)
- [**Tensor Storage Abstraction**](##Tensor-Storage-Abstraction)
- [Coordinate Hierarchies](###Coordinate-Hierarchies)
- [Level Access Capabilities](###Level-Access-Capabilities)
- [Level Properties](###Level-Properties)
- [Level Output Assembly Capabilities](###Level-Output-Assembly-Capabilities)
- [**Code Generation**](##Code-Generation)
- [Background](###Background)
- [Property-Based Merge Lattice Optimizations](###Property-Based-Merge-Lattice-Optimizations)
- [Level Iterator Conversion](###Level-Iterator-Conversion)
- [Level Merge Code](###Level-Merge-Code)
- [Code Generation Algorithm](###Code-Generation-Algorithm)
- [Evaluation](##Evaluation)
- [Conclusion](##Conclusion)
## Introduction
- This paper shows how to build a sparse tensor algebra compiler that is agnostic to tensor formats (data layouts).
- They develop an interface that describes formats in terms of their capabilities and properties, and show how to build a modular code generator where new formats can be added as plugins.
### Research Problem
To compute efficiently with any format, a tensor algebra compiler must emit distinct code that cheaply iterates over each format. Because the number of format combinations grows exponentially with the number of supported formats, however, directly hard-coding support for each combination in the code generator does not scale.
### Contribution
- **Levelization.** They survey many known tensor formats and show that they can be represented as hierarchical compositions of just six per-dimension level formats.
- **Level abstraction.** They describe an abstraction for level formats that hides the details of how a level format encodes a tensor dimension behind a common interface.
- **Modular code generation.** They present a code generation technique that emits code to efficiently compute on tensors stored in any combination of formats.
## Tensor Storage Formats
### Survey of Tensor Formats
They survey widely used tensor formats, including dense arrays, COO, CSR, CSC, DCSR, BCSR, CSF, DIA, ELL, and hash maps, and observe that each can be described dimension by dimension.
### Computing with Disparate Tensor Formats
Different formats may use vastly dissimilar data structures to encode tensor coordinates and thus need very different code to iterate over them.
Efficiently computing with tensors stored in different combinations of formats can also require completely different strategies for simultaneously iterating over multiple formats.
- Example: iterating over a CSR matrix and a COO matrix requires different loop structures (see the SpMV sketch after this list).
- To obtain good performance with all tensor formats, **a code generator therefore needs to be able to emit specialized code for any combination of distinct formats.**
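
As a concrete illustration (not taken from the paper), the sketch below contrasts the loop structures needed for SpMV (y = Ax) when A is stored in CSR versus COO; array names like `pos`, `crd`, and `vals` follow common conventions and are my own choices.

```c
/* Minimal sketch (not taco's generated code): the same computation, y = A*x,
 * needs different iteration code for different formats. */

/* CSR SpMV: pos[i]..pos[i+1] delimits the nonzeros of row i, so the kernel
 * can walk the matrix row by row. */
void spmv_csr(int nrows, const int *pos, const int *crd, const double *vals,
              const double *x, double *y) {
  for (int i = 0; i < nrows; i++) {
    double sum = 0.0;
    for (int p = pos[i]; p < pos[i + 1]; p++)
      sum += vals[p] * x[crd[p]];
    y[i] = sum;
  }
}

/* COO SpMV: every nonzero stores both of its coordinates, so the kernel scans
 * the whole nonzero list and scatters partial results into y, which must be
 * zero-initialized by the caller. */
void spmv_coo(int nnz, const int *rowcrd, const int *colcrd, const double *vals,
              const double *x, double *y) {
  for (int p = 0; p < nnz; p++)
    y[rowcrd[p]] += vals[p] * x[colcrd[p]];
}
```
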
## Tensor Storage Abstraction
Since a tensor algebra compiler must generate specialized code for every supported combination of formats to obtain good performance, they propose a tensor storage abstraction that generalizes common patterns of accessing tensor storage and guides a format-agnostic code generation algorithm.
### Coordinate Hierarchies
A coordinate hierarchy represents the nonzeros of a tensor as a tree in which each level encodes the coordinates along one tensor dimension; each root-to-leaf path spells out the full coordinates of one nonzero.
- They propose six level formats that suffice to represent all the tensor formats described in Tensor Storage Formats.
1. Dense

Stores the size of the corresponding dimension (N) and implicitly encodes every coordinate in the interval [0, N).
2. Compressed

Stores only the coordinates with nonzero descendants, using a pos array that delimits each parent's segment of a crd coordinate array (as in CSR's row pointer and column index arrays).
3. Singleton

Stores exactly one coordinate per parent position in a crd array, with no pos array (as in the column level of COO).
4. Range

Encodes, for each ancestor, the coordinates in a contiguous interval whose bounds are derived from per-ancestor metadata such as an offset (used, for example, to express DIA).
5. Offset

Encodes each coordinate as a coordinate from an ancestor level shifted by a per-ancestor offset, without storing it explicitly (also used, for example, for DIA).
6. Hashed

Stores the coordinates in a hash map, which provides random access (locate) to unordered coordinates.
- Common tensor formats can be composed from these per-dimension level formats (a concrete CSR/COO example appears at the end of this subsection).

- The precise semantics of each level format are encoded in level functions that the format implements; these functions fully describe how the data structures associated with the level format are interpreted as a coordinate hierarchy level.
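
To make the composition idea concrete, here is a small illustration (my own example, with conventional array names) of one 3×3 matrix stored as CSR = (dense, compressed) and as COO = (compressed, singleton):

```c
/* The matrix
 *   [0 2 0]
 *   [3 0 4]
 *   [0 0 0]
 * expressed as two different compositions of level formats. */

/* CSR = (dense, compressed): the row level is dense (only its size is stored);
 * the column level is compressed (pos delimits each row's segment of crd). */
int    csr_rows   = 3;                /* dense level: N = 3           */
int    csr_pos[]  = {0, 1, 3, 3};     /* compressed level: pos array  */
int    csr_crd[]  = {1, 0, 2};        /* compressed level: crd array  */
double csr_vals[] = {2, 3, 4};

/* COO = (compressed non-unique, singleton): the row level explicitly lists the
 * row of every nonzero (duplicates allowed), and the singleton column level
 * stores exactly one column coordinate per row entry. */
int    coo_row_pos[] = {0, 3};        /* one root segment covering all nonzeros */
int    coo_row_crd[] = {0, 1, 1};     /* row of each nonzero                    */
int    coo_col_crd[] = {1, 0, 2};     /* singleton level: column of each nonzero */
double coo_vals[]    = {2, 3, 4};
```
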

### Level Access Capabilities
Every coordinate hierarchy level provides a set of capabilities that can be used to access or modify its coordinates. Each capability is exposed as a set of level functions with a fixed interface that a level must implement to support the capability.
- The figure below identifies the capabilities that each level format supports.

Supported capabilities and properties of each level type.
- V - a level type supports **coordinate value iteration**
- P - a level type supports **coordinate position iteration**
- I - a level type supports **insert**
	- A - a level type supports **append**
- Level capabilities provide an abstraction for manipulating physical indices (tensor storage) in a format-agnostic manner.
- The abstract interface to a coordinate hierarchy level exposes three different access capabilities (sketched in code at the end of this subsection):
1. Coordinate value iteration
The coordinate value iteration capability iterates directly over coordinate values, mapping each coordinate to the position where it is stored.
2. Coordinate position iteration
The coordinate position iteration capability, on the other hand, iterates over the positions at which coordinates are stored and retrieves the coordinate at each position.
3. Locate
The locate capability provides random access into a coordinate hierarchy level through a function that computes the position of a coordinate.

- Every level must provide coordinate value iteration or coordinate position iteration and may optionally also provide the locate capability.
- Their code generation algorithm will emit code that calls level functions to access physical tensor storage, which ensures the algorithm is not tied to any specific types of data structures.
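
The sketch below illustrates, under assumed struct layouts and function names (not taco's exact level-function interface), how a compressed level might implement coordinate position iteration and how a dense level might implement locate:

```c
/* Hedged sketch of access capabilities for two level types. */

/* Compressed level: supports coordinate position iteration. The positions of a
 * parent's children are the contiguous range pos[pk1]..pos[pk1+1], and the
 * coordinate stored at a position is read from crd. */
typedef struct { const int *pos; const int *crd; } compressed_level;

static void compressed_pos_bounds(const compressed_level *lvl, int pk1,
                                  int *pbegin, int *pend) {
  *pbegin = lvl->pos[pk1];
  *pend   = lvl->pos[pk1 + 1];
}

static int compressed_pos_access(const compressed_level *lvl, int pk) {
  return lvl->crd[pk];
}

/* Dense level: supports coordinate value iteration over [0, N) and locate,
 * since the position of any coordinate can be computed directly. */
typedef struct { int N; } dense_level;

static int dense_locate(const dense_level *lvl, int pk1, int ik) {
  return pk1 * lvl->N + ik;   /* position of coordinate ik under parent pk1 */
}
```
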
### Level Properties
Level properties describe characteristics of a level, such as whether coordinates are arranged in order, and are invariants that are explicitly enforced or implicitly assumed by the underlying physical index. Some coordinate hierarchy levels may be configured with a property, depending on the application. Configurable properties reflect invariants that are not tied to how a physical index encodes coordinates.
- A coordinate hierarchy level may also declare up to five properties:

1. Full
A level is full if every collection of coordinates that share the same ancestors encompasses all valid coordinates along the corresponding tensor dimension.
2. Unique
A level is unique if no collection of coordinates that share the same ancestors contains duplicates.
3. Ordered
A level is ordered if coordinates that share the same ancestors are ordered in increasing value, coordinates with different ancestors are ordered lexicographically by their ancestors, and duplicates are ordered by their parents’ positions.
4. Branchless
A level is branchless if no coordinate has a sibling and each coordinate in the previous level has a child.
5. Compact
A level is compact if no two coordinates are separated by an unlabeled node that does not encode a coordinate.
### Level Output Assembly Capabilities
A level may also provide **insert** and **append** capabilities for adding new coordinates to the level. These capabilities let us assemble, in a format-agnostic manner, the data structures that store the result of a computation.
- Insert capability & append capability
- Insert
Inserts coordinates at any position.
- Append
Appends coordinates to a level.
- Definitions of level functions that implement assembly capabilities for various level types.
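
The following is a hedged sketch of what such assembly level functions might look like for a compressed level (append) and a dense level (insert); the names and data layouts are illustrative assumptions, not the paper's exact definitions.

```c
/* Append (supported by, e.g., compressed levels): result coordinates are added
 * in order, each parent's edge count is recorded, and a finalize step turns
 * the counts into a pos array by a prefix sum (pos[0] is assumed to be 0). */
typedef struct { int *pos; int *crd; } compressed_out;

static void compressed_append_coord(compressed_out *lvl, int pk, int ik) {
  lvl->crd[pk] = ik;                   /* store coordinate ik at position pk */
}

static void compressed_append_edges(compressed_out *lvl, int pk1,
                                    int pbegin, int pend) {
  lvl->pos[pk1 + 1] = pend - pbegin;   /* number of children appended for pk1 */
}

static void compressed_append_finalize(compressed_out *lvl, int nparents) {
  for (int i = 1; i <= nparents; i++)  /* prefix sum turns counts into offsets */
    lvl->pos[i] += lvl->pos[i - 1];
}

/* Insert (supported by, e.g., dense levels): a value can be written at any
 * position and no explicit index data is stored, so inserting a coordinate is
 * a no-op on the index structures. */
static void dense_insert_coord(int pk, int ik) { (void)pk; (void)ik; }
```
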

## Code Generation
They present a code generation algorithm that emits efficient code to compute with tensors stored in any combination of formats. The algorithm supports any tensor format expressible as a composition of level formats.
### Background
Computing an expression in index notation requires **merging** its operands—that is to say, iterating over the joint iteration space of the operands—dimension by dimension.
- Example
Code to add sparse matrices must iterate over only rows that have nonzeros in either matrix and, for each row, iterate over components that are nonzero in either matrix. Additive computations must iterate over the union of the operand nonzeros (i.e., a union merge), while multiplicative ones must iterate over the intersection of the operand nonzeros (i.e., an intersection merge).
- For matrix add

- B is a CSR matrix
- C is a COO matrix
- Output is a dense matrix.
(a) The iteration graph for matrix addition with a CSR matrix and a COO matrix as inputs and a row-major dense matrix as output. From this iteration graph, we can determine that computing the operation requires iterating over the row dimension before the column dimension.
(b) The merge lattices for matrix addition, after removing any lattice point that does not merge every full tensor dimension (i.e., a dimension represented by a full coordinate hierarchy level) that the index variable *v* indexes into; this illustrates the property-based merge lattice optimization.
(c) The merge lattices for matrix addition. To fully merge B and C's column dimensions, we would start by running the loop that corresponds to the top lattice point in (c), computing A~ij~ = B~ij~ + C~ij~ (a hand-written sketch of a union merge loop in this style follows).
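
As a rough illustration, here is a hand-written sketch (not compiler-generated code, and for two sorted sparse vectors rather than the exact CSR/COO case) of a two-way union merge into a dense output. Its structure mirrors the lattice: the first loop co-iterates both operands (the top lattice point), and the remaining loops handle the points where one operand is exhausted.

```c
/* Union merge: a = b + c, with b and c stored as sorted compressed levels and
 * a dense and zero-initialized. Array names are illustrative. */
void sparse_vector_add(int nb, const int *bcrd, const double *bvals,
                       int nc, const int *ccrd, const double *cvals,
                       double *a) {
  int pb = 0, pc = 0;
  while (pb < nb && pc < nc) {              /* top lattice point: merge b and c */
    int ib = bcrd[pb], ic = ccrd[pc];
    int i  = ib < ic ? ib : ic;             /* next coordinate in the union */
    if (ib == i && ic == i)  a[i] = bvals[pb] + cvals[pc];
    else if (ib == i)        a[i] = bvals[pb];
    else                     a[i] = cvals[pc];
    if (ib == i) pb++;                      /* advance iterators that produced i */
    if (ic == i) pc++;
  }
  while (pb < nb) { a[bcrd[pb]] = bvals[pb]; pb++; }  /* only b remains */
  while (pc < nc) { a[ccrd[pc]] = cvals[pc]; pc++; }  /* only c remains */
}
```
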
### Property-Based Merge Lattice Optimizations
- They reformulate the optimizations with respect to **properties** and **capabilities** of coordinate hierarchy levels, so that they can be applied to other level types.
- In particular, given a merge lattice for index variable *v*, their algorithm removes any lattice point that does not merge every full tensor dimension (i.e., a dimension represented by a full coordinate hierarchy level) that *v* indexes into.
- This is valid because a full tensor dimension is a superset of any sparse dimension, so once we have finished iterating over and merging a full dimension, we must also have visited every coordinate in the joint iteration space.
### Level Iterator Conversion
Level iterator conversion **turns iterators over unordered and non-unique levels without the locate capability into iterators with any desired properties.**

- Two types of iterator conversion
- Deduplication
		- Duplicate coordinates complicate merging because they result in code that repeatedly visits the same points in the iteration space.
		- Deduplication removes duplicates from iterators over ordered but non-unique levels using a deduplication loop (sketched at the end of this subsection).
- Reordering
- A precondition for code to co-iterate over levels is that coordinates are enumerated in order.
- Reordering uses a scratch array to store an ordered copy of an unordered level and replaces iterators over the unordered level with iterators over the ordered copy.
- The code generation algorithm can then emit code that merges unordered levels by co-iterating over the ordered copies.
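
Below is a hedged sketch of such a deduplication loop for an ordered but non-unique compressed level (for example, the row level of COO); array and function names are illustrative.

```c
/* Aggregate duplicate entries that share the same coordinate so that the rest
 * of the kernel sees each coordinate at most once. out is dense and
 * zero-initialized; crd is sorted but may contain repeated coordinates. */
void sum_duplicates(int pbegin, int pend, const int *crd, const double *vals,
                    double *out) {
  int p = pbegin;
  while (p < pend) {
    int    i = crd[p];
    double v = vals[p];
    while (p + 1 < pend && crd[p + 1] == i) {   /* deduplication loop */
      p++;
      v += vals[p];
    }
    out[i] += v;                                /* coordinate i handled once */
    p++;
  }
}
```
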
### Level Merge Code
Merging dimensions of tensor operands is equivalent to merging the coordinate hierarchy levels that represent those dimensions. The most efficient method for merging levels depends on the properties and supported capabilities of the merged levels.
- The asymptotically most efficient strategy for computing an intersection merge depends on whether the inputs are ordered or unique and whether they support the locate capability.

These are the same strategies that their code generation algorithm selects.
- If neither input vector supports the locate capability, we can co-iterate over the coordinate hierarchy levels that represent those vectors and compute a new output component whenever we encounter nonzero components in both inputs that share the same coordinate. This method can be used in conjunction with the level iterator conversions.
- If one of the input vectors, y, supports the locate capability (e.g., it is a dense array), we can instead just iterate over the nonzero components of x and, for each component, locate the component with the same coordinate in y.
- We can generalize and combine the two methods described above to compute arbitrarily complex merges involving unions and intersections of any number of tensor operands.
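
The sketch below illustrates the second (locate-based) strategy from the list above with a sparse-dense dot product; it is a hand-written example rather than generated code, and the array names are my own.

```c
/* Intersection merge when one operand supports locate: iterate over the
 * nonzeros of the sparse vector x (compressed level) and locate the matching
 * component of the dense vector y by direct indexing, instead of co-iterating
 * both operands. */
double sparse_dense_dot(int nx, const int *xcrd, const double *xvals,
                        const double *y) {
  double sum = 0.0;
  for (int p = 0; p < nx; p++)
    sum += xvals[p] * y[xcrd[p]];   /* locate in a dense level is just indexing */
  return sum;
}
```
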
### Code Generation Algorithm
- They present an algorithm that supports many disparate tensor formats and does not need modification to add support for more level types and tensor formats.
- At points in the joint iteration space, the emitted code computes result values and assembles the output tensor by calling relevant assembly capability level functions. The emitted code is then specialized to compute with specific tensor formats by mechanically inlining all level function calls. This approach bounds the complexity of the code generation mechanism, since it only needs to account for a finite and fixed set of level capabilities and properties.
- Their algorithm takes as input a tensor algebra expression and recursively calls itself on index variables in the expression, in the order given by the corresponding iteration graph.
- At each recursion level, it generates code for one index variable in the input expression.
- The algorithm begins by emitting code that initializes iterators over input coordinate hierarchy levels, which entails calling their appropriate coordinate value or position iteration level functions. It also emits code to perform any necessary iterator conversion.
- The algorithm additionally constructs a merge lattice at every recursion level for the corresponding index variable in the input tensor algebra expression. This is done by applying the merge lattice construction algorithm proposed by Kjolstad et al. and simplifying the resulting lattice with optimization.
- For every point in the simplified merge lattice, the algorithm then emits a loop to merge the coordinate hierarchy levels that represent input tensor dimensions that need to be merged. The subset of merged levels that each loop must co-iterate is determined by applying the recursive approach described in [Level Merge Code](###Level-Merge-Code) and the second optimization described in [Property-Based Merge Lattice Optimizations](###Property-Based-Merge-Lattice-Optimizations).
- Within each loop, the generated code dereferences (potentially converted) iterators over the levels that must be co-iterated, making sure to not inadvertently dereference any iterator that has exceeded its ending bound.
- The next coordinate to be visited in the joint iteration space is then computed and used to index into the levels that can instead be accessed with the locate capability.
- At the end of each loop iteration, the generated code advances every iterator that referenced the merged coordinate, so that subsequent iterations of the loop will not visit the same coordinate again.
- Within each loop, the generated code must also actually compute the value of the result tensor at each coordinate as well as assemble the output indices. The latter entails emitting code that calls the appropriate assembly capability level functions to store result nonzeros in the output data structures. The algorithm emits specialized compute and assembly code for each merge lattice point.
- Fusing Iterators
	- By default, at every recursion level, the algorithm emits loops that each iterate over a single coordinate hierarchy level of each input tensor; for branchless levels, however, it can be more efficient to traverse several consecutive levels with one loop.
	- The algorithm implements this optimization by fusing iterators over branchless levels with iterators over their preceding levels. This is legal as long as the fused iterators do not need to participate in co-iteration (i.e., if the other levels to be merged can be accessed with locate).
	- The algorithm then avoids emitting loops for levels accessed by fused iterators, which eliminates unnecessary branching overhead.

	- Example: for a COO matrix, the generated code iterates over both tensor dimensions with a single loop over the nonzero positions (sketched below).
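
Below is a hedged sketch of what such a fused loop might look like for a COO matrix stored as a (compressed, singleton) hierarchy; the example computation (scaling into a dense output) and array names are my own.

```c
/* Because the singleton column level is branchless, its iterator fuses with
 * the row level's, and a single loop over nonzero positions covers both
 * dimensions. Computes A = 2 * B with A dense, row-major, zero-initialized. */
void scale_coo(int nnz, const int *rowcrd, const int *colcrd,
               const double *vals, int ncols, double *A) {
  for (int p = 0; p < nnz; p++)
    A[rowcrd[p] * ncols + colcrd[p]] += 2.0 * vals[p];
}
```
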
## Evaluation
### Experimental Setup
- They used taco with their extension to generate kernels that compute various sparse linear and tensor algebra operations drawn from real-world applications, including SpMV (sparse matrix-vector multiplication) and MTTKRP (matricized tensor times Khatri-Rao product).
### Sparse Matrix Computations
- Normalized execution time of SpMV (y = Ax) with matrix A stored in various formats

- Normalized execution time of SpDM (sparse matrix-dense matrix multiplication, A = BC) where B is stored in COO and A and C are dense matrices.

- Normalized execution time of COO matrix addition (A = B + C, where all matrices are stored in the AoS COO format).

- Normalized execution time of CSR matrix addition, relative to taco for each matrix.

### Sparse Tensor Computations

- Execution times of sparse COO tensor algebra kernels in milliseconds.

- Figures in parentheses are slowdowns relative to taco.
### Benefits of Supporting Disparate Formats
- Support for various sparse tensor formats by taco with their extension (this work) and without it, as well as by other existing sparse linear and tensor algebra libraries.

- Normalized execution time of CSR SpMV relative to COO SpMV.

- Taking into account the cost of assembling CSR indices for the input matrices, these results show that computing with CSR is faster than with COO (black line) only if the assembly cost can be amortized over multiple iterations.
- Normalized execution time of taco’s DIA SpMV kernel relative to taco’s CSR SpMV kernel.

- Storing the input matrix in the DIA format can accelerate SpMV if all the nonzeros in the matrix are confined to a few densely-filled diagonals, but can drastically degrade performance if that is not the case.
- Normalized execution time of taco’s CSR SpMV with inputs of varying density and input vectors stored in different formats, relative to the most performant format for each configuration.

- Each cell shows results for dense arrays (top), sparse vectors (middle), and hash maps (bottom).
- Cells are highlighted based on which vector format is most performant (blue for dense arrays, green for sparse vectors, red for hash maps).
## Conclusion
- They implemented their technique as an extension to the open-source taco tensor algebra compiler.
- Their technique’s modularity enables it to support many formats and to be extended to new formats without having to modify the code generator.