# Automatic Generation of Efficient Sparse Tensor Format Conversion Routines

###### origin: PLDI ’20
###### paper: [link](https://dl.acm.org/doi/pdf/10.1145/3385412.3385963)

## Introduction

They propose a technique to generate efficient sparse tensor conversion routines for a wide range of disparate formats, building on their recent work on sparse tensor algebra compilation.

### Research problem

How to generate code that efficiently converts sparse tensors between disparate storage formats such as CSR, DIA, ELL, and many others.

- To optimize the performance of both data import and compute, an application must convert the tensor from COO to CSR.

### Method

They decompose a large class of tensor conversion algorithms into three phases.

- [Coordinate remapping](##Coordinate-Remapping)
- [Attribute query language](##Attribute-Queries)
- [Tensor assembly](##Sparse-Tensor-Assembly)

### Sparse Tensor Format

###### figure 1

![](https://i.imgur.com/K3JUVhA.png =200x)
![](https://i.imgur.com/By7n2Af.png)

Applications typically need to perform different operations on the same tensor, and each operation may require the tensor to be stored in a distinct format for optimal performance.

- Example
    - Importing data into a sparse tensor can be done efficiently if the tensor is constructed in the COO format or the DOK format.
    - Computing SpMV with the tensor in the CSR format is faster.

## Background

- Coordinate hierarchies

  [Chou et al.](https://dl.acm.org/doi/pdf/10.1145/3276493) describe how tensors stored in disparate formats can be represented as coordinate hierarchies that have varying structures but that expose the same abstract interface.
  ![](https://i.imgur.com/otbm46l.png)
    - This figure shows examples of coordinate hierarchies that represent a tensor in two different formats.
        - left: COO
        - right: CSR
    - Each level in a coordinate hierarchy encodes the nonzeros’ coordinates along one dimension.
    - Each stored component is represented by a path from the root to a leaf, with coordinates along the path representing the component’s coordinates.
- Level formats

  We can decompose sparse tensor formats into level formats that each store a coordinate hierarchy level, which represents a tensor dimension.
  ![](https://i.imgur.com/ds7sVXE.png)
    - This figure shows that the CSR format can be decomposed into two level formats, dense and compressed.
        - The dense level format implicitly encodes all rows using just a parameter N to store the dimension’s size.
            - The level function locate describes how to efficiently random-access coordinates that are stored in a level format.
        - The compressed level format uses two arrays, pos and crd, to store the column coordinates of nonzeros in each row.
            - pos_bounds and pos_access describe how to efficiently iterate over coordinates, with the former specifying the bounds of each iteration and the latter specifying how to access each coordinate.
    - All level formats, however, expose the same static interface consisting of level functions, which describe how to access a format’s data structures, and properties, which describe characteristics of the data as stored.
        - e.g., whether nonzeros are stored in order.
- Generate efficient code
    - The coordinate hierarchy abstraction lets a compiler generate efficient code to iterate over sparse tensors in disparate formats by simply emitting code to traverse coordinate hierarchies.
        - This entails recursively generating a set of nested loops that each iterates over a level in a coordinate hierarchy.
        - The compiler generates each loop by emitting calls to level functions that describe how to efficiently access the level.
    - To obtain code that iterates over a tensor in any desired format, the level function calls are simply replaced with the desired format’s implementations of those level functions.

This approach lets a compiler generate efficient code for disparate formats without hard-coding for any specific format.
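To make the pos/crd encoding and the loop structure described above concrete, here is a small hand-written sketch of CSR traversal. The matrix data is made up for illustration, and the array names only mirror the compressed level format's pos and crd arrays; this is not taco's actual generated code.

```c
/* Hypothetical CSR data for a 4x4 matrix with nonzeros
   A(0,0)=5, A(0,2)=1, A(1,1)=7, A(3,0)=2, A(3,3)=4. */
static const int    A2_pos[] = {0, 2, 3, 3, 5}; /* row segment bounds */
static const int    A2_crd[] = {0, 2, 1, 0, 3}; /* column coordinates */
static const double A_vals[] = {5, 1, 7, 2, 4};

/* SpMV y = A*x in the shape the compiler emits for CSR: an outer loop
   over the dense row level and an inner loop over the compressed column
   level, accessed via the pos_bounds/pos_access pattern. */
void spmv_csr(int nrows, const double *x, double *y) {
    for (int i = 0; i < nrows; i++) {
        y[i] = 0.0;
        /* pos_bounds: the nonzeros of row i live in [pos[i], pos[i+1]) */
        for (int p = A2_pos[i]; p < A2_pos[i + 1]; p++) {
            /* pos_access: recover the column coordinate from crd */
            y[i] += A_vals[p] * x[A2_crd[p]];
        }
    }
}
```

Swapping the inner loop's pos/crd accesses for another format's level function implementations is exactly how the compiler retargets this loop nest to a different source format.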
## Overview

###### Figure 6

![](https://i.imgur.com/4hHfOb2.png)

- Green: coordinate remapping; Yellow: analysis; Blue: assembly

#### [Coordinate remapping phase](##Coordinate-Remapping)

- The coordinate remapping phase iterates over the input tensor and, for each nonzero, **computes new coordinates as functions of its original coordinates**. What additional coordinates are computed depends on the target format. Coordinate remapping conceptually transforms (i.e., remaps) the input tensor to a hypersparse higher-order tensor.
    - (a) lines 2-6 and 20-24: computes a new coordinate k for each nonzero as the difference between its column and row coordinates

#### [Analysis phase](##Attribute-Queries)

- The analysis phase computes statistics about the input tensor that are later used to **determine the amount of memory** to allocate for storing nonzeros in the target format. The exact statistics that are computed also depend on the target format.
    - (a) lines 1-8: computes the set of all nonzero diagonals in the input matrix
    - (b) lines 1-5: computes the maximum number of nonzeros in any row of the input matrix
    - (c\) lines 1-6: computes the number of nonzeros in each row of the input matrix

#### [Assembly phase](##Sparse-Tensor-Assembly)

- The assembly phase iterates over the input tensor and inserts each nonzero into the output data structures. Again, where each nonzero is inserted depends on the target format.
    - (a) computes pB2 as a function of each nonzero’s row coordinate and its offset k (as computed in the coordinate remapping phase), in such a way that nonzeros with the same offset are grouped together in the output (lines 25-26)
    - (c\) simply appends each nonzero to its row’s corresponding segment in the crd array

For each logical phase, they define a language that captures what needs to be performed for disparate target formats.

#### How to use?
###### figure 7

![](https://i.imgur.com/V1RmqkV.png)

- **For each new target tensor format**, a user **must** first specify
    - a coordinate remapping (in green) that, when applied to the input tensor, captures how nonzeros are grouped together and ordered **in the target format**.
- **For each level format**, a user **must** then specify
    - what input tensor statistics to compute (in yellow)
    - how to store the coordinates of nonzeros (in blue)
- To generate code that converts tensors between any two specific formats, their technique combines the aforementioned specifications for the target format with level functions that describe how to iterate over tensors in the source format.
- Summary
    - Given specifications like those in the figure (including level functions) as **inputs**
    - Generate code like [this](######Figure-6)
- In this way, their technique can generate efficient conversion routines for many combinations of formats without needing specifications for each combination.

#### Advantage

- Converting to different tensor formats may require similar steps for only some of the phases, so having each phase be specified separately allows for **reuse of the specifications**.
    - e.g., the COO format uses the same data structure as ELL to store column coordinates. Thus, the two formats can share specifications for assembly.
- Having each logical phase be specified separately gives the compiler flexibility to generate code that fuses logically distinct phases only if it is beneficial.
    - e.g., the code in [figure](######Figure-6) (a) duplicates and fuses coordinate remapping with the analysis and assembly phases to avoid materializing the offsets of nonzeros.

## Coordinate Remapping

### Coordinate remapping notation

They propose a new language called coordinate remapping notation, which precisely describes how a tensor can be remapped so as to capture the various ways that different tensor formats group together and order nonzeros in memory.
Statements in coordinate remapping notation specify how components in a canonical (non-remapped) input tensor map to components in an output tensor of equal or higher order.

- The syntax of coordinate remapping notation
  ![](https://i.imgur.com/LZbmyAi.png)
- Example
    - Given a matrix A as input, **(i,j) -> (j-i, i, j)**
        - maps every component A~ij~ to the corresponding component in the (j-i)^th^ slice of a three-dimensional remapped tensor.
        - [figure](######figure-7)

### Code generation

- To support converting tensors between two specific formats, their technique emits code that iterates over the input tensor and transforms the (canonical) coordinates of each nonzero by applying the target format’s coordinate remapping.
- To compute additional coordinates that are defined purely as arithmetic or bitwise expressions of the original coordinates, their technique simply inlines those expressions directly into the emitted code.
    - [Figure](######Figure-6) (a) lines 6 and 24
- Remappings that contain let expressions are lowered by first emitting code to initialize the local variables and then inlining the expressions that use those local variables.
    - A remapped coordinate let r=i/N in (r&1) | ((r&2)<<2) would be lowered to
      ```
      int r = i/N;
      int m = (r&1) | ((r&2)<<2);
      ```
- Coordinate remappings that contain counters are lowered by emitting a counter array for each distinct counter in the remapping.
    - Remapping (i,j) -> (#i,i,j) applied to a COO matrix:
      ```
      int counter[N] = {0};  // counter array for #i
      for (int p = 0; p < nnz; p++) {
        int i = A1_crd[p];
        int j = A2_crd[p];
        int k = counter[i]++;  // k == #i
        // map A(i,j) to coordinates (k,i,j)
        ...
      }
      ```

## Attribute Queries

To avoid having to constantly reallocate and shuffle around stored nonzeros, many efficient tensor conversion algorithms instead allocate memory in one shot based on statistics about the input tensor.
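The one-shot allocation strategy can be sketched for CSR-style output: a single pass histograms the nonzeros per row, and a prefix sum turns the counts into an exactly-sized pos array, so crd and vals need to be allocated only once. This is a hand-written illustration with assumed array names, not the generated code.

```c
/* Count the nonzeros in each row of a COO matrix in one pass.
   row_crd holds the row coordinate of each of the nnz nonzeros. */
void row_histogram(int nnz, const int *row_crd, int nrows, int *count) {
    for (int i = 0; i < nrows; i++) count[i] = 0;
    for (int p = 0; p < nnz; p++) count[row_crd[p]]++;
}

/* Prefix-sum the counts into a CSR-style pos array of length nrows+1;
   pos[nrows] is then the exact size to allocate for crd and vals. */
void counts_to_pos(int nrows, const int *count, int *pos) {
    pos[0] = 0;
    for (int i = 0; i < nrows; i++) pos[i + 1] = pos[i] + count[i];
}
```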
- If we want to convert a matrix to ELL without dynamically resizing the crd and vals arrays, one must first determine the maximum number of nonzeros K stored in any row. [figure](######figure-1)
    - If the input is COO
        - Computing K requires constructing a histogram that records the number of nonzeros in each row, which in turn requires examining all the nonzeros in the matrix.
    - If the input is CSR
        - K can be directly computed from the pos array.

They propose a new language called the attribute query language that describes statistics of sparse tensors as aggregations over the coordinates of their nonzeros.

- The attribute query language is declarative, and attribute queries are specified independently of how the input tensor is actually stored. Their technique can thus generate efficient tensor conversion routines while only **requiring users to provide simple-to-specify attribute queries for each potential target format**, as opposed to complicated loop nests for every combination of source and target formats.

### Attribute query language

The attribute query language lets users compute summaries of a tensor’s sparsity structure by performing aggregations over the coordinates of the tensor’s nonzeros.

- All queries in the attribute query language take the form
  select [i~1~,...,i~m~] -> <aggr~1~> as label~1~, ..., <aggr~n~> as label~n~
- Different attribute queries computed on the same tensor [figure](######figure-1)
  ![](https://i.imgur.com/XDdsqIL.png)

### Code generation

[Kjolstad et al.](https://dl.acm.org/doi/10.5555/3314872.3314894) introduce concrete index notation, which is a language for precisely specifying tensor computations.

- An operation that computes the sum of every row in a matrix A can be expressed as ∀i∀j x~i~ += A~ij~

To generate efficient code that computes an attribute query, their technique simply reformulates the query as a sparse tensor algebra computation.
- The query is first lowered to a canonical form in concrete index notation, which they extend with the ability to index into results using coordinates that are computed as arbitrary functions of index variables.
- The canonical form of the query is subsequently optimized by applying a set of predefined transformations to simplify the computation.
    - Example: select [i~1~,...,i~m~] -> id() as Q lowers to ∀~j1~ · · · ∀~jn~ Q~i1~···~im~ |= map(B~j1~···~jn~, 1)
- The optimized query in concrete index notation is compiled to imperative code by straightforwardly leveraging the techniques of [Kjolstad et al.](https://dl.acm.org/doi/10.5555/3314872.3314894) and [Chou et al.](https://dl.acm.org/doi/pdf/10.1145/3276493) as summarized above.
    - This approach works as long as query results are stored in a format, like dense arrays, that can itself be efficiently assembled without needing attribute queries.

## Sparse Tensor Assembly

They extend the coordinate hierarchy abstraction with new primitives (level functions) that describe how each level can be efficiently constructed.

### Assembly Abstraction

- They assume that coordinate hierarchies can be constructed level by level from top to bottom.
- The assembly of each level is decomposed into two logical phases: edge insertion and coordinate insertion.
    - Edge insertion (optional) models the assembly of data structures that map nonzeros to their containing subtensors.
        - sequenced: each position (node) in the preceding parent level **can** be iterated in sequence.
          ![](https://i.imgur.com/6diFx9R.png)
        - unsequenced: each position (node) in the preceding parent level **cannot** be iterated in sequence.
          ![](https://i.imgur.com/2bgCDSo.png)
    - Coordinate insertion logically iterates over the input tensor’s nonzeros and inserts their coordinates into a coordinate hierarchy.
        - It models the assembly of data structures that store the coordinates and values of the nonzeros.
- They define five level functions
  ![](https://i.imgur.com/V0btn23.png)
- Example
  ![](https://i.imgur.com/liEgSfl.png)
    - left: level functions invoked to perform unsequenced edge insertion
    - right: level functions invoked to perform coordinate insertion

### Code Generation

- To minimize memory traffic at runtime, their technique emits code that, wherever possible, fuses the assembly of adjacent levels in the result coordinate hierarchy.
- Adjacent levels can be assembled together as long as only the parent level requires a separate edge insertion phase (or if none do).

### Implementation example

![](https://i.imgur.com/Zn53Xvg.png)

- top: squeezed, which stores the dimension of offsets in remapped DIA tensors.
- middle: compressed, which stores the column dimension of CSR tensors as well as the row dimension of COO tensors.
- bottom: banded, which stores the column dimension of tensors in the skyline format, which stores all components between the first nonzero and the diagonal in every row.

## Evaluation

![](https://i.imgur.com/otMWvlX.png)

- Their technique emits code to perform COO to CSR conversion that is **20.4×** faster on average.
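As a recap of the whole pipeline, the three phases specialize for COO to CSR roughly as follows: the coordinate remapping is the identity, the analysis phase is the per-row histogram query, and the assembly phase appends each nonzero to its row's segment. This is a hand-written sketch of what such a routine computes, with illustrative names and layout, not the paper's generated code.

```c
#include <stdlib.h>
#include <string.h>

/* Convert an (unsorted) COO matrix to CSR in one-shot-allocation style.
   pos must have nrows+1 entries; crd and out must have nnz entries. */
void coo_to_csr(int nnz, int nrows,
                const int *row, const int *col, const double *vals,
                int *pos, int *crd, double *out) {
    /* Analysis: histogram nonzeros per row, then prefix-sum into pos. */
    memset(pos, 0, (size_t)(nrows + 1) * sizeof(int));
    for (int p = 0; p < nnz; p++) pos[row[p] + 1]++;
    for (int i = 0; i < nrows; i++) pos[i + 1] += pos[i];
    /* Assembly: scatter each nonzero into its row's segment, tracking
       the next free slot of each row in a temporary offsets array. */
    int *next = malloc((size_t)nrows * sizeof(int));
    memcpy(next, pos, (size_t)nrows * sizeof(int));
    for (int p = 0; p < nnz; p++) {
        int q = next[row[p]]++;
        crd[q] = col[p];
        out[q] = vals[p];
    }
    free(next);
}
```

Because the analysis pass sizes pos exactly before assembly runs, no output array is ever resized, which is the point of computing attribute queries up front.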