Discuss paper submission.
What standardization discussion points do we want to include in the paper? Why was it designed the way it was?
-80/20 rule: what are the most common formats? Let's make it really easy to implement parsers for those.
-Reference implementation and an easy spec to make implementation simple
-Implement lots of parsers really easily
-JSON metadata and 1D vectors to make the spec apply across binary containers and in-memory formats
-Embeddable
-Extensible
-Datatype spec
-iso spec
-!!We could have just zipped the thing, but we wanted to make the format immediately consumable by numeric libraries
Could we add a big table of all the supported formats?
-In memory: pydata/sparse, scipy, pytorch
-HDF5: C, C++, Taco, Finch, scipy, pydata/sparse
-NPZ: scipy, Finch, pydata/sparse
-zarr: scipy, …
Could we add matlab? Could we add graphblas? Could we add sparseblas?
Add table with implementations, line count, etc.
Add Tensor read write times, and tensor file sizes.
Discuss unsortedness, etc. Consensus: fine to add an unsorted flag in the custom formats
-don't want people to expect sorted and get an error
-Only add the flag to the complex custom formats
PRs are submitted to SciPy, PyData Sparse, Finch, etc. Just waiting on meetings to discuss and get them merged.
Top of the priority list for in-memory formats: we really want to make sure that these attributes are normalized to reduce complexity. (No windows of windows.)
Willow: we have a few other priorities:
Erik has pushed a PR that can generate diagrams visualizing Binsparse formats, both traditional and custom.
Very useful for those trying to understand how custom formats work. We could add links/diagrams in the README, or even potentially as notes in the spec.
Hameer has made progress on PRs for using Binsparse as an in-memory format in Python. There are PRs for both array API and SciPy. They will be meeting soon to discuss if there are any issues that need to be resolved upstream.
Discussed an issue with the C reference HDF5 parser that has been pointed out by users: builds fail under certain compilers (particularly C++ in some Clang builds) if they don't support complex number literals, which are a compiler extension. Will try to find a workaround.
In-Memory format Update
Sparse BLAS Workshop
In-memory format update: Hameer has submitted PRs adding the new Binsparse in-memory format to PyData Sparse and SciPy. They are interested and on board.
Ben was at the Sparse BLAS workshop last week. It was a productive meeting, and the API and reference implementation are moving along. We are likely to discuss file IO in the near future, and I will propose adopting Binsparse as a supported file format.
Keita from LLNL is working to adapt the Binsparse C++ parser implementation to also be able to use Metall, which allows memory-mapped file IO via C++ allocators. This could potentially be a good basis for distributed reading (and potentially distributed writing, although might require a tile-based abstraction on top of Binsparse as well).
Discussed the reorganization keys. Should it be possible to store all of the indices in a single array of size number_of_dimensions x number_of_stored_values? We could support this using an is_contiguous (or similar) key that specifies whether multi-rank sparse dimensions should be stored contiguously. Recall that custom formats can declare a coordinate format with a sparse level of rank one or greater:
# Custom format equivalent to `COOR`
"custom": {
"transpose": [0, 1],
"level": {
"level_desc": "sparse",
"rank": 2,
"level": {
"level_desc": "element",
}
}
}
We could declare that the row and column dimensions shall be stored contiguously on disk with a new is_contiguous key.
# Custom format equivalent to `COOR`, but
# row and column indices are stored contiguously
# in one big array.
"custom": {
"transpose": [0, 1],
"level": {
"level_desc": "sparse",
"rank": 2,
"is_contiguous": true,
"level": {
"level_desc": "element",
}
}
}
There would then be only one indices array, with the first number_of_stored_values values holding the row indices, and the next number_of_stored_values values holding the column indices.
# Here we have a single indices "tensor," stored
# as a single 1D binary array.
"custom": {
"data_types": {
"indices_0_1": "int64",
"values": "float32"
}
}
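To make the layout concrete, here is a sketch (illustrative Python, with hypothetical helper names) of how a reader might split the single contiguous indices array back into per-dimension arrays, assuming the layout described above:

```python
def split_contiguous_indices(indices, rank, nnz):
    """Split one contiguous indices array into per-dimension arrays.

    Assumed layout from the proposal above: the first nnz entries are
    dimension 0's indices, the next nnz entries are dimension 1's, etc.
    """
    assert len(indices) == rank * nnz
    return [indices[d * nnz:(d + 1) * nnz] for d in range(rank)]

# COO matrix with 3 stored values: rows [0, 1, 2], cols [4, 5, 6]
rowind, colind = split_contiguous_indices([0, 1, 2, 4, 5, 6], rank=2, nnz=3)
# rowind == [0, 1, 2], colind == [4, 5, 6]
```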
There are a few open questions:
-What should we call the combined indices array? One option would be to name it the same as the first of the two dimensions that are being squashed. Another would be to give it a combined name like the indices_0_1 above.
-Should the combined array get a new, different name, e.g. contiguous_indices? This would help prevent parsers not expecting contiguous indices from mis-parsing a file with contiguous index arrays.
Discussed the reorganization key and other changes to support in-memory use cases. Proposed operations:
-Slices(start, stop, step)
-Broadcast(pos, len)
-CombineDims(dim_list, pos)
-SplitDim(pos, len_list)
-Transpose(new_order)
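To make the proposed operations concrete, here is a rough sketch of how each one could transform a tensor's shape. The function names mirror the list above, but the exact semantics are assumptions, not settled spec:

```python
import math

def transpose(shape, new_order):
    # Permute the dimensions into new_order.
    return [shape[i] for i in new_order]

def split_dim(shape, pos, len_list):
    # Replace the dimension at pos with several dimensions of the given lengths.
    assert shape[pos] == math.prod(len_list)
    return shape[:pos] + len_list + shape[pos + 1:]

def combine_dims(shape, dim_list, pos):
    # Merge the listed dimensions into one dimension inserted at pos.
    combined = math.prod(shape[d] for d in dim_list)
    rest = [s for i, s in enumerate(shape) if i not in dim_list]
    return rest[:pos] + [combined] + rest[pos:]

def broadcast(shape, pos, length):
    # Insert a broadcast (stride-0) dimension of the given length at pos.
    return shape[:pos] + [length] + shape[pos:]

shape = [4, 6]
shape = split_dim(shape, 1, [2, 3])      # [4, 2, 3]
shape = transpose(shape, [2, 0, 1])      # [3, 4, 2]
shape = combine_dims(shape, [1, 2], 0)   # [8, 3]
```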
Willow: Broadcast is different from the others in that it represents a lazy operation, similar to + and -.
What is the primary use case? Is it that we have two libraries, both with lazy broadcast representations, and we want to move data between them with Binsparse?
What is the purpose of transpose?
For matrices, it lets both CSR and CSC be described as dense | sparse, with rows and columns swapped in their transpose key. For tensors, it becomes useful to treat your dimension ordering as somewhat arbitrary.
Willow: should we have a canonical order of reorganization keys? e.g., split, transpose, combine? Or should split, transpose, and combine be allowed in any ordering? broadcast might interrupt this.
Discussed next steps for the reorganization key.
Resubmit to ICPP
TODOs for the paper:
-Discuss paper submission, potential extensions
-Discuss plans for adding a reorganization key to support various kinds of in-memory views
-Discuss adding support for 1-based indexing
-Discuss adding support for CSR4
Rebuttal is submitted. ACM ICS is a possibility for resubmission.
There are two main avenues for enhancing the paper:
We discussed distributed parsing.
For reading, the primary case is likely reading a normal Binsparse file from a distributed file system (e.g. one downloaded from SuiteSparse Matrix Collection) in a parallel fashion.
This likely will involve one of two techniques:
For writing, things become more complicated. If we want to write things as quickly as possible, we could dump a series of tiles as separate Binsparse files. We would then need a mechanism to read these distributed dumps, which would probably be a collection of Binsparse matrices, either stored as different datasets or files.
However, if we want to store them in a single file, things become more complex. We would likely require an all-to-all operation to compute correct offsets or shift data to the desired processor. Distributed writing itself is also complicated. HDF5 has some support through its MPI interface, but we would likely want something more expressive.
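The offset computation for the single-file case amounts to an exclusive prefix sum over per-rank counts of stored values. A minimal sketch, independent of any particular MPI binding (names are illustrative):

```python
def exclusive_scan(counts):
    # Each rank's write offset is the sum of all earlier ranks' counts.
    # In a distributed setting this is what an all-to-all / allgather of
    # counts followed by a local scan (or MPI_Exscan) would compute.
    offsets, total = [], 0
    for c in counts:
        offsets.append(total)
        total += c
    return offsets, total

# Four ranks holding 10, 0, 7, and 3 stored values.
offsets, total = exclusive_scan([10, 0, 7, 3])
# offsets == [0, 10, 10, 17], total == 20
```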
In-memory formats require the ability to store views. The use-case is that we have a view in, say, PyTorch, that transposes or reshapes a matrix. We then want to pass this through to another framework like SciPy, Julia, etc.
The way to support this is by adding a collection of reorganization keys that provide a view of the matrix in some way. This information could be ignored by a parser, which would still have a valid matrix (the original matrix before the view was created).
There are a variety of possible keys: strides, reshape, transpose, index maps, and potentially more. Deciding the correct keys to add will likely depend on the use cases and implementation details.
A "reorganized" matrix potentially has less information than a traditional Binsparse matrix. For example, we may not know the number of stored values in the matrix. (Take for example the case of a strided view over a CSR's rows. We would need to add up the number of nonzeros in each row to get the total number of stored values in the view, which is extra computation that's not needed unless we actually need that value.) Index maps also might require some extra work to determine the shape of the reorganized view (although this could probably be figured out by a computer algebra system).
-Contributions were not fully articulated
-Did not fully support contributions we made
-New contributions
(Libraries with a (1) have 1-indexed offsets and indices, while the others are 0-indexed.)
Most of these libraries have support for dense views, as well as things like broadcasting, which are all based on strides and offsets.
For sparse matrices, things are a little sparser:
Two things to consider adding to the spec:
-A byte data type, which would allow users to store a series of bytes (e.g. byte[64] for an array where each element is a collection of 64 bytes). This would allow a user to store an array of structs with something like byte[n], where n is the size of the struct. The user would, however, have to worry about endianness themselves when switching between platforms, so this is not a truly portable solution for user-defined data types.
Discussed GraphBLAS bindings
Discussed potential MKL bindings
Discussed Python community RFC for in-memory interchange format, potentially using Binsparse [link]
Discussed performance benchmark results:
-Benchmarks compare fast_matrix_market to the reference Binsparse HDF5 implementation, using COO and CSR for reading/writing.
-Cold benchmarks call sync to flush writes to disk after writing. Warm benchmarks read the file once to warm up the filesystem cache (meaning the file will be cached in memory) before reading, and do not call sync after writing, meaning the file may have only been written to memory.
-fast_matrix_market can do multi-threaded reading as well.
Willow: some PyData collaborators are considering using Binsparse + DLPack as an in-memory interchange format for sparse data between SciPy.Sparse, cuSPARSE, etc. They may join in the future, and would likely have additional requests/requirements for Binsparse, since in-memory formats may have additional considerations.
No quorum, short meeting.
What format to use for storing files for distributed/parallel reading/writing?
Ben: I have made progress on C parser, and have converted the entire SuiteSparse Matrix Collection to Binsparse.
COO Binsparse with HDF5's native GZip -9 compression is around 122 GB (compared to 155 GB for GZ/Tarball'd Matrix Market).
Parsing on a single thread is significantly faster than fast_matrix_market. With multiple threads, fast_matrix_market does have slightly better performance than a single thread of Binsparse (using GZip -9, which is perhaps a bit unfair in favor of Matrix Market, as the Matrix Market file is uncompressed).
Performing benchmarks involving a file system is a bit tricky. I am currently flushing the file system cache between experiments, but there are other effects (e.g. SSD cache) that may affect experiments.
Ben hasn't yet been able to find a way to host the SuiteSparse Matrix Collection in Binsparse, but a random selection of matrices is available [https://benjaminbrock.net/random_suitesparse_matrices/].
Isaac on compression:
GZip is super slow. It's the only compression HDF5 ships with, so GZip or uncompressed is often used.
LZ4 is much faster, however, and has codecs for Nvidia GPUs.
Isaac is doing a fair amount of work with TileDB, so if there is some aspect we'd like to investigate, let him know.
Erik: going to Scientific Python summit. i.e., he will be locked in a room in Seattle for several days with various leaders of Python projects. Are there any specific groups he should try to talk to about Binsparse?
Discussed the number_of_stored_values PR.
Let's not define attributes; reverse the attributes PR.
-We should write up why we decided against attributes (possibly in the spec).
-History lesson: the Rutherford-Boeing format went overboard on attributes, and it is not widely used.
Tim: there are a couple of state-of-the-art Matrix Market parsers that are faster
Tim: Matlab would be good to compare against, but this would require CSC.
Should we change how complex numbers are dealt with?
What about storing NNZ (number of stored values)? Candidate names: number_of_stored_values, number_of_stored_entries, number_of_entries, or number_of_explicit_entries.
Should we add a key for tensor extensions?
Should we check that format is a supported string?
Discussed the bytes[n] data type.
What do we need for 1.0?
- tests and reference output
- C reference implementation
- TACO parser based on ^
- GraphBLAS parser based on ^
- Convert MatrixMarket
Yzelman: Re the file formats and parallel I/O: there's also some recent work by Tim Griesbach called scda, not yet released. They come from the adaptive-mesh angle, but have a similar principle of storing binary arrays. No hierarchical support, but they do support variable-sized arrays, which may come into play on the sparse matrix side if we allow certain compression schemes that play with the index types (think CSB, CBICRS, etc.). Theirs is fully seekable based on header info (and targets distributed-memory I/O).
Thanks, I’ve noted it =) I/O is super important, as in practice we found it almost always forms a significant part of the overall execution, so we spent (and continue to spend) quite a lot of effort optimising it. But it's better to have a standard binary format (finally!) to optimise for, rather than everyone doing their own thing; this is a very important effort.
Discussed attributes PR. General agreement on the wording around attributes. Need some phrasing saying attributes can be ignored, but they must be correct if present. Discussed potential attributes, to list new attributes in GitHub comments in the PR.
-number_of_diagonal_elements
-contains_cycles
-symmetric, skew_symmetric, hermitian, anti-hermitian, structure_symmetric
-invertible
-positive_definite
Discussed chunking/tiling. There are essentially two ways of offering tiling, an offset-based approach, and a tile grid-based approach.
(In the tile grid-based approach, with a uniform tile size of m,n, the indices of tile i,j have the offset i*m, j*n.) Erik: This works well, but we don't necessarily want a tile grid that's this rigid. Oftentimes sparse matrices have clusters of values in a small area, so we want to support tiles that are not necessarily uniform in size.
(In the offset-based approach, each tile stores an offset i,j that maps its indices into a larger sparse matrix.) These tiles could possibly overlap, which increases complexity. Without an explicit tile grid, a user would likely need to iterate through all tiles, check for overlap with the submatrix that they want to compute, and then add corresponding nonzeros from any overlapping tiles. This could be nice, as it would allow adding updates in a TileDB-like fashion without modifying the existing matrix. However, it is more complex.
Isaac: Reading from HDF5 in parallel is possible with virtual datasets, and there is a proposal to allow parallel IO in Zarr.
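A minimal sketch of the uniform tile-grid index mapping discussed above (the tile size m, n and function name are illustrative):

```python
def tile_to_global(ti, tj, local_i, local_j, m, n):
    # With a uniform tile grid of tile size (m, n), the indices stored in
    # tile (ti, tj) carry the implicit offset (ti*m, tj*n), so a tile-local
    # index maps to the global matrix by adding that offset.
    return ti * m + local_i, tj * n + local_j

# Local entry (3, 4) of tile (1, 2) in a grid of 10x10 tiles.
assert tile_to_global(1, 2, 3, 4, m=10, n=10) == (13, 24)
```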
Willow: Do we actually need support for distributed to be in binsparse? We're getting into a lot of details that seem implementation-specific; could we just let the implementation decide how they use binsparse on their own?
Ben: We do need an official standard of some kind if libraries are going to move data back and forth. This standard doesn't necessarily have to be in binsparse, although I think that would be convenient. It is a good idea to get more implementation experience before we set anything in stone, though.
Ben: Should we add an NNZ property to the binsparse JSON? It is not currently part of the specification, with number_of_elements implicitly referred to.
Willow: What exactly would the NNZ be defined as? In the ISO case does it store 1? What about symmetric?
We then had a long discussion about the possible meanings of NNZ. Albert-Jan Yzelman pointed out that for a symmetric reader, it can be much more efficient if the NNZ refers to the actual number of positions in the matrix with stored values.
Use case: when reading a Matrix Market file, "number of nonzeros" refers to the number of values stored in the Matrix Market file. In the symmetric case, this is not typically the same as the number of positions in the matrix with an entry, due to a single stored value in the triangle being mirrored across the diagonal. When unpacking a symmetric matrix into a non-symmetric storage format, as is commonly done in many sparse matrix libraries, this means you do not know the amount of storage needed until you have read through the matrix to determine the number of elements in the diagonal. The amount of storage needed is number_of_diagonal_elements + 2*number_of_triangle_elements.
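A sketch of that storage computation for a symmetric lower triangle stored as COO (illustrative only):

```python
def unpacked_storage(rowind, colind):
    # Stored entries on the diagonal appear once in the unpacked matrix;
    # off-diagonal (triangle) entries are mirrored across the diagonal,
    # so they appear twice: diag + 2 * triangle.
    diag = sum(1 for i, j in zip(rowind, colind) if i == j)
    triangle = len(rowind) - diag
    return diag + 2 * triangle

# Lower triangle of a 3x3 symmetric matrix: (0,0), (1,0), (2,1), (2,2)
assert unpacked_storage([0, 1, 2, 2], [0, 0, 1, 2]) == 6  # 2 diag + 2*2 triangle
```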
Ben: This seems like an important use case we should support. However, it does seem non-intuitive to me that the number of nonzeros listed in the JSON would sometimes not correspond to the number of stored values in binary.
What exactly do we mean by nonzeros? Well, we mean explicitly stored values, which could be zero. We then have a few different possible definitions:
What libraries have support for symmetric matrix types?
Isaac: Python libraries generally don't have support; Scipy.Sparse does not.
Julia does, but segfaults if you ask for the number of nonzeros.
Tadd: In PyTorch, there's no way to get the NNZ at all from a sparse matrix.
Willow: Is it too difficult and ambiguous a question to put NNZ as a property in JSON?
What does SuiteSparse Matrix Collection do?
"Pattern entries" is the information that Albert would like here.
Ben: We previously discussed "attributes," which are optional information about the matrix that could be used for optimization. For example, an attribute could indicate that a matrix happens to be symmetric, even when the storage format is not symmetric. An attribute could also indicate that a matrix represents an undirected graph, a graph with no cycles, etc. We could provide number_of_triangle_elements and number_of_diagonal_elements as attributes.
Albert: This is sub-optimal, since the parser still has to write the inefficient case. It would be best if all the information to do optimal parsing were always there.
This would help, but would probably also be confusing. It does also require parsers that don't take advantage of this information to compute it if they don't already have it available.
Consensus:
Introduce attributes as a concept. Define number_of_triangle_elements and number_of_diagonal_elements as optional attributes.
Provide a note in the specification saying that a quality implementation would be expected to provide the number_of_triangle_elements attribute for a symmetric format, since this allows important optimizations.
Number of nonzeros property? No consensus yet. It would seem like there are three possible options:
1. The number of stored values (len(colind) for CSR; what's currently in Matrix Market).
2. The number of positions in the matrix with an entry (2*(len(colind) - number_of_diagonal_elements) + number_of_diagonal_elements for a symmetric format; the quantity Albert would like, and "pattern entries" in the SuiteSparse web interface).
Ben's opinion: 1. and 2. are the reasonable options, with 1. being the most intuitive.
Spent most of the discussion talking about testing:
-Should we h5diff the results? h5diff might be too strict. (e.g., integer becomes int16 vs int32 vs uint64; all should be valid.)

struct array_t {
  void* values = nullptr;
  size_t size;
  type_t type;
};

// Option 1: a flat matrix struct with named arrays.
struct matrix_t {
  array_t values;
  array_t pointers_to_1;
  array_t indices_0;
  size_t nrows = 0;
  size_t ncols = 0;
  size_t nnz = 0;
  format_t format;
};

enum format_t {
  COO = 0,
  CSR = 1,
  CSC = 2
  /* ... */
};

// Option 2: a tagged format plus an opaque, per-format payload.
struct matrix_t {
  format_t format;
  void* data;
};

struct coo_t {
  array_t rowind;
  array_t colind;
  array_t values;
};

auto m = read_matrix("my_file.bsp");
if (m.format == COO) {
  coo_t* coo_m = (coo_t*) m.data;
  assert(coo_m->rowind.type == INT_T);
  assert(coo_m->colind.type == INT_T);
  assert(coo_m->values.type == FLOAT32_T);
  int* rowind = (int*) coo_m->rowind.values;
  /* ... */
  if (coo_m->values.type == FLOAT32_T) {
    float* values = (float*) coo_m->values.values;
  } else if (coo_m->values.type == FLOAT64_T) {
    double* values = (double*) coo_m->values.values;
  }
}
Property graphs
To handle property graphs, we can add a special "element" type to the last level
To handle user-defined types, we will rely on the binary container's support for user-defined types
However, we can also easily allow users to dump an array of structs to disk by creating a byte[n] type, which would store n bytes.
Consensus: add a feature for storing multiple values of the same type for each value. uint8[12] would mean each value is an array of 12 uint8 values; likewise float32[2], float64[2], etc.
Consensus: add a feature for struct of arrays.
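A sketch of how a fixed-count type like uint8[12] could lay out values in the flat binary array (illustrative; the slicing convention is an assumption):

```python
def element(values, i, count):
    # With a dtype like uint8[12], the i'th logical value occupies the
    # flat slice [i*count, (i+1)*count) of the values array.
    return values[i * count:(i + 1) * count]

flat = list(range(24))  # two logical values, 12 components each
assert element(flat, 1, 12) == list(range(12, 24))
```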
generate_reference.jl can pull any Matrix Market file and convert it to one of the Binsparse formats.
We have two main items for the agenda:
-Discussed adding a set of (small) matrices to the repository
-Discussed chunking
We have two main items for the agenda:
The SuiteSparse Matrix Collection contains 1) matrices (.mtx files) and 2) text files, which can have arbitrary contents. Generally the text files can be viewed as lists of strings, where each line of the file contains a string, with the line number corresponding to some nonzero. The semantic meaning is dataset-specific, however.
-Store each .mtx file as a Binsparse matrix inside an HDF5 group with the same name as its filename (sans the .mtx).
-Store each .txt file as a string dataset inside the root group.
-How to handle pattern matrices? (As an ISO-valued true bint8?)
-user[user_typename], where user_typename is a user-defined label.
is a user-defined label.HDF5 sparse chunks RFC seems quite different from Binsparse, but likely worth talking with them
Discussed current status
Discussed converting SuiteSparse Matrix Collection
Version numbers?
How to determine which file format to use?
How to nest binsparse or use multiple binsparse in the same file?
What is the minimum viable product for MatrixMarket stuff?
Let's add a directory of more complicated reference binsparse files for formats other than COO.
Let's add a list of parsers on the binsparse homepage!
Do we need a Skew Hermitian format?
More generally, should we describe how parsers should deal with unimplemented features? Conclusion: we don't need to specify.
Where should pointer arrays live? At pointers_to_level?
Do we have everything we need to store all matrices in the SuiteSparse Matrix Collection?
Discuss plan for dealing with symmetric, hermitian, and skew-symmetric matrices.
The specific mechanism is fundamentally outside the spec, but we need to support handling this kind of metadata.
We discussed symmetry, and the PR Ben submitted adding symmetric_lower, symmetric_upper, and the same for skew-symmetric and Hermitian matrices.
Discussed the possibility of adding a "packed dense symmetric format," which would store a triangle of a dense matrix as a jagged array. Did not reach consensus on whether to add it. There was some feeling that this is unnecessary, as we could already store a symmetric dense matrix with symmetric_lower or symmetric_upper, while wasting some disk space.
Discussed the possibility of adding an informational "symmetric" flag that does not affect the on-disk storage format, but simply tells the user/reader that the matrix happens to be symmetric, even though it is not necessarily in a symmetric storage format.
Discussed how having an information flag at the same level as the "structure" key might be confusing.
Discussed the possibility of adding an "attributes" or similar key, which would contain a list of informational matrix attributes. This could include an information symmetric flag, as well as attributes like being a positive matrix or having no self-loops when interpreted as a graph.
Discuss the number of pointers and indices arrays.
Discuss plan for dealing with symmetric, hermitian, and skew-symmetric matrices.
Do we have everything we need to store all matrices in the SuiteSparse Matrix Collection?
Matrix Market supports a packed dense symmetric format (matrix array symmetric real), where the entry at (i, j) is stored at location i + j(j-1)/2.
There's not a huge need to represent actual boolean matrices that are not ISO. We will leave this out of the spec for now, although we have a decent proposal for adding bit packing (see last week's notes), e.g. bit[int8].
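A sketch of the packed triangular indexing formula, assuming a 1-based, column-major, upper-triangle reading (the exact convention is an assumption; the formula is taken as-is from the notes):

```python
def packed_index(i, j):
    # 1-based (i, j) with i <= j (upper triangle, column-major):
    # column j is preceded by 1 + 2 + ... + (j-1) = j*(j-1)/2 entries,
    # so entry (i, j) lands at location i + j*(j-1)/2.
    assert 1 <= i <= j
    return i + j * (j - 1) // 2

# Packed upper triangle of a 3x3 symmetric matrix, column by column:
# (1,1), (1,2), (2,2), (1,3), (2,3), (3,3)
assert [packed_index(i, j) for (i, j) in
        [(1, 1), (1, 2), (2, 2), (1, 3), (2, 3), (3, 3)]] == [1, 2, 3, 4, 5, 6]
```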
We will add bint8 to the spec to represent an unsigned 8-bit integer that represents a boolean value.
We will add complex[x] to represent a complex number, where x is a standard type (not complex, please). The values array shall have size n*2, where element 2*i contains the real part of the i'th value and element 2*i + 1 contains the imaginary part of the i'th value.
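A sketch of unpacking this interleaved complex layout (assuming element 2*i holds the real part and element 2*i + 1 the imaginary part of the i'th value):

```python
def unpack_complex(values):
    # values has size n*2: element 2*i is the real part of the i'th
    # value, element 2*i + 1 is its imaginary part.
    assert len(values) % 2 == 0
    return [complex(values[2 * i], values[2 * i + 1])
            for i in range(len(values) // 2)]

assert unpack_complex([1.0, 2.0, 3.0, -4.0]) == [1 + 2j, 3 - 4j]
```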
We will re-merge ISO into the value.
Discussed symmetry: it is needed. We will likely store symmetric_left and symmetric_right (or some equivalent with better naming).
Attendees: Willow Ahrens, Benjamin Brock, Jim Kitchen, Isaac Virshup, Erik Welch
General consensus that ISO-valueness being part of the sparse matrix format, as written in ISO-value PR, is the correct decision.
Question: can we support property graphs in v1, or is that completely left to v2?
Consensus: we plan to have some support through user-defined types. There are strategies for this:
Next, we discussed the types laid out in #25. Most discussion centered on how to handle boolean-valued sparse matrices represented as bit arrays. (For instance, a CSR sparse matrix with boolean values, where the values array contains uint64 values, and the i'th element of the values array is defined by the expression (values[i / 64] >> (i % 64)) & 1.)
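The bit-array expression above, as runnable code (64-bit words, as in the example):

```python
def get_bit(values, i):
    # i'th boolean value of a bit-packed values array of 64-bit words,
    # per the expression (values[i / 64] >> (i % 64)) & 1.
    return (values[i // 64] >> (i % 64)) & 1

words = [0b1011]  # bits 0, 1, and 3 set
assert [get_bit(words, i) for i in range(4)] == [1, 1, 0, 1]
```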
We discussed adding a bint8, bit[int8], or similar format.
Attendees: Willow Ahrens, Benjamin Brock, Tim Davis, Jim Kitchen, Isaac Virshup, Erik Welch
Discussed ISO values - should they be a property of the file format,
the type of the values array, or a level in a matrix?
There seems to be consensus that for v2, ISO-ness should be a property of
the level, possibly as a "is_iso" flag.
Ben feels that for v1, ISO-ness should be a property of the format OR value.
We reached consensus around Willow's proposal for v1, that ISO-ness should
be reflected in the format of the matrix.
Willow wrote a draft of a property graph, with support for different properties at the bottom level.
{
"swizzle": [1, 0],
"format": {
"level": "dense",
"rank": 1,
"subformat": {
"level": "sparse",
"rank": 1,
"subformat": {
"level": "multiplex",
"subformats": {
"weight": {
"level": "element",
"value_type": "float32"
},
"capacity": {
"level": "iso",
"value_type": "int32"
}
}
}
}
}
}
Attendees: Willow Ahrens, Tim Davis, Benjamin Brock, Jim Kitchen
Dense, Sparse, Element
-Sparse has idx + ptr
-Dense is implicit
-COO is a bit more complicated
Attendees: Erik Welch, Jim Kitchen, Ben Brock, Isaac Virshup
"1x[{dtype}]"
Attendees: Willow Ahrens, Erik Welch, Tim Davis, Jim Kitchen, Isaac Virshup
indices_0, pointers_0, etc.
Attendees: Willow Ahrens, Jim Kitchen, Benjamin Brock, Erik Welch, Isaac Virshup