
2025-03-19

Agenda

Notes

  • Erik has pushed a PR that can generate diagrams visualizing Binsparse formats, both traditional and custom.

  • Very useful for those trying to understand how custom formats work. We could add links/diagrams in the README, or even potentially as notes in the spec.

  • Hameer has made progress on PRs for using Binsparse as an in-memory format in Python. There are PRs for both array API and SciPy. They will be meeting soon to discuss if there are any issues that need to be resolved upstream.

  • Discussed an issue with the C reference HDF5 parser that has been pointed out by users: builds fail under certain compilers (particularly when compiled as C++ with some Clang builds) that don't support complex number literals, which are a compiler extension. Will try to find a workaround.

2025-03-05

Agenda

  • In-Memory format Update

  • Sparse BLAS Workshop

Notes

  • In-memory format update: Hameer has submitted PRs adding the new Binsparse in-memory format to PyData Sparse and SciPy. They are interested and on board.

    • They plan to add the specification of the Python data structure to the Array API repository. The specification of the formats and Binsparse JSON descriptor will still reside in the Binsparse repository. This seems reasonable; it's similar to the current split between the Binsparse specification and the convention for a particular binary container (HDF5, Zarr, etc.).
  • Ben was at the Sparse BLAS workshop last week. It was a productive meeting, and the API and reference implementation are moving along. We are likely to discuss file IO in the near future, and I will propose adopting Binsparse as a supported file format.

  • Keita from LLNL is working to adapt the Binsparse C++ parser implementation to also be able to use Metall, which allows memory-mapped file IO via C++ allocators. This could potentially be a good basis for distributed reading (and potentially distributed writing, although might require a tile-based abstraction on top of Binsparse as well).

2025-02-19

Agenda

  • Is there anything else we want to add in terms of reorganization keys?

Notes

  • Discussed PyTorch's COO format, which we want to support for in-memory formats.
    • PyTorch supports arbitrary-dimension COO. Instead of providing separate index arrays per dimension, the user provides a single indices tensor of shape number_of_dimensions x number_of_stored_values.
    • This tensor is internally (we think) stored in a row-major representation.
    • Different implementations may pass this back and forth, and we'd like to support this in a zero-copy way.
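
As a rough illustration (not from the discussion itself), this is what the PyTorch side looks like: torch.sparse_coo_tensor takes a single indices tensor of shape (number_of_dimensions, number_of_stored_values).

import torch

# indices has shape (ndim, nnz): one row of coordinates per dimension.
indices = torch.tensor([[0, 1, 1],
                        [2, 0, 2]])
values = torch.tensor([3.0, 4.0, 5.0])

# A 2x3 COO matrix; PyTorch keeps all coordinates in this one tensor.
a = torch.sparse_coo_tensor(indices, values, (2, 3))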

We could support this using an is_contiguous (or similar) key that specifies whether multi-rank sparse dimensions should be stored contiguously. Recall that custom formats can declare a coordinate format with a sparse level of rank one or greater:

# Custom format equivalent to `COOR`
"custom": {
  "transpose": [0, 1],
  "level": {
    "level_desc": "sparse",
    "rank": 2,
    "level": {
      "level_desc": "element",
    }
  }
}

We could declare that the row and column dimensions shall be stored contiguously on disk with a new is_contiguous key.

# Custom format equivalent to `COOR`, but
# row and column indices are stored contiguously
# in one big array.
"custom": {
  "transpose": [0, 1],
  "level": {
    "level_desc": "sparse",
    "rank": 2,
    "is_contiguous": true,
    "level": {
      "level_desc": "element",
    }
  }
}

There would then be only one indices array, with the first number_of_stored_values values holding the row indices, and the next number_of_stored_values values holding the column indices.

# Here we have a single indices "tensor," stored
# as a single 1D binary array.
"custom": {
   "data_types": {
      "indices_0_1": "int64",
      "values": "float32"
    }
}
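
If such a contiguous layout is read into memory, recovering per-dimension index arrays is a zero-copy slice. A minimal NumPy sketch (the in-memory variable names are illustrative):

import numpy as np

nnz = 3
# One contiguous indices array: row indices first, then column indices.
indices_0_1 = np.array([0, 1, 1,   2, 0, 2], dtype=np.int64)

row_indices = indices_0_1[:nnz]          # view, no copy
col_indices = indices_0_1[nnz:2 * nnz]   # view, no copy

# Equivalently, a (2, nnz) view matching PyTorch's indices tensor.
indices = indices_0_1.reshape(2, nnz)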

There are a few open questions:

  • What should we name the indices array? One option would be to name it the same as the first of the two dimensions that are being squashed. Another would be to give it a combined name like above.
    • What if the dimensions are transposed? Do we also transpose the ordering if we list both dimensions in the index array name?
  • Should we give indices a new, different name? e.g. contiguous_indices? This would help prevent parsers not expecting contiguous indices from mis-parsing a file with contiguous index arrays.

2025-02-05

Agenda

  • Discuss possible reorganization key and other changes to support in-memory use cases.

Notes

  • Hameer proposed five transformations we may want to consider (dense NumPy analogues are sketched at the end of these notes):
    • Slices(start, stop, step)
    • Broadcast(pos, len)
    • CombineDims(dim_list, pos)
    • SplitDim(pos, len_list)
    • Transpose(new_order)
  • Willow: Broadcast is different from the others in that it represents a lazy operation similar to + and -.

    • We've chosen not to include other lazy operations, so why broadcast?
  • What is the primary use case? Is it that we have two libraries, both with lazy broadcast representations, and we want to move data between them with Binsparse?

    • Yes, that's the idea.
    • e.g., NumPy has lazy broadcasting (which is read-only).
  • What is the purpose of transpose?

    • It's for re-ordering dimensions of topologically equivalent custom formats (e.g. CSR and CSC are both dense | sparse, but have rows and columns swapped in their transpose key). For tensors, it becomes useful to treat your dimension ordering as somewhat arbitrary.
  • Willow: should we have a canonical order of reorganization keys? e.g., split, transpose, combine?

    • This would ensure that it's straightforward to check for equivalence between two views.
    • It's possible (perhaps likely) that a canonical order can represent all possible applications of split, transpose, and combine in any ordering.
    • Adding some operations like broadcast might interrupt this.
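
To make the semantics of the five proposed transformations above concrete, here are dense NumPy analogues (an illustration only; not proposed spec wording):

import numpy as np

a = np.arange(24).reshape(4, 6)

sliced     = a[1:4:2, :]                    # Slices(start=1, stop=4, step=2) on dim 0
broadcast  = np.broadcast_to(a, (3, 4, 6))  # Broadcast(pos=0, len=3); read-only view
combined   = a.reshape(24)                  # CombineDims(dim_list=[0, 1], pos=0)
split      = a.reshape(2, 2, 6)             # SplitDim(pos=0, len_list=[2, 2])
transposed = a.transpose(1, 0)              # Transpose(new_order=[1, 0])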

2025-01-07

Agenda

  • Discuss paper resubmission.
  • Discuss possible reorganization key.

Notes

  • Resubmit to ICPP

  • TODOs for the paper:

    • Incorporate rebuttal material into motivation / introduction
    • Make standard available
    • Add benchmark for tensors?
    • Add MKL sample using Binsparse?

2024-12-09

Agenda

  • Discuss paper submission, potential extensions

  • Discuss plans for adding reorganizations key to support various kinds of in-memory views.

  • Discuss adding support for 1-based indexing.

    • Comment from Tim Davis: "MATLAB makes it look like 1-based to the user but all its sparse matrices are zero based internally. They subtract 1 when going from a MATLAB expression in the m file, to the internal C/C++ which is all zero based."
  • Discuss adding support for CSR4.

Notes

  • Rebuttal is submitted. ACM ICS is a possibility for resubmission.

  • There are two main avenues for enhancing the paper:

    • Add discussion/experiments for sparse tensor parsing (e.g. with Finch/TACO)
    • Implement a distributed parser.
  • We discussed distributed parsing.

  • For reading, the primary case is likely reading a normal Binsparse file from a distributed file system (e.g. one downloaded from SuiteSparse Matrix Collection) in a parallel fashion.

    • This likely will involve one of two techniques:

      • One: Each process reads in 1/p of the data without regard to locality. You then use an all-to-all to copy data to the correct process based on the desired partitioning. This is what CombBLAS does (for both Matrix Market and its custom binary format, as far as I understand.)
      • Two: Each process reads in a subset of the data in the matrix: only the data that it needs based on the desired partitioning. (This works best with a format like CSR that allows one to efficiently examine only a subset of rows; see the sketch at the end of these notes.) The fact that the Binsparse file is stored on a distributed filesystem means that Lustre (or whatever distributed filesystem is used) will provide parallelism.
    • For writing, things become more complicated. If we want to write things as quickly as possible, we could dump a series of tiles as separate Binsparse files. We would then need a mechanism to read these distributed dumps, which would probably be a collection of Binsparse matrices, either stored as different datasets or files.

    • However, if we want to store them in a single file, things become more complex. We would likely require an all-to-all operation to compute correct offsets or shift data to the desired processor. Distributed writing itself is also complicated. HDF5 has some support through its MPI interface, but we would likely want something more expressive.

  • In-memory formats require the ability to store views. The use-case is that we have a view in, say, PyTorch, that transposes or reshapes a matrix. We then want to pass this through to another framework like SciPy, Julia, etc.

  • The way to support this is by adding a collection of reorganization keys that provide a view of the matrix in some way. This information could be ignored by the parser, and it would still have a valid matrix (the original matrix before the view was created).

  • There are a variety of possible keys: strides, reshape, transpose, index maps, and potentially more. Deciding the correct keys to add will likely depend on the use cases and implementation details.

  • A "reorganized" matrix potentially has less information than a traditional Binsparse matrix. For example, we may not know the number of stored values in the matrix. (Take for example the case of a strided view over a CSR's rows. We would need to add up the number of nonzeros in each row to get the total number of stored values in the view, which is extra computation that's not needed unless we actually need that value.) Index maps also might require some extra work to determine the shape of the reorganized view (although this could probably be figured out by a computer algebra system).

  • Quick note about how I did this in BCL. In BCL, I had a custom binary format that supported distributed reading. Essentially, I store the matrix as a CSR, with its binary values, rowptr, and colind arrays written to disk. When reading, I memory map the CSR into memory on each process. I then use a "slice" method to have each process extract the submatrix corresponding to its tile(s) of the matrix. Since the file is stored on a Lustre or other distributed filesystem, reading is efficient as long as different processes read different parts of the file. With this method, each process reads only the rows corresponding to its tile, fast-forwarding through any columns it doesn't need. This was very fast in practice.
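
A minimal sketch of the row-slice style of reading described above (technique two), assuming a binsparse-style CSR matrix stored in an HDF5 group with pointers_to_1, indices_1, and values datasets (dataset names assumed); each process reads only the pointer range and payload for its rows:

import h5py

def read_row_block(path, group, row_start, row_end):
    """Read rows [row_start, row_end) of a CSR matrix stored in HDF5."""
    with h5py.File(path, "r") as f:
        g = f[group]
        # Pointer slice for the requested rows (one extra entry closes the last row).
        ptr = g["pointers_to_1"][row_start:row_end + 1]
        lo, hi = int(ptr[0]), int(ptr[-1])
        # Only the nonzeros belonging to these rows are touched on disk.
        colind = g["indices_1"][lo:hi]
        values = g["values"][lo:hi]
        # Rebase the pointers so they index into the local slices.
        return ptr - lo, colind, values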

2024-12-05

Agenda

  • emergency rebuttal meeting

Notes

  • Contributions were not fully articulated

    • Skipped standardization effort
    • Skipped non-proprietary
  • Did not fully support contributions we made

    • Need an example application where performance matters
      • If the task you need to do is small compared to file read time
    • Need example applications where we store to file in a hot loop
      • Benchmarking code may call different programs in different languages, filesystem is easiest
      • Checkpointing code may store results to disk frequently
      • Unit tests, every time will read and possibly write reference input
      • Visualization sometimes requires storing to disc
      • Here are 5 references to systems that thought binary IO was worth it (and complain about the text format)
    • Need to clarify that it's obvious we can move data between platforms
    • Need to show that existing frameworks don't store sparse
      • HDF5 and protobuf are both binary containers that provide cross-platform capabilities, neither of which defines how sparse matrices should be stored in them. One person's CSR in HDF5 is not compatible with another person's CSR in HDF5. The common standard is binsparse.
      • Even if a binary container, like Arrow, were to support sparse matrices, other containers like HDF5 wouldn't necessarily recognize them, so there's no standard way to convert the metadata. Binsparse is the missing standard.
  • New contributions

    • Binsparse is the first standard format for sparse matrix metadata in binary containers.
    • Binsparse is faster than any existing formats.

2024-11-25

Agenda

  • Discuss strides
  • Discuss responsibilities of Binsparse as a cross-platform in-memory format

Notes

  • As an in-memory format, which libraries/frameworks would we want to move data between?
    • Matlab (1)
    • Julia (1)
    • Fortran (1)
    • Scipy.Sparse
    • PyTorch
    • Sparse BLAS

(Libraries with a (1) have 1-indexed offsets and indices, while the others are 0-indexed.)

  • Moving data between libraries may require conversion between 0-based and 1-based indexing. (e.g. when moving a Scipy.Sparse matrix to Julia, which does not support 0-based indexing.)
  • We could still support all of these libraries in Binsparse by adding 1-based indexing as an optional key. This would at least allow 1-based libraries and frameworks to use Binsparse with their native formats in memory. Whether other libraries support 1-based indices, and whether conversion between indexing schemes is required, would be up to the parser implementation.
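
For example, handing a 0-based SciPy CSR matrix to a 1-based consumer is just an offset applied to the index arrays; a minimal sketch:

import scipy.sparse as sp

a = sp.random(4, 5, density=0.3, format="csr")

# 0-based (SciPy-native) arrays.
indptr_0, indices_0 = a.indptr, a.indices

# 1-based copies for a consumer such as Julia or a Fortran-style Sparse BLAS.
indptr_1 = indptr_0 + 1
indices_1 = indices_0 + 1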

Most of these libraries have support for dense views, as well as things like broadcasting, which are all based on strides and offsets.

For sparse matrices, things are a little sparser:

  • Julia supports windowing a sparse matrix, but it does this by storing just the window offsets and dimensions.
  • Scipy.Sparse doesn't support views; you can retrieve a submatrix, but this will result in a copy.
  • Finch has support for views.
  • MKL supports CSR4, which allows creating a window of a CSR by creating two new offset arrays that point to the beginning and end of each row, respectively. It's unclear if there are high level libraries using this or if there are just some codes doing fancy pointer arithmetic.
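
A sketch of the CSR4 idea as described above: keep the usual column-index and value arrays, but use separate begin/end pointer arrays per row so that a row window is just a slice of those two arrays (array names are illustrative):

import numpy as np

# Standard CSR for a 4-row matrix.
indptr = np.array([0, 2, 3, 5, 7])
colind = np.array([0, 3, 1, 0, 2, 1, 3])
values = np.array([1., 2., 3., 4., 5., 6., 7.])

# CSR4: explicit begin and end pointers for each row.
row_start, row_end = indptr[:-1], indptr[1:]

# A window over rows 1..2 is a view of the two pointer arrays;
# colind and values are left untouched.
w_start, w_end = row_start[1:3], row_end[1:3]
row1_cols = colind[w_start[0]:w_end[0]]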

Things to consider adding to the spec:

  • Key for index and offset base
  • CSR4 support.
  • Is there something more general we can do with views? More research into current state-of-the-art required.

2024-10-28

Agenda

  • Discuss proposal for user-defined types

2024-10-14

Agenda

  • Call for opens: paper is submitted, what should be our focus next?
    • In-memory interchange?
    • New features?
      • User-defined types?

Notes

User-Defined Data Types

  • How should we support user-defined data types?
  • In the past we've discussed two possibilities:
    • Introducing a variable-sized byte data type, which would allow users to store a series of bytes (e.g. byte[64] for an array where each element is a collection of 64 bytes). This would allow a user to store an array of structs with something like byte[n], where n is the size of the struct. The user would, however, have to worry about endianness themselves when switching between platforms, so this is not a truly portable solution for user-defined data types.
    • True support would likely require users to provide the details of their user-defined types using a declarative syntax. This could be provided as a JSON structure, thus allowing implementations to choose how they want to deal with the user-defined types. (e.g., there would be a direct correspondence between a Binsparse user-defined type's JSON and an HDF5 user-defined data type. An HDF5 parser implementation would handle the HDF5 bits, such as creating an HDF5 type based on a user-defined data type or resolving the custom data type to an actual C struct, if necessary.)
    • Libraries that we should look into that handle user-defined data types in C include HDF5 as well as cffi.
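
Purely as a hypothetical illustration of such a declarative description (none of these keys exist in the spec), written as a Python dict mirroring the JSON:

# Hypothetical user-defined type descriptor; all key names are invented for illustration.
user_defined_type = {
    "name": "weighted_edge",
    "fields": [
        {"name": "weight",   "type": "float64"},
        {"name": "capacity", "type": "int32"},
    ],
}
# An HDF5 parser implementation could map this to an HDF5 compound type;
# other containers would map it to their own user-defined type facilities.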

2024-09-30

Agenda

  • Discuss paper
  • Discuss comments on using Binsparse for in-memory interchange

Notes

  • Discussed using Binsparse as an in-memory exchange format
    • JSON is a potential issue, since parsing may have significant overheads
    • We can specify a C struct, which is the way dlpack works.
    • It would be nice, however, if we could use a binary spec. like BSON that takes care of this for us.
    • Hameer: Python dict might work as an initial implementation.

2024-09-16

Agenda

  • Discuss paper, multi-process performance results.

Notes

2024-08-19

Agenda

  • Discuss performance results / potential paper.

Notes

  • Discussed GraphBLAS bindings

  • Discussed potential MKL bindings

  • Discussed Python community RFC for in-memory interchange format, potentially using Binsparse [link]

  • Discussed performance benchmark results.

    • Compare fast_matrix_market to reference Binsparse HDF5 implementation using COO and CSR for reading/writing.
    • All single-threaded.
    • Have both cold and warm cache benchmarks. Cold benchmarks drop the filesystem cache before reading and call sync to flush writes to disk after writing. Warm benchmarks read in the file once to warm up the filesystem cache (meaning the file will be cached in memory) before reading, and do not call sync after writing, meaning the file may have only been written to memory. (A sketch of the cold-cache setup is at the end of these notes.)
    • Binsparse is significantly faster almost all the time. When writing compressed, there is still a mean speedup, but Binsparse is sometimes slower. This is to be expected, as compressed writing is more costly than compressed reading.
    • Willow: should consider logarithmic x-axis where x coordinates are the Matrix Market file sizes. This would better show performance trends.
    • Ben: these experiments are all single-threaded. Would be good to have multi-threaded performance benchmarks, as fast_matrix_market can do multi-threaded reading as well.
  • Willow: some PyData collaborators are considering using Binsparse + dlpack as an in-memory interchange format for sparse data between Scipy.Sparse, cuSPARSE, etc. They may join in future, and would likely have additional requests/requirements for Binsparse, since in-memory formats may have additional considerations.
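
A rough sketch of the cold-cache setup described in the benchmark notes above (Linux-specific; dropping the page cache requires root):

import os

def drop_filesystem_cache():
    """Flush dirty pages to disk, then drop the page cache (requires root)."""
    os.sync()
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")

# Cold read: drop_filesystem_cache(), then time the read.
# Cold write: time the write, then os.sync() so flushing to disk is included.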

2024-08-05

Agenda

  • Discuss briefly performance experiments.

Notes

  • Looked at file size and (in progress) read benchmarks.
  • Binsparse has average ~2x file size reduction for COO/CSR uncompressed, ~7x file size reduction for COO/CSR compressed.
  • Reading numbers look good, although Ben is currently re-doing experiments on his local system to avoid cache effects observed in cluster. (Need root to flush filesystem cache.)
  • Isaac: Histogram of speedups would be nice.
  • Isaac: lz4 and zstandard compression are much faster and have some support through an hdf5 plugins package in Python
  • Tim: Binsparse C bindings are mostly set up for integration with SuiteSparse, but need support for custom memory allocation. e.g. by setting global function pointers for malloc/free.

2024-07-15

No quorum, short meeting.

2024-07-01

Agenda

  • Multi-threaded parsing and HDF5 vs. Zarr

Notes

What format to use for storing files for distributed/parallel reading/writing?

  • Isaac: we could use a hierarchical format, where we essentially store an array of Binsparse matrices where each matrix represents a particular tile or submatrix of the overall matrix.
    • This will work well for many sparse datasets, but for some funky graph datasets, this won't work well, since there will be load imbalance between the blocks.
  • Willow: we could use a format where the composition of multiple matrices forms the whole matrix.

2024-06-17

Agenda

Notes

  • Looked at file size benchmarks
  • Willow:
    • would be useful to see x-axis as NNZ or MTX file size instead of index, so that we can see a clearer trend
    • Box and whisker plot
    • Sort by MTX size instead of index to avoid jaggedness of plot?
    • What actually accounts for bumps?
      • ISO-ness
      • Different value types
      • Provenance of the dataset, i.e. some datasets have values that take only a few characters in ASCII but are awkward in binary floating point (e.g. "0.1" is short in ASCII but not exactly representable in binary)

2024-05-28

Agenda

  • Discuss progress toward benchmarks
    • Benchmarks implemented for reference C implementation
    • However, there are some difficulties with performing objective benchmarks involving a file system.

Notes

  • Ben: I have made progress on C parser, and have converted the entire SuiteSparse Matrix Collection to Binsparse.

  • COO Binsparse with HDF5's native GZip -9 compression is around 122 GB. (Compared to 155 GB for GZ/Tarball'd Matrix Market.)

  • Parsing on a single thread is significantly faster than fast_matrix_market. With multiple threads, fast_matrix_market does have slightly better performance than a single thread of Binsparse (which uses GZip -9 compression; this is perhaps a bit unfair in favor of Matrix Market, as the Matrix Market file is uncompressed).

  • Performing benchmarks involving a file system is a bit tricky. I am currently flushing the file system cache between experiments, but there are other effects (e.g. SSD cache) that may affect experiments.

    • Isaac: some people run experiments in a round-robin or randomized fashion, so that experiments on the same dataset are not consecutive. This can push the data out of cache to reduce these effects for a more realistic result.
  • Ben hasn't yet been able to find a way to host the SuiteSparse Matrix Collection in Binsparse, but a random selection of matrices is available [https://benjaminbrock.net/random_suitesparse_matrices/].

Isaac on compression:

  • GZip is super slow. It's the only compression HDF5 ships with, so GZip or uncompressed is often used.

  • LZ4 is much faster, however, and has codecs for Nvidia GPUs.

  • Isaac is doing a fair amount of work with TileDB, so if there is some aspect we'd like to investigate, let him know.

  • Erik: going to Scientific Python summit. i.e., he will be locked in a room in Seattle for several days with various leaders of Python projects. Are there any specific groups he should try to talk to about Binsparse?

    • Might be useful to talk to someone about ML: what would it take to make Binsparse useful for storing sparse tensors with something like PyTorch?
    • Isaac: Geospatial people keep expressing interest in a sparse format, although it is unclear exactly what they want.

2024-05-13

Agenda

  • Do we really want to define attributes in the spec? Willow's comment on #49.
  • Discuss number_of_stored_values PR.
  • Discuss progress on implementations / SuiteSparse Matrix Collection conversion.
    • SuiteSparse converted to Binsparse/HDF5. Single file, COO with chunking/GZ compression.
    • 125 GiB for SuiteSparse MC vs. 180 GiB for GZip'd tarballs of Matrix Market.
    • Some other optimizations left to play around with:
      • Shuffling (HDF5 feature that shuffles bits: helps compression, but potentially makes reading slower)
      • Other formats (CSR, CSC, etc.)
    • Still need to evaluate parsing time (need state-of-the-art Matrix Market reader to compare against)

Notes

  • Let's not define attributes: revert the PR

  • We should write up why we decided against attributes (possibly in the spec.)

    • There are essentially two kinds of "attributes:"
      • Attributes about the data that, while not necessary to parse the data, do need to be understood by the parser in order to take advantage of them. e.g. the number of triangular values, which might give you a hint about how much memory to allocate if you're reading a symmetric matrix from disk while writing it into a non-symmetric in-memory format.
      • Information about the matrix that, while useful, does not need to be known by the parser. This can be provided/examined by the user. (e.g. singular values, Bacon number, etc.) We already support this through user-provided JSON and nested Binsparse matrices in HDF5 groups.
    • We reached consensus that we do not currently have any compelling attributes that must be understood by the parser. While number of triangular values could be used, we no longer find it a compelling enough use case to justify defining attributes in the spec. (We have changed our minds in this regard.) If someone comes along with another compelling set of attributes, we will reconsider.
  • History lesson: Rutherford-Boeing format went overboard on attributes, and it is not widely used.

  • Tim: there are a couple of state-of-the-art Matrix Market parsers that are faster

  • Tim: Matlab would be good to compare against, but this would require CSC.

2024-04-29

Agenda

  • Adding a key for tensor extensions.
  • Discuss reference issues

Notes

  • Should we change how complex numbers are dealt with?

    • No, we made the right decision.
  • What about storing NNZ (number of stored values)?

    • Consensus: add key as number_of_stored_values or number_of_stored_entries or number_of_entries or number_of_explicit_entries
  • Should we add a key for tensor extensions?

    • Willow: you can just test whether format is a supported string?
  • bytes[n] data type

2024-04-18

What do we need for 1.0?
- tests and reference output
- C reference implementation
- TACO parser based on ^
- GraphBlas parser based on ^
- Convert MatrixMarket

2024-04-01

Agenda

  • SparseBLAS Comments:
    • What about an offset to the arrays for multi-processor chunking
    • What about parallelism? Blocking, seek
    • add nnz to the header if it's not there already
    • Add bytemap data:

Yzelman: Re the file formats and parallel I/O — there’s also some recent work by Tim Griesbach called scda, but not yet released. They come from the adaptive mesh angle but have a similar principle of storing binary arrays. No hierarchical support but they do have support for variable-sized arrays, which may come into play sparse matrix side if allowing certain compression schemes that play with the index types (think CSB, CBICRS, etc). Theirs is fully seekable based on header info (and targeting distributed-memory I/O)
Thanks, I’ve noted it=) I/O is super important as in practice we found it almost always forms a significant part of the overall execution, so we spent (and continue to spend) quite a lot of effort in optimising it. But better we have a standard binary format (finally!) to optimise for, rather than everyone doing their own thing— very important effort this one

Notes

Discussed attributes PR. General agreement on the wording around attributes. Need some phrasing saying attributes can be ignored, but they must be correct if present. Discussed potential attributes; new attributes are to be listed in GitHub comments on the PR.

  • number_of_diagonal_elements
  • contains_cycles
  • Symmetry - do we want a symmetric attribute? symmetric, skew_symmetric, hermitian, anti-hermitian, structure_symmetric
  • invertible
  • positive_definite

Discussed chunking/tiling. There are essentially two ways of offering tiling, an offset-based approach, and a tile grid-based approach.

  • In a tile grid-based approach, a matrix is partitioned into a series of tiles in a tile grid. Each tile is then stored as a small matrix. Given a specific set of tile coordinates, a user can access a tile in constant time. There is an implicit offset for each tile based on the tile shape and its tile coordinates. (Given a tile size m,n, the indices of tile i,j have the offset i*m,j*n.)

Erik: This works well, but we don't necessarily want a tile grid that's this rigid. Oftentimes sparse matrices have clusters of values in a small area, so we want to support tiles that are not necessarily uniform in size.

  • In an offset-based approach, a matrix consists of a series of smaller matrices, each of which has an offset i,j that maps its indices into a larger sparse matrix. These tiles could possibly overlap, which increases complexity. Without an explicit tile grid, a user would likely need to iterate through all tiles, check for overlap with the submatrix that they want to compute, and then add corresponding nonzeros from any overlapping tiles. This could be nice, as it would allow adding updates in a TileDB-like fashion without modifying the existing matrix. However, it is more complex.

Isaac: Reading from HDF5 in parallel is possible with virtual datasets, and there is a proposal to allow parallel IO in Zarr.

Willow: Do we actually need support for distributed IO to be in binsparse? We're getting into a lot of details that seem implementation-specific; could we just let implementations decide how they use binsparse on their own?

Ben: We do need an official standard of some kind if libraries are going to move data back and forth. This standard doesn't necessarily have to be in binsparse, although I think that would be convenient. It is a good idea to get more implementation experience before we set anything in stone, though.

2024-03-19

Agenda

  • Add nnz to the header if it's not there already

Notes

Ben: Should we add an NNZ property to the binsparse JSON? It is not currently part of the specification, with number_of_elements implicitly referred to.

Willow: What exactly would the NNZ be defined as? In the ISO case does it store 1? What about symmetric?

We then had a long discussion about the possible meanings of NNZ. Albert-Jan Yzelman pointed out that for a symmetric reader, it can be much more efficient if the NNZ refers to the actual number of positions in the matrix with stored values.

Use case: when reading a Matrix Market file, "number of nonzeros" refers to the number of values stored in the Matrix Market file. In the symmetric case, this is not typically the same as the number of positions in the matrix with an entry, due to a single stored value in the triangle being mirrored across the diagonal. When unpacking a symmetric matrix into a non-symmetric storage format, as is commonly done in many sparse matrix libraries, this means you do not know the amount of storage needed until you have read through the matrix to determine the number of elements in the diagonal. The amount of storage needed is number_of_diagonal_elements + 2*number_of_triangle_elements.

Ben: This seems like an important use case we should support. However, it does seem non-intuitive to me that the number of nonzeros listed in the JSON would sometimes not correspond to the number of stored values in binary.

What exactly do we mean by nonzeros? Well, we mean explicitly stored values, which could be zero. We then have a few different possible definitions:

  • Actual number of values stored in memory
  • Number of positions in the abstract "matrix" which have a stored value. (In symmetric case, includes mirrored triangle values.)

What libraries have support for symmetric matrix types?

Isaac: Python libraries generally don't have support; Scipy.Sparse does not.

Julia does, but segfaults if you ask for the number of nonzeros.

Tadd: In PyTorch, there's no way to get the NNZ at all from a sparse matrix.

Willow: Is it too difficult and ambiguous a question to put NNZ as a property in JSON?

What does SuiteSparse Matrix Collection do?

  • Matrix Market reports number of matrix tuples stored in the Matrix Market file.
  • Web interface reports something different:
    • "Pattern entries," the number of positions in the matrix with an element (including explicit zeros)
    • "Nonzeros," the number of numerically nonzero elements in the matrix (pattern entries - explicit zeros)

"Pattern entries" is the information that Albert would like here.

Ben: We previously discussed "attributes," which are optional information about the matrix that could be used for optimization. For example, an attribute could indicate that a matrix happens to be symmetric, even when the storage format is not symmetric. An attribute could also indicate that a matrix represents an undirected graph, a graph with no cycles, etc. We could provide number_of_triangle_elements and number_of_diagonal_elements as attributes.

Albert: This is sub-optimal, since the parser still has to write the inefficient case. It would be best if all the information to do optimal parsing were always there.

This would help, but would probably also be confusing. It does also require parsers that don't take advantage of this information to compute it if they don't already have it available.

Consensus:

  • Introduce attributes as a concept. Define number_of_triangle_elements and number_of_diagonal_elements as optional attributes.

  • Provide a note in the specification saying that a quality implementation would be expected to provide the number_of_triangle_elements attribute for a symmetric format, since this allows important optimizations.

  • Number of nonzeros property? No consensus yet. It would seem like there are three possible options:

  1. Number of elements explicitly stored in the matrix format (len(colind) for CSR, what's currently in Matrix Market.)
  2. Number of elements with explicit entries in the matrix (2*(len(colind) - number_of_diagonal_elements) + number_of_diagonal_elements, the quantity Albert would like and "pattern entries" from Suite Sparse web interface.)
  3. Number of elements with nonzero values in the matrix (doesn't count explicit zeros, "nonzeros" from Suite Sparse web interface.)

Ben's opinion: 1. and 2. are the reasonable options, with 1. being the most intuitive.

2024-03-04

Agenda

Notes

  • What do we need to do for sparsity support?

2024-02-20

Notes

  • Discussed status of Ben's binsparse C parser
    • Binsparse reading/writing is essentially fully functional
    • Still working on some aspects of Matrix Market parsing (e.g. complex)

Spent most of discussion talking about testing:

  • What kind of testing do we want?
  • Is Matrix Market -> Binsparse enough? Can we simply h5diff the results?
  • h5diff might be too strict
    • Different interpretations of comments?
    • Different interpretations of types? (e.g. integer becomes int16 vs int32 vs uint64. All should be valid.)
  • What about reading in a Matrix Market file, then writing the results?
    • Might have some bugs. e.g. what if you don't handle transpose properly internally, but end up outputting the correct output by just carrying through the transpose flag?
  • We could have the testing infrastructure do some simple operation, such as a conversion between formats or a matrix operation (elementwise, squaring, etc.)
    • Implementing these operations is going to be complex for a parser
    • BUT: testing infrastructure doesn't need to be limited to parser. Can use a matrix library like SuiteSparse.

2024-02-05

Agenda

  • How do we spell out properties (e.g. at the element level, to define a struct of arrays)?
  • Where are we at?
    • 1.0
      • spec is done (barring any minor adjustments as we implement stuff)
      • Need to finish converting matrices
        • Converting the matrices doesn't stress all of the different formats and datatypes
      • Need to make c bindings
      • Paper: arxiv or joss?
        • wanna get on hacker news
        • get it into the graphblas spec
      • SparseBlas
      • Testing (at least we should be able to interoperate with each other)
/* Sketches of possible C bindings discussed in the meeting. Helper names such as
   type_t, INT_T, FLOAT32_T, FLOAT64_T, and read_matrix are assumed, not defined here. */

enum format_t {
    COO = 0,
    CSR = 1,
    CSC = 2
        /* ... */
};

struct array_t {
  void* values = nullptr;
  size_t size;
  type_t type;
};

/* Option 1: a single matrix struct that holds the arrays directly. */
struct matrix_t {
    array_t values;
    array_t pointers_to_1;
    array_t indices_0;

    size_t nrows = 0;
    size_t ncols = 0;
    size_t nnz = 0;

    format_t format;
};

/* Option 2: a tagged struct, with a format-specific payload behind a void*. */
struct matrix_t {
    format_t format;
    void* data;
};

struct coo_t {
    array_t rowind;
    array_t colind;
    array_t values;
};

auto m = read_matrix("my_file.bsp");

if (m.format == COO) {
    coo_t* coo_m = (coo_t *) m.data;

    assert(coo_m->rowind.type == INT_T);
    assert(coo_m->colind.type == INT_T);

    assert(coo_m->values.type == FLOAT32_T);

    int* rowind = (int *) coo_m->rowind.values;
    /* ... */

    if (coo_m->values.type == FLOAT32_T) {
        float* values = (float *) coo_m->values.values;

    } else if (coo_m->values.type == FLOAT64_T) {
        double* values = (double *) coo_m->values.values;

    }
}

2023-12-11

Agenda

  • Support for property graphs / graphs with multiple sets of edges between vertex pairs
    • Possible solutions:
      • Willow: add an extra dimension for each type of edge
      • User-defined datatype
  • Full treatment of user-defined datatypes

Notes

  • Property graphs

    • Graphs where edges can have multiple properties (e.g. height, length, width)
    • Nodes may also be grouped, where each node type has the same properties
  • To handle property graphs, we can add a special "element" type to the last level

    • This will essentially state that there are multiple values for each stored nonzero
    • These values will have names and be stored in separate arrays by type
  • To handle user-defined types, we will rely on the binary container's support for user-defined types

    • e.g., create an HDF5 type for your struct
  • However, we can also easily allow users to dump an array of structs to disk by creating a byte[n] type, which would store n bytes

  • Consensus: add a feature for storing multiple values of the same type for each value.

    • e.g.: uint8[12] would mean each value is an array of 12 uint8 values.
    • complex is now kind of an alias to float32[2], float64[2], etc.
  • Consensus: add a feature for struct of arrays.

2023-11-27

Agenda

  • Overview of v2

Notes

  • Binsparse: Willow's generate_reference.jl can pull any Matrix Market file and convert it to one of the Binsparse formats
    • Limitation: doesn't handle comments (just not implemented yet)
  • Converting SuiteSparse Matrix Collection: what format to use?
    • Proposal: store as COO (COOR), which makes sense as an interchange format
  • Topics for 2.0 next time:
    • Tensor Symmetry
    • Unordered indices
    • Duplicate indices
    • Chunking
    • Is there anything else you want to see?

2023-10-30

Agenda

We have two main items for the agenda:

  1. Discuss implementations, best mechanism for sharing matrices. (Should we create a GitHub?)
  2. Discuss V2.

Notes

  • Discussed adding a set of (small) matrices to the repository

  • Discussed chunking

    • We could naively chunk the binsparse datasets, but this likely does not do everything we want
      • e.g. for CSR format, the way the matrix is chunked will not necessarily be meaningful
      • We will likely end up with nonzeros from different parts of the matrix
    • We would preferably chunk the matrix spatially
    • This is essentially many small Binsparse matrices
    • Willow: we could also use our tensor representation to represent chunking, where the higher dimensions represent splitting the matrix into chunks.

2023-10-16

Agenda

We have two main items for the agenda:

  1. Discuss implementations, best mechanism for sharing matrices. (Should we create a GitHub?)
  2. Discuss V2.
  • HDF5 binsparse format currently stores metadata inside a "binsparse" key in a string of JSON text. This text is stored inside an attribute named "binsparse" in an HDF5 group. Is this too redundant, and is this what we want?

Notes

  • Discussed converting SuiteSparse Matrix Collection to Binsparse format.
    • There are two types of files in the SuiteSparse Matrix Collection: 1) Matrix Market files (including some vectors stored in .mtx files) and 2) text files, which can have arbitrary contents. Generally the text files can be viewed as lists of strings, where each line of the file contains a string, with the line number corresponding to some nonzero. The semantic meaning is dataset specific, however.
  • We should be able to convert all the SuiteSparse Matrix Collection matrices to a single Binsparse HDF5 file programmatically by
    1. Storing each .mtx file as a Binsparse matrix inside an HDF5 group with the same name as its filename (sans the .mtx)
    2. Storing each .txt file as a string dataset inside the root group.

2023-09-25

Agenda

  • file extension ideas: .bsp.h5?
  • If the complex type modifier only applies to floats, can we just handle it the same way we handle bint8, rather as some complicated "modifier" structure?
  • Can we add a run-length-encoding level type called "repeated"?

Notes

  • We want to store the JSON blurb as an attribute using an HDF5 string
  • Discussed how to set up automated testing of our implementations

2023-09-11

Agenda

  • We are missing named formats for dense matrix and dense vector
    • We decided on CVEC, DVEC, DMAT, DMATR, DMATC
    • Willow will work on a PR
  • Can we specify that the fill_value array also has a corresponding fill_value_type entry?
    • This is already specified, we will leave it alone
  • Discuss storing the JSON blob as an attribute in the HDF5 implementation: shall we define the HDF5 type of the array?
    • This isn't in spec, we'll follow up later if it comes up again.

Notes

  • Consensus: add dense matrix and dense vector formats. (These are for true dense formats, which cannot represent sparse matrices without a fill value.)
  • Tentative names: dvec, cvec for dense and compressed vectors. dmat, dmatr, dmatc for dense matrices and row-major, column-major matrices.
  • Specify a "data_type" for "fill_value"

2023-08-28

Agenda

  • Testing progress (brief)
  • Multi-dimensional proposals

2023-08-21

Agenda

  • How to deal with complex numbers?
    • Currently store real and imaginary parts together
    • Currently allow any scalar type (e.g. integer complex types)
    • Probably should restrict to floating point
  • How to store Matrix Market pattern matrices? (As an ISO-valued true bint8?)
  • https://github.com/ivirshup/binsparse-python

Notes

  • How to deal with user-defined types?
    • Ben: ideally could handle this using binary container's user-defined type facilities.
    • e.g. could have type equal to user[user_typename], where user_typename is a user-defined label.
      The parser would have to somehow keep a library of binary container objects (e.g. an HDF5 type object)
      and ...

2023-08-07

Agenda

  • Discuss HDF5 sparse chunks RFC
  • Discuss current status of spec

Notes

  • HDF5 sparse chunks RFC seems quite different from Binsparse, but likely worth talking with them

    • They may benefit from our sparse matrix support.
    • We may benefit from their chunking.
  • Discussed current status

    • Working on bringing implementations up to spec
    • Pass matrices back and forth, identify any spec issues
  • Discussed converting SuiteSparse Matrix Collection

    • Minimum viable product: take Matrix Market tarball, convert primary Matrix Market file to Binsparse format. Package up Binsparse file and any auxiliary files in a new tarball.
    • Optional: could also convert any other Matrix Market files in the tarball.
    • Optional: could also store any text files as HDF5 datasets.
    • Converting both Matrix Market files and text files to HDF5 datasets might allow representing tarball with a single HDF5 file. However, if there are any other file types than text and Matrix Market, this would represent a problem.
    • Original SuiteSparse Matrix Collection metadata field should remain the same.

2023-07-21

Agenda

  • Implementation status?
  • When dealing with embedded Binsparse matrices, what should that look like? Should keys that correspond to binsparse matrices have a particular form? (e.g.: default is binsparse, would binsparse_MYMATRIXNAME be assumed to correspond to a Binsparse format matrix? Or should it be nested inside another dictionary? How will these be identified?)
  • What's the format for version number?
  • Should we "release" the spec? (e.g. release v ~0.9, to be increased to 1.0 after sufficient implementation experience?)
  • Questions for Tim:
    • Do we have full coverage of everything we need to store the matrices in SuiteSparse Matrix Collection?
    • How shall we decide which binary format to use for each matrix in SSMC? (e.g. we could do some quick math based on nnz/row and pick COO, CSR, or DCSR, or try to somehow use metadata to pick?)
    • How exactly will we need to repackage the metadata comment in an SSMC Matrix Market?

Notes

  • Version numbers?

    • Add a key for version.
    • Use semver with only two components (e.g. 1.2)
    • Isaac will submit a PR adding the version key and "releasing the spec"
    • Start with version 0.5 for our current release candidate
    • For development versions, we can use a version with "dev" suffix to indicate a dev version of the spec
    • This does not necessarily need to be standardized.
  • How to determine which file format to use?

    • Use COO when the matrix type is "coordinate". Use Dense when the matrix type is "array".
  • How to nest binsparse or use multiple binsparse in the same file?

    • A binsparse parser is always pointed at an HDF5 group or HDF5 file, looks at the descriptor in that group, then parses the matrix. This allows for different formats in different subgroups. <- this needs a PR too
  • What is the minimum viable product for MatrixMarket stuff?

    • It's a script that parses a Matrix Market tarball, converts mtx files to HDF5 binsparse, and copies the comment field of the mtx into the comment field of binsparse. (A rough sketch is at the end of these notes.)
  • Let's add a directory of more complicated reference binsparse files for formats other than COO.

  • Let's add a list of parsers on the binsparse homepage!
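
A rough sketch of what such a conversion script could look like (not the reference implementation; the descriptor keys and dataset/attribute names follow the HDF5 convention discussed in these notes, but treat them as assumptions):

import json
import h5py
import scipy.io
import scipy.sparse

def mtx_to_binsparse_hdf5(mtx_path, h5_path, comment=""):
    a = scipy.sparse.coo_matrix(scipy.io.mmread(mtx_path))
    descriptor = {
        "binsparse": {
            "version": "0.5",
            "format": "COO",
            "shape": list(a.shape),
            "number_of_stored_values": int(a.nnz),
            "data_types": {"indices_0": "int64", "indices_1": "int64",
                           "values": "float64"},
            "comment": comment,
        }
    }
    with h5py.File(h5_path, "w") as f:
        f.create_dataset("indices_0", data=a.row.astype("int64"))
        f.create_dataset("indices_1", data=a.col.astype("int64"))
        f.create_dataset("values", data=a.data.astype("float64"))
        # JSON blurb stored as an HDF5 string attribute, per the convention above.
        f.attrs["binsparse"] = json.dumps(descriptor)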

2023-07-10

Conclusion: we don't need to specify

2023-06-26

What else is needed for implementations?

  • Do we need store specific documentation on formats?
    • Maybe we will punt for now
    • [Jim] But add a space for "implementation details"
  • [Isaac] Reference datasets
    • [Ben] small ones into the repo
    • [Willow] Grab these out of the matrix market
  • [Isaac] When do we start grabbing people to take a look?
    • After we discuss + implement
  • Discussion of which level pointers live at
    • Conclusion: pointers_to_level

2023-06-12

Agenda

  • Do we have everything we need to store all matrices in the SuiteSparse Matrix Collection?

  • Discuss plan for dealing with symmetric, hermitian, and skew-symmetric matrices.

Notes

  • We reviewed a lot of the discussion from the previous meeting.
  • The SuiteSparse Matrix Collection has a lot of metadata included in the download tarball. The metadata is stored as additional files stored alongside the primary matrix. The metadata's structure is defined in the Matrix Market file's comments section.
  • There are two methods we could use for storing these in Binsparse:
  1. We could store the main matrix as a Binsparse file, inside of a tarball, and store the original structured comment from the Matrix Market file inside the comments section of the Binsparse file. This is very close to the status quo, with Matrix Market replaced by Binsparse.
  2. Since Binsparse supports embedded sparse matrices, the metadata could potentially be stored alongside the Binsparse file in an HDF5 file or other container. The structured comment could also use JSON, since Binsparse comments allow arbitrary user-defined JSON.
  • The specific mechanism is fundamentally outside the spec, but we need to support handling this kind of metadata.

  • We discussed symmetry, and the PR Ben submitted adding symmetric_lower, symmetric_upper, and the same for skew-symmetric and Hermitian matrices.

  • Discussed the possibility of adding a "packed dense symmetric format," which would store a triangle of a dense matrix as a jagged array. Did not reach consensus on whether to add. There was some feeling that this is unnecessary, as we could already store a symmetric dense matrix with symmetric_lower or symmetric_upper, but wasting some disk space.

  • Discussed the possibility of adding an informational "symmetric" flag that does not affect the on-disk storage format, but simply tells the user/reader that the matrix happens to be symmetric, even though it is not necessarily in a symmetric storage format.

  • Discussed how having an informational flag at the same level as the "structure" key might be confusing.

  • Discussed the possibility of adding an "attributes" or similar key, which would contain a list of informational matrix attributes. This could include an informational symmetric flag, as well as attributes like being a positive matrix or having no self-loops when interpreted as a graph.

2023-05-15

Agenda

  • Discuss number of pointers and indices arrays.

  • Discuss plan for dealing with symmetric, hermitian, and skew-symmetric matrices.

  • Do we have everything we need to store all matrices in the SuiteSparse Matrix Collection?

Notes

  • Overview of SuiteSparse matrix market collection
    • Extra fields (example: imdb)
      • Pajek collection
      • Sparse matrix of movies x actors
        • extra vectors of data
      • Metadata – Seems all json representable
    • Willow: Metadata could live at a higher level (outside of the binsparse descriptor)
      • Parent group?
    • Strings
      • Are strings supported
      • HDF5, zarr, arrow have variable length string types
      • Consensus was reached that we should add string to the list of datatypes, with appropriate caveats that the implementation may need to do some hacks to get it to work.
    • Symmetry
      • Who supports packed upper triangular formats?
      • Packed dense lower triangular is the straightforward binary interpretation of the MatrixMarket matrix array symmetric real, where the entry at (i, j) is stored at location i + j(j-1)/2.
      • A note from Willow: Dense lower triangular arrays are a subset of ragged (sometimes called jagged, awkward) arrays.
      • Jim: Are we just specifying a tag?
        • We can decouple whether a matrix is symmetric vs. whether we only store a triangle of it
        • perhaps storing triangles gets pushed to v2
      • There is contention over whether we want a packed dense triangular matrix.

2023-05-01

Agenda

  • Reach consensus on how to handle bitfield boolean matrices.
  • Evaluate whether there are any other features missing in v1. Can we store all SuiteSparse Matrix Collection matrices using binsparse v1?

Notes

  • There's not a huge need to represent actual boolean matrices that are not ISO. We will leave this out of the spec for now, although we have a decent proposal for adding bit packing (see last week's notes). (e.g. bit[int8])

  • We will add bint8 to the spec to represent an unsigned 8-bit integer that represents a boolean value.

  • We will add complex[x] to represent a complex number, where x is a standard type (not complex, please); the values array shall have size n*2, where element 2*i contains the real part of the i'th value and element 2*i + 1 contains the imaginary part of the i'th value.

  • We will re-merge ISO into the value.

  • Discussed

  • Symmetry is needed - likely will store symmetric_left and symmetric_right (or some equivalent with better naming.)

2023-04-17

Attendees: Willow Ahrens, Benjamin Brock, Jim Kitchen, Isaac Virshup, Erik Welch

Agenda

  • Discuss ISO-value PR
  • Discuss types in #25

Notes

  • General consensus that ISO-valueness being part of the sparse matrix format, as written in ISO-value PR, is the correct decision.

  • Question: can we support property graphs in v1, or is that completely left to v2?

  • Consensus: we plan to have some support through user-defined types. There are strategies for this:

    1. Users define their own type labels and provide a dictionary of type labels and type layouts to the parser when reading a file. Each language would have to define layouts for each format, and users could provide their own format.
    2. This should probably be made compatible with Arrow, which defines a language for user-defined types, has implementations in most languages, and has many users with their own user-defined types.

Next, we discussed the types laid out in #25. Most discussion centered on how to handle boolean-valued sparse matrices represented as bit arrays. (For instance, a CSR sparse matrix with boolean values, where the values array contains uint64 values, and the i'th element of the values array is defined by the expression (values[i / 64] >> (i % 64)) & 1).

  • There was general consensus that the "boolean bit-ness" of the values array is a quality of the values array similar to ISO-ness. In v2 this should be represented as a property of the last level, similar to ISO-ness.
  • Representing a bitfield boolean values array in the values array type may be acceptable if we make it a bit more obvious what it is. For example, by introducing a bint8, bit[int8], or similar format.

2023-04-03

Attendees: Willow Ahrens, Benjamin Brock, Tim Davis, Jim Kitchen, Isaac Virshup, Erik Welch

Notes

  • Discussed ISO values - should they be a property of the file format,
    the type of the values array, or a level in a matrix?

  • There seems to be consensus that for v2, ISO-ness should be a property of
    the level, possibly as a "is_iso" flag.

  • Ben feels that for v1, ISO-ness should be a property of the format OR value.

  • We reached consensus around Willow's proposal for v1, that ISO-ness should
    be reflected in the format of the matrix.

Willow wrote a draft of a property graph, with support for different properties at the bottom level.

{
  "swizzle": [1, 0],
  "format": {
    "level": "dense",
    "rank": 1,
    "subformat": {
      "level": "sparse",
      "rank": 1,
      "subformat": {
        "level": "multiplex",
        "subformats": {
          "weight": {
            "level": "element",
            "value_type": "float32"
          },
          "capacity": {
            "level": "iso",
            "value_type": "int32"
          }
        }
      }
    }
  }
}

2023-03-06

Attendees: Willow Ahrens, Tim Davis, Benjamin Brock, Jim Kitchen

Notes

Willow presented: Finch sparse tensors

  • Thinking of arrays as an index tree
    • Level is vector of fibers
    • Dense levels can be implicit in memory
  • Kinds of level: Dense, Sparse, Element
    • Sparse has idx + ptr
    • Dense is implicit
  • COO is a bit more complicated
    • Collapse COO indices into one level with multiple arrays, e.g. array of coordinates
  • Permutations for changing orders of dimensions

2023-02-27

Attendees: Erik Welch, Jim Kitchen, Ben Brock, Isaac Virshup

Notes

  • Going over spec
    • Attributes: namespacing?
  • The formats
    • arrays are numbered by iteration order
  • Are we missing a format?
    • Ben: Dense? (masked dense)
  • iso values
    • Bikeshedding on the name
    • "iso" for now? But erik and isaac don't like the "1x[{dtype}]"
  • Whats next
    • Review, go over one last time, pretty close to 1.0, 0.9?

2023-02-06

Attendees: Willow Ahrens, Erik Welch, Tim Davis, Jim Kitchen, Isaac Virshup

Agenda

  • Are there implications for v1 from v2?

Notes

  • Naming of the arrays
  • 2d array for COO indices?
    • Multiple dtypes in indices was a point for 1d
    • Streaming
      • MLIR expect coordinate major
      • What does streamable mean in this context
        • Can create something a part at a time
    • This doesn't include values, so still only indices
    • Bump to v2?
  • Currently:
    • indices_0, pointers_0, etc.
  • 1 vs 0 indexing
    • Julia currently does 1 based indexing for its sparse arrays, so not binary compatible with sparse blas libraries
    • Supporting 1 and 0 across all readers could be complicated
    • could be 1.1, could be extension point
  • Future looking
    • Ideally, can we make v2 not breaking?
  • Next time
    • Willow presenting on n-dim compiler
    • Who else would be good to talk to here? MLIR, TACO

2023-01-23: graphblas binsparse meeting

Attendees: Willow Ahrens, Jim Kitchen, Benjamin Brock, Erik Welch, Isaac Virshup

Agenda

Notes
