or
or
By clicking below, you agree to our terms of service.
New to HackMD? Sign up
Syntax | Example | Reference | |
---|---|---|---|
# Header | Header | 基本排版 | |
- Unordered List |
|
||
1. Ordered List |
|
||
- [ ] Todo List |
|
||
> Blockquote | Blockquote |
||
**Bold font** | Bold font | ||
*Italics font* | Italics font | ||
~~Strikethrough~~ | |||
19^th^ | 19th | ||
H~2~O | H2O | ||
++Inserted text++ | Inserted text | ||
==Marked text== | Marked text | ||
[link text](https:// "title") | Link | ||
 | Image | ||
`Code` | Code |
在筆記中貼入程式碼 | |
```javascript var i = 0; ``` |
|
||
:smile: | ![]() |
Emoji list | |
{%youtube youtube_id %} | Externals | ||
$L^aT_eX$ | LaTeX | ||
:::info This is a alert area. ::: |
This is a alert area. |
On a scale of 0-10, how likely is it that you would recommend HackMD to your friends, family or business associates?
Please give us some advice and help us improve HackMD.
Syncing
xxxxxxxxxx
2025-03-19 [Today]
Agenda
Notes
Erik has pushed a PR that can generate diagrams visualizing Binsparse formats, both traditional and custom.
Very useful for those trying to understand how custom formats work. We could add links/diagrams in the README, or even potentially as notes in the spec.
Hameer has made progress on PRs for using Binsparse as an in-memory format in Python. There are PRs for both array API and SciPy. They will be meeting soon to discuss if there are any issues that need to be resolved upstream.
Discussed issue C reference HDF5 parser that has been pointed out by users: builds fail under certain compilers (particularly C++ in some Clang builds) if they don't support complex number literals, which is a compiler extension. Will try to find a workaround.
2025-03-05
Agenda
In-Memory format Update
Sparse BLAS Workshop
Notes
In-memory format update: Hameer has submitted PRs adding the new Binsparse in-memory format to PyData Sparse and SciPy. They are interested and on board.
Ben was at the Sparse BLAS workshop last week. It was a productive meeting, and the API and reference implementation are moving along. We are likely to discuss file IO in the near future, and I will propose adopting Binsparse as a supported file format.
Keita from LLNL is working to adapt the Binsparse C++ parser implementation to also be able to use Metall, which allows memory-mapped file IO via C++ allocators. This could potentially be a good basis for distributed reading (and potentially distributed writing, although might require a tile-based abstraction on top of Binsparse as well).
2025-02-19
Agenda
reorganization
keys?Notes
number_of_dimensions x number_of_stored_values
.We could support this using an
is_contiguous
(or similar) key that specifies whether multi-rank sparse dimensions should be stored contiguously. Recall that custom formats can declare a coordinate format with a sparse level of rank one or greater:We could declare that the row and column dimensions shall be stored contiguously on disk with a new
is_contiguous
key.There would then be only one indices array, with the first
number_of_stored_values
values holding the row indices, and the nextnumber_of_stored_values
values holding the column indices.There are a few open questions:
indices
array? One option would be to name it the same as the first of the two dimensions that are being squashed. Another would be to give it a combined name like above.indices
a new, different name? e.g.contiguous_indices
? This would help prevent parsers not expecting contiguous indices from mis-parsing a file with contiguous index arrays.2025-02-05
Agenda
reorganization
key and other changes to support in-memory use cases.Notes
Willow:
Broadcast
is different from the others in that it represents a lazy operation similar to+
and-
.What is the primary use case? Is it that we have two libraries, both with lazy broadcast representations, and we want to move data between them with Binsparse?
What is the purpose of transpose?
dense | sparse
, but have rows and columns swapped in theirtranspose
key). For tensors, it becomes useful to treat your dimension ordering as somewhat arbitrary.Willow: should we have a canonical order of reorganization keys? e.g.,
split
,transpose
,combine
?split
,transpose
, andcombine
in any ordering.broadcast
might interrupt this.2025-01-07
Agenda
reorganization
key.Notes
Resubmit to ICPP
TODOs for the paper:
2024-12-09
Agenda
Discuss paper submission, potential extensions
Discuss plans for adding
reorganizations
key to support various kinds of in-memory views.Discuss adding support for 1-based indexing.
Discuss adding support for CSR4.
Notes
Rebuttal is submitted. ACM ICS is a possibility for resubmission.
There are two main avenues for enhancing the paper:
We discussed distributed parsing.
For reading, the primary case is likely reading a normal Binsparse file from a distributed file system (e.g. one downloaded from SuiteSparse Matrix Collection) in a parallel fashion.
This likely will involve one of two techniques:
For writing, things become more complicated. If we want to write things as quickly as possible, we could dump a series of tiles as separate Binsparse files. We would then need a mechanism to read these distributed dumps, which would probably be a collection of Binsparse matrices, either stored as different datasets or files.
However, if we want to store them in a single file, things become more complex. We would likely require an all-to-all operation to compute correct offsets or shift data to the desired processor. Distributed writing itself is also complicated. HDF5 has some support through its MPI interface, but we would likely want something more expressive.
In-memory formats require the ability to store views. The use-case is that we have a view in, say, PyTorch, that transposes or reshapes a matrix. We then want to pass this through to another framework like SciPy, Julia, etc.
The way to support this is by adding a collection of
reorganization
keys that provide a view of the matrix in some ways. This information could be ignored by the parser and they would still have a valid matrix (the original matrix before the view was created).There are a variety of possible keys: strides, reshape, transpose, index maps, and potentially more. Deciding the correct keys to add will likely depend on the use cases and implementation details.
A "reorganized" matrix potentially has less information than a traditional Binsparse matrix. For example, we may not know the number of stored values in the matrix. (Take for example the case of a strided view over a CSR's rows. We would need to add up the number of nonzeros in each row to get the total number of stored values in the view, which is extra computation that's not needed unless we actually need that value.) Index maps also might require some extra work to determine the shape of the reorganized view (although this could probably be figured out by a computer algebra system).
2024-12-5
Agenda
Notes
Contributions were not fully articulated
Did not fully support contributions we made
New contributions
2024-11-25
Agenda
Notes
(Libraries with a (1) have 1-indexed offsets and indices, while the others are 0-indexed.)
Most of these libraries have support for dense views, as well as things like broadcasting, which are all based on strides and offsets.
For sparse matrices, things are a little sparser:
Two things to consider adding to the spec:
2024-10-28
Agenda
2024-10-14
Agenda
Notes
User-Defined Data Types
byte
data type, which would allow users to store a series of bytes (e.g.byte[64]
for an array where each element is a collection of 64 bytes). This would allow a user to store an array of structs with something likebyte[n]
, wheren
is the size of the struct. The user would, however, have to worry about endianness themselves when switching between platforms, so this is not a truly portable solution for user-defined data types.2024-09-30
Agenda
Notes
2024-09-16
Agenda
Notes
2024-08-19
Agenda
Notes
Discussed GraphBLAS bindings
Discussed potential MKL bindings
Discussed Python community RFC for in-memory interchange format, potentially using Binsparse [link]
Discussed performance results benchmarks.
fast_matrix_market
to reference Binsparse HDF5 implementation using COO and CSR for reading/writing.sync
to flush writes to disk after writing. Warm benchmarks read in file once to warm up filesystem cache (this means the file will cached in memory) before reading and do not callsync
after writing, meaning the file may have only been written to memory.fast_matrix_market
can do multi-threaded reading as well.Willow: some PyData collaborators are considering using Binsparse + dlpack as an in-memory interchange format for sparse data between Scipy.Sparse, cuSPARSE, etc. They may join in future, and would likely have additional requests/requirements for Binsparse, since in-memory formats may have additional considerations.
2024-08-05
Agenda
Notes
2024-07-15
No quorum, short meeting.
2024-07-01
Agenda
Notes
What format to use for storing files for distributed/parallel reading/writing?
2024-06-17
Agenda
Notes
2024-05-28
Agenda
Notes
Ben: I have made progress on C parser, and have converted the entire SuiteSparse Matrix Collection to Binsparse.
COO Binsparse with HDF5's native GZip
-9
compression is around 122 GB. (Compared to 155 GB for GZ/Tarball'd Matrix Market.)Parsing on a single thread is significantly faster than fast matrix market. With multiple threads, fast matrix market does have slightly better performance than a single thread of Binsparse (using GZip
-9
, which is perhaps a bit unfair in favor of Matrix Market, as the file is uncompressed).Performing benchmarks involving a file system is a bit tricky. I am currently flushing the file system cache between experiments, but there are other effects (e.g. SSD cache) that may affect experiments.
Ben hasn't yet been able to find a way to host the SuiteSparse Matrix Collection in Binsparse, but a random selection of matrices is available [https://benjaminbrock.net/random_suitesparse_matrices/].
Isaac on compression:
GZip is super slow. It's the only compression HDF5 ships with, so GZip or uncompressed is often used.
LZ4 is much faster, however, and has codecs for Nvidia GPUs.
Isaac is doing a fair amount of work with TileDB, so if there is some aspect we'd like to investigate, let him know.
Erik: going to Scientific Python summit. i.e., he will be locked in a room in Seattle for several days with various leaders of Python projects. Are there any specific groups he should try to talk to about Binsparse?
2024-05-13
Agenda
number_of_stored_values
PR.Notes
Let's not define attributes: reverse PR
We should write up why we decided against attributes (possibly in the spec.)
History lesson: Rutherford-Boeing format went overboard on attributes, and it is not widely used.
Tim: there are a couple of state-of-the-art Matrix Market parsers that are faster
Tim: Matlab would be good to compare against, but this would require CSC.
2024-04-29
Agenda
Notes
Should we change how complex numbers are dealt with?
What about storing NNZ (number of stored values)?
number_of_stored_values
ornumber_of_stored_entries
ornumber_of_entries
ornumber_of_explicit_entries
Should we add a key for tensor extensions?
format
is a supported string?bytes[n]
data type2024-04-18
What do we need for 1.0?
- tests and reference output
- C reference implementation
- TACO parser based on ^
- GraphBlas parser based on ^
- Convert MatrixMarket
2024-04-01
Agenda
Yzelman: Re the file formats and parallel I/O — there’s also some recent work by Tim Griesbach called scda, but not yet released. They come from the adaptive mesh angle but have a similar principle of storing binary arrays. No hierarchical support but they do have support for variable-sized arrays, which may come into play sparse matrix side if allowing certain compression schemes that play with the index types (think CSB, CBICRS, etc). Theirs is fully seekable based on header info (and targeting distributed-memory I/O)
Thanks, I’ve noted it=) I/O is super important as in practice we found it almost always forms a significant part of the overall execution, so we spent (and continue to spend) quite a lot of effort in optimising it. But better we have a standard binary format (finally!) to optimise for, rather than everyone doing their own thing— very important effort this one
Notes
Discussed attributes PR. General agreement on the wording around attributes. Need some phrasing saying attributes can be ignored, but they must be correct if present. Discussed potential attributes, to list new attributes in GitHub comments in the PR.
number_of_diagonal_elements
contains_cycles
symmetric
,skew_symmetric
,hermitian
,anti-hermitian
,structure_symmetric
invertible
positive_definite
Discussed chunking/tiling. There are essentially two ways of offering tiling, an offset-based approach, and a tile grid-based approach.
m,n
, the indices of tilei,j
have the offseti*m,j*n
.)Erik: This works well, but we don't necessarily want a tile grid that's this rigid. Oftentimes sparse matrices have clusters of values in a small area, so we want to support tiles that are not necessarily uniform in size.
i,j
that maps its indices into a larger sparse matrix. These tiles could possibly overlap, which increases complexity. Without an explicit tile grid, a user would likely need to iterate through all tiles, check for overlap with the submatrix that they want to compute, and then add corresponding nonzeros from any overlapping tiles. This could be nice, as it would allow adding updates in a TileDB-like fashion without modifying the existing matrix. However, it is more complex.Isaac: Reading from HDF5 in parallel is possible with virtual datasets, and there is a proposal to allowing parallel IO in Zaar.
Willow: Do we actually need support for distributed to be in binsparse? We're getting into a lot of details that seem implementation-specific; could we just let the implementation decide how they use binsparse on their own?
Ben: We do need an official standard of some kind if libraries are going to move data back and forth. This standard doesn't necessarily have to be in binsparse, although I think that would be convenient. It is a good idea to get more implementation experience before we set anything in stone, though.
2024-03-19
Agenda
Notes
Ben: Should we add an NNZ property to the binsparse JSON? It is not currently part of the specification, with
number_of_elements
implicitly referred to.Willow: What exactly would the NNZ be defined as? In the ISO case does it store 1? What about symmetric?
We then had a long discussion about the possible meanings of NNZ. Albert-Jan Yzelman pointed out that for a symmetric reader, it can be much more efficient if the NNZ refers to the actual number of positions in the matrix with stored values.
Use case: when reading a Matrix Market file, "number of nonzeros" refers to the number of values stored in the Matrix Market file. In the symmetric case, this is not typically the same as the number of positions in the matrix with an entry, due to a single stored value in the triangle being mirrored across the diagonal. When unpacking a symmetric matrix into a non-symmetric storage format, as is commonly done in many sparse matrix libraries, this means you do not know the amount of storage needed until you have read through the matrix to determine the number of elements in the diagonal. The amount of storage needed is
number_of_diagonal_elements
+2*number_of_triangle_elements
.Ben: This seems like an important use case we should support. However, it does seem non-intuitive to me that the number of nonzeros listed in the JSON would sometimes not correspond to the number of stored values in binary.
What exactly do we mean by nonzeros? Well, we mean explicitly stored values, which could be zero. We then have a few different possible definitions:
What libraries have support for symmetric matrix types?
Isaac: Python libraries generally don't have support; Scipy.Sparse does not.
Julia does, but seg faults if you ask the number of nonzeros.
Tadd: In PyTorch, there's no way to get the NNZ at all from a sparse matrix.
Willow: Is it too difficult and ambiguous a question to put NNZ as a property in JSON?
What does SuiteSparse Matrix Collection do?
"Pattern entries" is the information that Albert would like here.
Ben: We previously discussed "attributes," which are optional information about the matrix that could be used for optimization. For example, an attribute could indicate that a matrix happens to be symmetric, even when the storage format is not symmetric. An attribute could also indicate that a matrix represents an undirected graph, a graph with no cycles, etc. We could provide
number_of_triangle_elements
andnumber_of_diagonal_elements
as attributes.Albert: This is sub-optimal, since the parser still has to write the inefficient case. It would be best if all the information to do optimal parsing were always there.
This would help, but would probably also be confusing. It does also require parsers that don't take advantage of this information to compute it if they don't already have it available.
Consensus:
Introduce attributes as a concept. Define
number_of_triangle_elements
andnumber_of_diagonal_elements
as optional attributes.Provide a note in the specification saying that a quality implementation would be expected to provide the
number_of_triangle_elements
attribute for a symmetric format, since this allows important optimizations.Number of nonzeros property? No consensus yet. It would seem like there are three possible options:
len(colind)
for CSR, what's currently in Matrix Market.)2*(len(colind) - number_of_diagonal_elements) + number_of_diagonal_elements
, the quantity Albert would like and "pattern entries" from Suite Sparse web interface.)Ben's opinion: 1. and 2. are the reasonable options, with 1. being the most intuitive.
2024-03-04
Agenda
Notes
2024-02-20
Notes
Spent most of discussion talking about testing:
h5diff
the results?h5diff
might be too strictinteger
becomesint16
vsint32
vsuint64
. All should be valid.)2024-02-05
Agenda
2023-12-11
Agenda
Notes
Property graphs
To handle property graphs, we can add a special "element" type to the last level
To handle user-defined types, we will rely on the binary container's support for user-defined types
However, we can also easily allow users to dump an array of structs to disk by creating a
byte[n]
type, which would storen
bytesConsensus: add a feature for storing multiple values of the same type for each value.
uint8[12]
would mean each value is an array of 12uint8
values.float32[2]
,float64[2]
, etc.Consensus: add a feature for struct of arrays.
2023-11-27
Agenda
Notes
generate_reference.jl
can pull any Matrix Market file and convert it to one of the Binsparse formats2023-10-30
Agenda
We have two main items for the agenda:
Notes
Discussed adding a set of (small) matrices to the repository
Discussed chunking
2023-10-16
Agenda
We have two main items for the agenda:
Notes
.mtx
files) and 2) text files, which can have arbitrary contents. Generally the text files can be viewed as lists of strings, where each line of the file contains a string, with the line number corresponding to some nonzero. The semantic meaning is dataset specific, however..mtx
file as a Binsparse matrix inside an HDF5 group with the same name as its filename (sans the.mtx
).txt
file as a string dataset inside the root group.2023-09-25
Agenda
Notes
2023-09-11
Agenda
Notes
2023-08-28
Agenda
2023-08-21
Agenda
pattern
matrices? (As an ISO-valuedtrue
bint8?)Notes
user[user_typename
whereuser_typename
is a user-defined label.Parser would have to somehow keep a library of binary container objects (e.g. HDF5 type object)
and
2023-08-07
Agenda
Notes
HDF5 sparse chunks RFC seems quite different from Binsparse, but likely worth talking with them
Discussed current status
Discussed converting SuiteSparse Matrix Collection
2023-07-21
Agenda
Notes
Version numbers?
How to determine which file format to use?
How to nest binsparse or use multiple binsparse in the same file?
What is the minimum viable product for MatrixMarket stuff?
Let's add a directory of more complicated reference binsparse files for formats other than COO.
Let's add a list of parsers on the binsparse homepage!
2023-07-10
Do we need a Skew Hermitian format?
More generally, should we describe how parsers should deal with unimplemented features?
Conclusion: we don't need to specify
2023-06-26
What else is needed for implementations?
pointer
live atpointers_to_level
2023-06-12
Agenda
Do we have everything we need to store all matrices in the SuiteSparse Matrix Collection?
Discuss plan for dealing with
symmetric
,hermitian
, andskew-symmetric
matrices.Notes
The specific mechanism is fundamentally outside the spec, but we need to support handling this kind of metadata.
We discussed symmetry, and the PR Ben submitted adding
symmetric_lower
,symmetric_upper
, and the same for skew-symmetric and Hermitian matrices.Discussed the possibility of adding a "packed dense symmetric format," which would store a triangle of a dense matrix as a jagged array. Did not reach consensus on whether to add. There was some feeling that this is unnecessary, as we could already store a symmetric dense matrix with
symmetric_lower
orsymmetric_upper
, but wasting some disk space.Discussed the possibility of adding an informational "symmetric" flag that does not affect the on-disk storage format, but simply tells the user/reader that the matrix happens to be symmetric, even though it is not necessarily in a symmetric storage format.
Discussed how having an information flag at the same level as the "structure" key might be confusing.
Discussed the possibility of adding an "attributes" or similar key, which would contain a list of informational matrix attributes. This could include an information symmetric flag, as well as attributes like being a positive matrix or having no self-loops when interpreted as a graph.
2023-05-15
Agenda
Discuss number of
pointers
andindices
arrays.Discuss plan for dealing with
symmetric
,hermitian
, andskew-symmetric
matrices.Do we have everything we need to store all matrices in the SuiteSparse Matrix Collection?
Notes
matrix array symmetric real
, where the entry at(i, j)
is stored at locationi + j(j-1)/2
.2023-05-01
Agenda
Notes
There's not a huge need to represent actual boolean matrices that are not ISO. We will leave this out of the spec for now, although we have a decent proposal for adding bit packing (see last week's notes). (e.g.
bit[int8]
)We will add
bint8
to the spec to represent an unsigned 8-bit integer that represents a boolean value.We will add
complex[x]
to represent a complex number, wherex
is a standard type (not complex, please), andvalues
shall be a sizen*2
andi/2
'th element contains the real part ofi'th
value andi/2 + 1
contains the imaginary part of thei'th
value.We will re-merge ISO into the value.
Discussed
Symmetry is needed - likely will store
symmetric_left
andsymmetric_right
(or some equivalent with better naming.)2023-04-17
Attendees: Willow Ahrens, Benjamin Brock, Jim Kitchen, Isaac Virshup, Erik Welch
Agenda
Notes
General consensus that ISO-valueness being part of the sparse matrix format, as written in ISO-value PR, is the correct decision.
Question: can we support property graphs in v1, or is that completely left to v2?
Consensus: we plan to have some support through user-defined types. There are strategies for this:
Next, we discussed the types laid out in #25. Most discussion centered on how to handle boolean-valued sparse matrices represented as bit arrays. (For instance, a CSR sparse matrix with boolean values, where the values array contains uint64 values, and the i'th element of the values array is defined by the expression
(values[i / 64] >> (i % 64)) & 1
).bint8
,bit[int8]
, or similar format.2023-04-03
Attendees: Willow Ahrens, Benjamin Brock, Tim Davis, Jim Kitchen, Isaac Virshup, Erik Welch
Notes
Discussed ISO values - should they be a property of the file format,
the type of the values array, or a level in a matrix?
There seems to be consensus that for v2, ISO-ness should be a property of
the level, possibly as a "is_iso" flag.
Ben feels that for v1, ISO-ness should be a property of the format OR value.
We reached consensus around Willow's proposal for v1, that ISO-ness should
be reflected in the format of the matrix.
Willow wrote a draft of a property graph, with support for different properties at the bottom level.
2023-03-06
Attendees: Willow Ahrens, Tim Davis, Benjamin Brock, Jim Kitchen
Notes
Willow pres: Finch Sparse tensor
Dense
,Sparse
,Element
Sparse
has idx + ptrDense
is implicitCOO
is a bit more complicated2023-02-27
Attendees: Erik Welch, Jim Kitchen, Ben Brock, Isaac Virshup
Notes
"1x[{dtype}]"
2023-02-06
Attendees: Willow Ahrens, Erik Welch, Tim Davis, Jim Kitchen, Isaac Virshup
Agenda
Notes
inidces_0
,pointers_0
, etc.2023-01-23: graphblas binsparse meeting
Attendees: Willow Ahrens, Jim Kitchen, Benjamin Brock, Erik Welch, Isaac Virshup
Agenda
Notes