# Developer Summit 2023 report: `scipy.sparse`
Date: May 26, 2023
Contributors: CJ, Julien, Isaac, Dan, Ross, Stéfan, Levi, Sebastian
## Context
`scipy.sparse` provides a widely used set of types for working with sparse data, which mimic the `numpy.matrix` API. As the community continues to move away from matrix semantics, we want to provide a sparse array-like API that follows modern conventions.
This involves a lot of work, including some necessary churn for end-users, but it also provides a unique opportunity to re-evaluate many design choices that have accumulated over the last 20+ years of SciPy development.
## What we accomplished
- Reorganized class hierarchies to allow easy `isinstance` checking.
    - `isinstance(..., sparray)` now selects only sparse arrays.
    - `isinstance(..., spmatrix)` now selects only sparse matrices.
    - A new private base class (`_spbase`) underpins all sparse container types.
- Streamlined the behavior of the `isspmatrix` and `isspmatrix_<fmt>` helper functions to better reflect their names (see [gh-18528](https://github.com/scipy/scipy/pull/18528)).
    - The `isspmatrix*` functions now return `True` only for sparse matrices, *not* sparse arrays.
    - `issparse` is the recommended function to check for either array or matrix sparse containers.
    - As part of this cleanup, internal usage of `isspmatrix` and the associated `isspmatrix_<fmt>` functions was replaced with `issparse` and `isinstance` checks, as appropriate.
- Updated the `sparray` interface, leaving `spmatrix` untouched for backward compatibility:
    - Deprecated several methods in sparse array classes: `asfptype`, `getrow`, `getcol`, `getnnz`, `getformat`.
    - Deprecated the `.H` and `.A` attributes in sparse array classes.
    - Ensured that `sparray` doesn't downcast index arrays from 64-bit to 32-bit.
- Came to a decision about creation functions for sparse arrays.
    - We identified three approaches:
        - **our choice**: Define new creation functions with distinct names in the existing `scipy.sparse` namespace. For example, `scipy.sparse.diags_array()` will act like `scipy.sparse.diags()` but return a sparse array instead of a sparse matrix.
        - Define a new `scipy.sparse.array` namespace. This would complicate the process of incrementally updating dependent library code, as upgrades would need to be performed in an all-or-nothing fashion.
        - Add an `array=None` keyword argument to each existing creation function. This would make it hard to change other arguments and clean up the API going forward.
    - This entails two updates for users over the transition period:
        - switching call-sites to use the new functions
        - (eventually) switching the new functions back to their shorter names, once the original functions are gone. This may need to coincide with a major version bump.
- Made progress toward 1D sparse arrays
    - Generalized `isshape` and `check_shape` to optionally handle non-2d shapes.
    - Started implementing n-dimensional `coo_array` support (see [gh-18530](https://github.com/scipy/scipy/pull/18530)).
- Explored feasibility and usefulness of defining `__array_ufunc__` and other `__array_*__` protocols for sparse arrays
    - See this [proof of concept branch](https://github.com/seberg/scipy/pull/new/sparse-array-ufunc-hack) that includes some workarounds.
    - This [experiment with direct ufunc fallback](https://github.com/seberg/scipy/pull/new/sparse-array-ufunc-machinery-hack) is slower than current specialized implementations, but could potentially reduce complexity of templating.
    - These efforts might also solve a long-standing issue related to build size/time of the sparsetools shared library. We sometimes get requests to support more dtypes (see [gh-7408](https://github.com/scipy/scipy/issues/7408)) but our current approach requires generating separate C++ functions covering the cross-product of all index and value types.
- Stopped performing slow O(nnz) checks for downcasting (see [gh-18509](https://github.com/scipy/scipy/pull/18509)).
    - We used to iterate through all the index arrays to see if we could fit them into a smaller bit width. This slow check has been removed.
- Improved documentation for sparse arrays
    - New module-level documentation added: [gh-18516](https://github.com/scipy/scipy/pull/18516)
    - Documentation of canonical formats: [gh-18539](https://github.com/scipy/scipy/pull/18539)
- Added missing benchmarks for matrix power [gh-18553](https://github.com/scipy/scipy/pull/18553)
- Triaged older issues from the backlog
    - Fixed [gh-15177](https://github.com/scipy/scipy/issues/15177): element-wise division densifies
    - Fixed [gh-16929](https://github.com/scipy/scipy/pull/16929): argmin/argmax return the wrong values
    - Fixed [gh-18494](https://github.com/scipy/scipy/issues/18494): minimum spanning tree ordering with `np.argsort(..., kind='stable')`
    - Partially addressed [gh-16774](https://github.com/scipy/scipy/issues/16774): int64 index arrays by default
## What remains to be done
- 1-d sparse array support
    - Construct generic tests for 1d-array semantics.
    - Continue iterating on [gh-18530](https://github.com/scipy/scipy/pull/18530): Generalize `coo_array` to support n-dimensional shapes.
    - Explore using a dense wrapper sparse array to enable testing / prototyping: `dps_array` in 2d [gh-18514](https://github.com/scipy/scipy/pull/18514) and then 1d.
    - Start looking at other sparse formats that might be a good fit for 1d array support.
- Decide whether SciPy wants to implement $n$-dimensional arrays (for $n > 2$)
    - Potentially relevant formats: CSF and GCXS
- Continue improving sparse creation functions
    - Fix [gh-18555](https://github.com/scipy/scipy/issues/18555): Improve validation for `scipy.sparse.eye` + friends.
    - Add sparse array creation functions for `eye`, `random`, and others.
    - Deprecate some of the old sparse matrix creation functions. Candidates include `spdiags`, `rand` and `identity`.
- Deprecate more matrix-specific functionality
    - Deprecate the `isspmatrix_<fmt>` functions now that the class hierarchy for `sparray` and `spmatrix` has been adjusted. To check for a particular format, users can write `x.format == 'coo'`.
- General performance improvements
    - Address [gh-18546](https://github.com/scipy/scipy/issues/18546): Fast path for sparse array "canonical format"
    - [MAINT Remove redundant checks in sparse arrays and sparse matrices](https://github.com/scipy/scipy/pull/18557)
- Adapting scikit-learn to support sparse arrays (to be discussed with scikit-learn's maintainers):
    - [RFC Supporting `scipy.sparse.sparray`](https://github.com/scikit-learn/scikit-learn/issues/26418)
    - [MAINT Use `scipy.sparse.isspmatrix_*`](https://github.com/scikit-learn/scikit-learn/pull/26420)
- [Decide on adding support for mixed dtype indices](https://github.com/scipy/scipy/issues/16774)
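
For the planned deprecation of the `isspmatrix_<fmt>` functions, the `x.format == 'coo'` check mentioned above could be wrapped as in the sketch below. The `has_format` helper is hypothetical (it is not a SciPy function), and the example only assumes that `issparse` and the `.format` attribute behave as described in this report.

```python
from scipy import sparse


def has_format(x, fmt):
    """Hypothetical helper: True if x is a sparse container stored as `fmt`.

    Works for both sparse arrays and sparse matrices, unlike the
    matrix-only isspmatrix_<fmt> functions.
    """
    return sparse.issparse(x) and x.format == fmt


coo_arr = sparse.coo_array([[0, 1], [2, 0]])
csr_mat = sparse.csr_matrix([[0, 1], [2, 0]])

checks = (
    has_format(coo_arr, "coo"),          # sparse array in COO format
    has_format(csr_mat, "csr"),          # sparse matrix in CSR format
    has_format(coo_arr.tocsr(), "coo"),  # converted, so no longer COO
    has_format([[0, 1]], "coo"),         # plain list is not sparse
)
```

Guarding with `issparse` first keeps the helper safe for dense inputs, which have no `.format` attribute.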