# Developer Summit 2023 report: `scipy.sparse`

Date: May 26, 2023

Contributors: CJ, Julien, Isaac, Dan, Ross, Stéfan, Levi, Sebastian

> I've locked edit access for now. If you have changes or suggestions, let me know. [name=CJ Carey]

## Context

`scipy.sparse` provides a widely-used set of types for working with sparse data, which mimic the `numpy.matrix` API. As the community continues to move away from matrix semantics, we want to provide a sparse array-like API that follows modern conventions. This involves a lot of work, including some necessary churn for end-users, but it also provides a unique opportunity to re-evaluate many design choices that have accumulated over the last 20+ years of SciPy development.

## What we accomplished

- Reorganized class hierarchies to allow easy `isinstance` checking.
    - `isinstance(..., sparray)` now selects only sparse arrays
    - `isinstance(..., spmatrix)` now selects only sparse matrices
    - A new private base class (`_spbase`) underpins all sparse container types.
- Streamlined the behavior of `isspmatrix` and `isspmatrix_<fmt>` helper functions to better reflect their names (see [gh-18528](https://github.com/scipy/scipy/pull/18528)).
    - `isspmatrix*` now returns `True` only for sparse matrices, *not* sparse arrays.
    - `issparse` is the recommended function to check for either array or matrix sparse containers.
    - As part of this cleanup, internal usage of `isspmatrix` and the associated `isspmatrix_<fmt>` functions was replaced with `issparse` and `isinstance` checks, as appropriate.
- Updated the `sparray` interface, leaving `spmatrix` untouched for backward compatibility:
    - Deprecated several methods in sparse array classes: `asfptype`, `getrow`, `getcol`, `getnnz`, `getformat`.
    - Deprecated `.H` and `.A` attributes in sparse array classes.
    - Ensured that `sparray` doesn't downcast index arrays from 64-bit to 32-bit.
- Came to a decision about Creation Functions for sparse arrays.
    - We identified three approaches:
        - **our choice**: Define new creation functions with distinct names in the existing `scipy.sparse` namespace. For example, `scipy.sparse.diags_array()` will act like `scipy.sparse.diags()` but return a sparse array instead of a sparse matrix.
        - Define a new `scipy.sparse.array` namespace. This would complicate the process of incrementally updating dependent library code, as upgrades would need to be performed in an all-or-nothing fashion.
        - Add an `array=None` keyword argument to each existing creation function. This would make it hard to change other arguments and clean up the API going forward.
    - This entails two updates for users over the transition period:
        - switching call-sites to use the new functions
        - (eventually) switching the new functions back to their shorter names, once the original functions are gone. This may need to coincide with a major version bump.
- Made progress toward 1D sparse arrays
    - Generalized `isshape` and `check_shape` to optionally handle non-2d shapes.
    - Started implementing n-dimensional `coo_array` support (see [gh-18530](https://github.com/scipy/scipy/pull/18530)).
- Explored feasibility and usefulness of defining `__array_ufunc__` and other `__array_*__` protocols for sparse arrays
    - See this [proof of concept branch](https://github.com/seberg/scipy/pull/new/sparse-array-ufunc-hack) that includes some workarounds.
    - This [experiment with direct ufunc fallback](https://github.com/seberg/scipy/pull/new/sparse-array-ufunc-machinery-hack) is slower than current specialized implementations, but could potentially reduce complexity of templating.
    - These efforts might also solve a long-standing issue related to build size/time of the sparsetools shared library. We sometimes get requests to support more dtypes (see [gh-7408](https://github.com/scipy/scipy/issues/7408)) but our current approach requires generating separate C++ functions covering the cross-product of all index and value types.
- Stopped performing slow O(nnz) checks for downcasting [gh-18509](https://github.com/scipy/scipy/pull/18509)
    - We used to iterate through all the index arrays to see if we could fit them into a smaller bitwidth. This slow check has been removed.
- Improved documentation for sparse arrays
    - New module level documentation added [gh-18516](https://github.com/scipy/scipy/pull/18516)
    - Documentation of canonical formats [gh-18539](https://github.com/scipy/scipy/pull/18539)
- Added missing benchmarks for matrix power [gh-18553](https://github.com/scipy/scipy/pull/18553)
- Triaged older issues from the backlog
    - Fixed [gh-15177](https://github.com/scipy/scipy/issues/15177): element-wise division densifies
    - Fixed [gh-16929](https://github.com/scipy/scipy/pull/16929): argmin/argmax return the wrong values
    - Fixed [gh-18494](https://github.com/scipy/scipy/issues/18494): mst tree ordering with `np.argsort(..., kind='stable')`
    - Partially addressed [gh-16774](https://github.com/scipy/scipy/issues/16774): int64 index arrays by default

## What remains to be done

- 1-d sparse array support
    - Construct generic tests for 1d-array semantics.
    - Continue iterating on [gh-18530](https://github.com/scipy/scipy/pull/18530): Generalize `coo_array` to support n-dimensional shapes.
    - Explore using a dense wrapper sparse array to enable testing / prototyping: `dps_array` in 2d [gh-18514](https://github.com/scipy/scipy/pull/18514) and then 1d.
    - Start looking at other sparse formats that might be a good fit for 1d array support.
- Decide whether SciPy wants to implement $n$-dimensional arrays (for $n > 3$)
    - Potentially relevant formats: CSF and GCXS
- Continue improving sparse creation functions
    - Fix [gh-18555](https://github.com/scipy/scipy/issues/18555): Improve validation for `scipy.sparse.eye` + friends.
    - Add sparse array creation functions for `eye`, `random`, and others.
    - Deprecate some of the old sparse matrix creation functions. Candidates include `spdiags`, `rand` and `identity`.
- Deprecate more matrix-specific functionality
    - Deprecate the `isspmatrix_<fmt>` functions now that the class hierarchy for `sparray` and `spmatrix` has been adjusted. To check for a particular format, users can write `x.format == 'coo'`.
- General performance improvements
    - Address [gh-18546](https://github.com/scipy/scipy/issues/18546): Fast path for sparse array "canonical format"
    - [MAINT Remove redundant checks in sparse arrays and sparse matrices](https://github.com/scipy/scipy/pull/18557)
- Adapting scikit-learn to support sparse arrays (to be discussed with scikit-learn's maintainers):
    - [RFC Supporting `scipy.sparse.sparray`](https://github.com/scikit-learn/scikit-learn/issues/26418)
    - [MAINT Use `scipy.sparse.isspmatrix_*`](https://github.com/scikit-learn/scikit-learn/pull/26420)
- [Decide on adding support for mixed dtype indices](https://github.com/scipy/scipy/issues/16774)