Developer Summit 2023 report: scipy.sparse

Date: May 26, 2023
Contributors: CJ, Julien, Isaac, Dan, Ross, Stéfan, Levi, Sebastian

I've locked edit access for now. If you have changes or suggestions, let me know. CJ Carey

Context

scipy.sparse provides a widely-used set of types for working with sparse data, which mimic the numpy.matrix API. As the community continues to move away from matrix semantics, we want to provide a sparse array-like API that follows modern conventions.

This involves a lot of work, including some necessary churn for end-users, but it also provides a unique opportunity to re-evaluate many design choices that have accumulated over the last 20+ years of SciPy development.

What we accomplished

  • Reorganized class hierarchies to allow easy isinstance checking.
    • isinstance(..., sparray) now selects only sparse arrays
    • isinstance(..., spmatrix) now selects only sparse matrices
    • A new private base class (_spbase) underpins all sparse container types.
  • Streamlined the behavior of isspmatrix and isspmatrix_<fmt> helper functions to better reflect their name (see gh-18528).
    • isspmatrix* now only return True for sparse matrices, not sparse arrays.
    • issparse is the recommended function to check for either array or matrix sparse containers.
    • As part of this cleanup, internal usage of isspmatrix and the associated isspmatrix_<fmt> functions was replaced with issparse and isinstance checks, as appropriate.
  • Updated the sparray interface, leaving spmatrix untouched for backward compatibility:
    • Deprecated several methods in sparse array classes: asfptype, getrow, getcol, getnnz, getformat.
    • Deprecated the .H and .A attributes in sparse array classes.
    • Ensured that sparray doesn't downcast index arrays from 64-bit to 32-bit.
  • Decided how to provide creation functions for sparse arrays.
    • We identified three approaches:
      • Our choice: Define new creation functions with distinct names in the existing scipy.sparse namespace. For example, scipy.sparse.diags_array() will act like scipy.sparse.diags() but return a sparse array instead of a sparse matrix.
      • Define a new scipy.sparse.array namespace. This would complicate the process of incrementally updating dependent library code, as upgrades would need to be performed in an all-or-nothing fashion.
      • Add an array=None keyword argument to each existing creation function. This would make it hard to change other arguments and clean up the API going forward.
    • This entails two updates for users over the transition period:
      • switching call-sites to use the new functions
      • (eventually) switching the new functions back to their shorter names, once the original functions are gone. This may need to coincide with a major version bump.
  • Made progress toward 1D sparse arrays
    • Generalized isshape and check_shape to optionally handle non-2d shapes.
    • Started implementing n-dimensional coo_array support (see gh-18530).
  • Explored feasibility and usefulness of defining __array_ufunc__ and other __array_*__ protocols for sparse arrays
    • See this proof of concept branch that includes some workarounds.
    • This experiment with direct ufunc fallback is slower than the current specialized implementations, but could potentially reduce the complexity of the templating.
    • These efforts might also solve a long-standing issue related to build size/time of the sparsetools shared library. We sometimes get requests to support more dtypes (see gh-7408) but our current approach requires generating separate C++ functions covering the cross-product of all index and value types.
  • Stopped performing slow O(nnz) checks for downcasting (see gh-18509).
    • We used to iterate through all the index arrays to see if we could fit them into a smaller bit width. This slow check has been removed.
  • Improved documentation for sparse arrays
    • Added new module-level documentation (gh-18516).
    • Documented canonical formats (gh-18539).
  • Added missing benchmarks for matrix power (gh-18553).
  • Triaged older issues from the backlog
    • Fixed gh-15177: element-wise division densifies
    • Fixed gh-16929: argmin/argmax return the wrong values
    • Fixed gh-18494: mst tree ordering with np.argsort(..., kind='stable')
    • Partially addressed gh-16774: int64 index arrays by default

What remains to be done

  • 1-d sparse array support
    • Construct generic tests for 1d-array semantics.
    • Continue iterating on gh-18530: Generalize coo_array to support n-dimensional shapes.
    • Explore using a dense-wrapper sparse array (dps_array) to enable testing and prototyping: first in 2d (gh-18514), then in 1d.
    • Start looking at other sparse formats that might be a good fit for 1d array support.
  • Decide whether SciPy wants to implement \(n\)-dimensional arrays (for \(n > 3\))
    • Potentially relevant formats: CSF and GCXS
  • Continue improving sparse creation functions
    • Fix gh-18555: Improve validation for scipy.sparse.eye + friends.
    • Add sparse array creation functions for eye, random, and others.
    • Deprecate some of the old sparse matrix creation functions. Candidates include spdiags, rand and identity.
  • Deprecate more matrix-specific functionality
    • Deprecate the isspmatrix_<fmt> functions now that the class hierarchy for sparray and spmatrix has been adjusted. To check for a particular format, users can write x.format == 'coo'.
  • General performance improvements
  • Adapting scikit-learn to support sparse arrays (to be discussed with scikit-learn's maintainers):
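
As a small illustration of the format check suggested in the list above (x.format == 'coo' in place of the isspmatrix_<fmt> helpers), the sketch below wraps it in a helper. The name is_format is hypothetical and not part of SciPy; the example assumes a SciPy version that provides coo_array.

```python
from scipy import sparse


def is_format(x, fmt):
    """Hypothetical helper: True if x is a sparse container stored in format `fmt`."""
    # issparse guards against dense inputs, which have no .format attribute.
    return sparse.issparse(x) and x.format == fmt


# A tiny 2x2 COO array with two stored entries (values chosen arbitrarily).
coo = sparse.coo_array(([1.0, 2.0], ([0, 1], [1, 0])), shape=(2, 2))
csr = coo.tocsr()  # convert to CSR; the .format attribute changes accordingly

print(is_format(coo, "coo"))
print(is_format(csr, "coo"))
print(is_format(csr, "csr"))
```

Because the check relies only on issparse and the format attribute, it applies equally to sparse arrays and sparse matrices.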