# Increasing Sparsity Support for cuML

## Sparse Algorithms

### Priority High

#### 0.16 Incremental PCA (Divye)

- Should be a straightforward port from [sklearn](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/_incremental_pca.py#L278)
- Good for initial support and used in Scanpy

##### Pros

- Simple & straightforward implementation. Direct port from sklearn
- We can implement this today right in Python w/ CuPy

##### Cons

- Requires batched dense operations
- Can't avoid this batched dense op by mean centering on sparse structures because of the implicit conversion to dense
- We can combine the covariance matrices across the batches:
  1. Perform sparse `a.T.dot(a)` to get the covariance numerator. Sum these across batches.
  2. Normalize and subtract the outer product of the feature means.
  3. Compute `eigsh` or `lobpcg`.
  4. Perform the remaining steps here: https://github.com/scipy/scipy/blob/v1.4.1/scipy/sparse/linalg/eigen/arpack/arpack.py#L1873

#### 0.16 Sparse TSVD

- Uses QR as an optimization over LU
- cuML PR for Sparse RSVD: https://github.com/rapidsai/cuml/pull/1399
- Inputs need changing to cuSPARSE
- Needs perf & correctness evaluation

##### Pros

- If it works w/ sparse inputs, it's almost done.

#### 0.16 Sparse Input for Current PCA (Done)

- As a convenience, we can support this easily by computing the covariance on the input sparse matrix as follows. CuPy sparse doesn't support division or `mean()` currently.
  - `(a.T.dot(a) * (1/a.shape[0])) - cp.prod(a.sum(axis=0) * (1/a.shape[0]))`
  - Verified this works with a simple test program. We should be able to do the sparse GEMM in the C++ layer and reuse the remaining logic from the existing PCA.
- STATUS: Done.

#### 0.16? Using TSVD for Sparse PCA Inputs

- We can use the trick outlined here to perform implicit mean centering in TSVD: https://github.com/scikit-learn/scikit-learn/issues/12794
- And implemented here: https://github.com/facebook/fbpca/blob/master/fbpca.py
- This originated from a Scikit-learn issue and there is general interest in it on the Scikit-learn side.
- This is more scalable than computing the entire covariance up front because it allows the solver to compute it only when needed (and potentially in parallel). A reference SciPy sketch of this trick is included at the end of this document.

##### Pros

- Might be low-hanging fruit. Worth spending a day to test.

#### Hierarchical Clustering

- Being asked for by many folks. Recent ask from the Clustergrammer2 folks.

#### Sparse KNN (Corey)

- https://github.com/rapidsai/cuml/issues/2105
- An initial "tiled" implementation w/ cuSPARSE shouldn't be too hard
- Batched/tiled algorithm: choose a batch size for the maximum number of rows to consider and perform a matrix multiply of only those rows against batches of all the other rows (a runnable CPU-side sketch follows this section). It will look like:

```python
for row_batch in batches(total_rows):
    for col_batch in batches(total_rows):
        # Compute distances for the tile: we tile over columns for each batch
        # of rows, accumulating the result in the output after each tile.
        cusparsegemm(...)
        # Sort the indices and distances of each row by distance.
        sort_rows_by_key(...)
        # Merge the first k of this column batch into the accumulated output.
        merge_parts(...)
```

- Look into support for different distances
- Plug sparse support into UMAP / T-SNE
  - Will be very straightforward once sparse KNN works

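To make the batched/tiled algorithm above concrete, here is a runnable CPU-side sketch of the same pattern using SciPy/NumPy: a sparse GEMM per tile to get squared euclidean distances, followed by a sort-and-merge of the running top-k for each row batch. The function name, the `batch_size` parameter, and the brute-force merge via `argsort` are illustrative assumptions, not the cuML implementation, which would presumably use cuSPARSE and a GPU k-selection instead of `argsort`.

```python
import numpy as np
import scipy.sparse as sp


def tiled_sparse_knn(X, k, batch_size=1024):
    """Brute-force k-NN (squared euclidean) on a CSR matrix, tiled so that
    only one (batch_size x batch_size) dense distance block exists at a time.
    Each point is returned as its own nearest neighbor (distance 0)."""
    n = X.shape[0]
    sq_norms = np.asarray(X.multiply(X).sum(axis=1)).ravel()

    out_dist = np.zeros((n, k), dtype=np.float64)
    out_idx = np.zeros((n, k), dtype=np.int64)

    for rs in range(0, n, batch_size):          # batches of query rows
        re = min(rs + batch_size, n)
        best_dist = np.full((re - rs, k), np.inf)
        best_idx = np.zeros((re - rs, k), dtype=np.int64)

        for cs in range(0, n, batch_size):      # batches of candidate rows
            ce = min(cs + batch_size, n)

            # Sparse GEMM for this tile, then expand
            # ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2 into a dense block.
            prod = (X[rs:re] @ X[cs:ce].T).toarray()
            tile = sq_norms[rs:re, None] - 2.0 * prod + sq_norms[None, cs:ce]
            tile = np.maximum(tile, 0.0)

            # Sort and merge: keep the k smallest of (running top-k + tile).
            cand_dist = np.hstack([best_dist, tile])
            cand_idx = np.hstack(
                [best_idx, np.broadcast_to(np.arange(cs, ce), tile.shape)])
            order = np.argsort(cand_dist, axis=1)[:, :k]
            best_dist = np.take_along_axis(cand_dist, order, axis=1)
            best_idx = np.take_along_axis(cand_idx, order, axis=1)

        out_dist[rs:re], out_idx[rs:re] = best_dist, best_idx

    return out_dist, out_idx


# Illustrative usage on a random sparse matrix.
X = sp.random(5000, 200, density=0.05, format="csr", dtype=np.float64)
dists, idx = tiled_sparse_knn(X, k=10)
```
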
#### Naive Bayes Remaining Variants

- Status: the hard work has been done for Gaussian (continuous) NB and the remaining discrete variants (e.g., Bernoulli, categorical).
- Scikit-learn provides an abstraction for the discrete variants, and the heavy work for that was completed with the multinomial variant.
- Complement (Divye)
- Gaussian, remaining discrete variants (Corey)
- STATUS: Added the rest of the variants; needs tests.

### Priority Medium

#### Integrate Spectral embedding / clustering from RAFT

Status: need to make sure it is performing a separate spectral decomposition on all components.

- https://github.com/rapidsai/raft/pull/12
- STATUS: Andrei is updating cuML.
- Create Python wrappers
- This will bring the lanczos and lobpcg solvers

##### Pros

- Trivial to wrap and will provide an important sparse algorithm for spectral analysis
- Will allow us to implement the multi-component layout in UMAP

### Priority Low

#### Wrap new spectral embedding / clustering APIs (from RAFT) in Python

## Primitives

### Priority Medium

- Move sparse primitives over to RAFT
  - Representations, degree / row count, conversions, remove zeros, normalization
- Update the UMAP / T-SNE implementations to use RAFT
  - STATUS: This is almost done.
- Multi-component spectral layout
  - Should be straightforward once the lanczos/lobpcg solvers are in RAFT

## Pending CuPy Issues

### Priority High

Status: Figure out which items are already on the hook for CuPy 8.0. Get a conversation going.

#### High Priority

##### Array Indexing (for Scanpy filtering)

- https://github.com/cupy/cupy/issues/2360
- STATUS: Ready for review. Tests and profiling are complete.

##### svds

- https://github.com/cupy/cupy/issues/2359
- We should already be able to support this w/ cuSOLVER & cuSPARSE
- Can use the lobpcg primitive in RAFT and cuSOLVER for the eigendecomposition.
- Will allow us to use this approach right in Python: https://github.com/theislab/scanpy/blob/5bc37a2b10f40463f1d90ea1d61dc599bbea2cd0/scanpy/preprocessing/_pca.py#L264
- Not straightforward how to support the `LinearOperator` interface using cuSOLVER (a reference SciPy sketch of the `LinearOperator` pattern is included at the end of this document).

#### Medium Priority

##### Possible low-hanging fruit

- Sparse mean (for Scanpy / PCA mean): https://github.com/cupy/cupy/issues/2674
  - PR open; almost ready for review.
  - CuPy sparse supports summing over axis 0, which returns a dense vector, so this shouldn't be too hard.
- For different normalization techniques, we currently need to go back to SciPy:
  - Sparse elementwise division and multiply
  - We are able to multiply by reciprocals, but native support would help API compatibility.
- Column subsetting (for Scanpy): https://github.com/cupy/cupy/issues/2676
  - STATUS: PR ready for review
- Boolean column subsetting (for Scanpy)
  - STATUS: PR ready for review
- CSR multiply: https://github.com/cupy/cupy/issues/2675
  - `dot()` seems to be supported, which allows the covariance construction at least.
- sparse.linalg.norm

### Low Priority

- sparse.eigs / sparse.eigsh: https://github.com/cupy/cupy/issues/2497
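For reference on the implicit mean-centering trick mentioned under "0.16? Using TSVD for Sparse PCA Inputs" and on the `LinearOperator` question under `svds`, here is a minimal CPU-side SciPy sketch of the pattern: the centered matrix (the sparse input with its column means subtracted) is never materialized; its `matvec`/`rmatvec` are expressed in terms of the sparse `X`, so `svds` returns the leading principal directions directly. The helper name `centered_operator` and the random test matrix are illustrative assumptions following the scikit-learn / fbpca discussion linked above, not an existing cuML or CuPy API.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, svds


def centered_operator(X):
    """LinearOperator behaving like (X - ones * mean) without densifying X."""
    n_samples, n_features = X.shape
    mean = np.asarray(X.mean(axis=0)).ravel()

    def matvec(v):
        # (X - 1 * mean) @ v == X @ v - (mean @ v)
        v = np.ravel(v)
        return X @ v - mean @ v

    def rmatvec(u):
        # (X - 1 * mean).T @ u == X.T @ u - mean * sum(u)
        u = np.ravel(u)
        return X.T @ u - mean * u.sum()

    return LinearOperator((n_samples, n_features), matvec=matvec,
                          rmatvec=rmatvec, dtype=X.dtype)


# Top-10 singular vectors of the implicitly mean-centered matrix, i.e. the
# leading principal directions, computed without ever densifying X.
X = sp.random(10000, 300, density=0.01, format="csr", dtype=np.float64)
u, s, vt = svds(centered_operator(X), k=10)
```

A GPU version would presumably need equivalent matvec/rmatvec hooks, which is the crux of the `LinearOperator` support question noted under `svds` above.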