tags: scverse, meeting, community
# scverse community meeting notes
**scverse community meetings have an open agenda. Feel free to add any points you'd like to discuss ahead of or during the meeting!** We usually start off with a ~5min flash talk about new ecosystem packages or other recent developments in the community. If you would like to present at the meeting, please reach out to Gregor Sturm via the [scverse Zulip](https://scverse.zulipchat.com/).
* [Calendar event](https://calendar.google.com/calendar/event?action=TEMPLATE&tmeid=bWFiM25mM3VuNWYzazBhZWY4bm1nbWhzMmdfMjAyMzAyMjFUMTcwMDAwWiB0b285cnZzcTB1ODI4djQ3YnQyam1hdThna0Bn&tmsrc=too9rvsq0u828v47bt2jmau8gk%40group.calendar.google.com&scp=ALL) | Repeats every 2nd Tuesday at 6pm CET
* Zoom link: https://zoom.us/j/94079118063?pwd=bjhEaEpUUllCTUVSK0VTS2g0M3NQZz09
* All attendees are expected to follow the scverse code of conduct: https://scverse.org/about/code_of_conduct/
## topic ideas for future meetings
* genomic ranges follow-up
* out-of-memory/out-of-core access in AnnData
* awkward arrays in AnnData and their use-cases
*Giovanni Palla, Maren Buettner, Isaac Virshup, Alexander Malt, Clarence Mah, Cordata di vite, Emanuel Soda, Gregor Sturm, Kevin Lebrigand, Marco Varrone, Martin Fleischmann, Matt Rowe, Niklas Muller-Botticher, Rahul, Ross, Sebastian, Taobo Hu, Veronika, Paul Kiessling, Mikaela Koutrouli, Levi John Wolf, Emily Laubscher, Martin Kim, Wanqui Zhang, Thomas Doughetry, Shashawat Sahay, Danila Bredikhin, Lukas Heumos*
* SpatialData presentation
* [twitter thread](https://twitter.com/LucaMarconato2/status/1656239450131660800?s=20)
* Q [Maren] How does this interact with squidpy?
* [GP] Should integrate well due to same data model for anndata
* [Gregor Sturm] How does the format relate to OME-Zarr?
* [GP] It's OME-Zarr + a few things, those are being expanded on
* [Gregor Sturm] How do we work with multiple patients?
* [Marco Varrone] How does this work with multiple modalities?
* MuData not right now, but is a priority
* [Martin Fleischmann] Geopandas dev, what kinds of stuff do we need, what's the intersection?
* [Clarence Mah]
* [Levi John Wolf] https://rtree.readthedocs.io/en/latest/ The geopandas implementation would be here: https://geopandas.org/en/stable/docs/reference/sindex.html
* Multilevel representations of polygon
* Vector Tiles
* [Mikaela] Cell type proportion
*Gregor Sturm, Ben Parks, Valentin Marteau, Maren Buettner, Jayaram Kancherla, Davide Cittaro, Martin Kim, Danila Bredikhin, Isaac Virshup, Babu Mia*
* Genomic ranges: possible implementations (bioframe vs. pyranges vs. biocpy GenomicRanges)
* presentation by Jayaram Kancherla on BiocPy [GenomicRanges](https://github.com/BiocPy/GenomicRanges)
* bioconductor datastructures are "backwards compatible at all cost"
* ArtifactDB: Schema based storage of bioformats
* + Metadata
* Goals of biocpy
* Align with artifactdb representations
* Interop between R and python
* Being a little more flexible on column types
* Nested dataframes
* a transposed anndata
* Q: Bioframe
* Q: [IV] Is ArtifactDB available?
* Not yet
* Building on alabaster
* So a bunch of files, with metadata about how to read them in
* Q: [GS] Goals of biocpy
* A datastructure similar to bioconductor to get things into
* Q: [IV] How flexible on
Genomic ranges support in AnnData
* IV: bioframe is nice because it just operates on a pandas data frame. Also integration with open2c and seems pretty active
* BP: data should be easily accessible from other libraries, possibility to index important
* Nested list - grouped genomic ranges - More common in genetics, variants by point
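As a rough illustration of the "ranges are just rows in a table" idea raised for bioframe, the sketch below finds overlapping intervals between two small tables using only the standard library. The column names (`chrom`, `start`, `end`), the half-open coordinate convention, and the `overlap` helper are assumptions made for this example; this is not bioframe's, pyranges', or BiocPy's API.

```python
from collections import defaultdict


def overlap(a, b):
    """Return (i, j) index pairs where a[i] and b[j] overlap.

    Each interval is a dict with 'chrom', 'start', 'end'
    (assumed 0-based, half-open coordinates).
    """
    # Group the second table by chromosome so we only compare within a chromosome.
    by_chrom = defaultdict(list)
    for j, iv in enumerate(b):
        by_chrom[iv["chrom"]].append((iv["start"], iv["end"], j))

    pairs = []
    for i, iv in enumerate(a):
        for start, end, j in by_chrom[iv["chrom"]]:
            # Half-open intervals overlap iff each starts before the other ends.
            if iv["start"] < end and start < iv["end"]:
                pairs.append((i, j))
    return pairs


# Made-up example rows, standing in for e.g. peaks and gene annotations.
peaks = [
    {"chrom": "chr1", "start": 100, "end": 200},
    {"chrom": "chr1", "start": 500, "end": 600},
    {"chrom": "chr2", "start": 100, "end": 200},
]
genes = [
    {"chrom": "chr1", "start": 150, "end": 400},
    {"chrom": "chr2", "start": 200, "end": 300},
]
hits = overlap(peaks, genes)
print(hits)
```

Because the ranges are plain rows, the result is just a pair of row indices, which is the kind of output that could be used to index into an existing table from other libraries.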
*Davide Cittaro, Gregor Sturm, Isaac Virshup, Lukas Heumos, Emma Dann, Adam Gayoso, Giovanni Palla, Cordate de Vite, Can Ergen-Behr*
* Flash talk by Davide Cittaro: [schist](https://schist.readthedocs.io/en/latest/tutorials.html)
* Hackathon recap
* Identifying cell groups
* Stochastic block model
* Nested formulation overcomes finding small communities in large graphs
* Similarity to CellRank
* "cell marginals" similar to transition probability matrix
* Multiomic applications
* Can use multiple graphs for multiomic integration
* Use optimal transport for calculation of feature / feature dissimilarity, example of ATAC seq
* Bootstrapping on multiple subsamples
* GPU – possible, but not yet
* [Q: IV] How to get people to use this?
* Speed biggest current barrier
* [Q: GP] Is it time or memory bound?
* Can be pretty memory efficient, but limited by scaling out
* [GP] Any optimizations/ approximations
* It's a variational inference problem
* [Q: GP] How about metacells for speed up
* A metacell is kinda like a block
* But looking at dropping the whole knn graph, using a bipartite graph of genes to cells
* [Q: IV] What is the API for interacting with the tree of
* Currently only work with one level of hierarchy
* [Q: IV] is performance slow once, or multiple times
* Throughout analysis
* Priors can really cut down
*Adam Klie, Gregor Sturm, David Laub, Valentin Marteau, Maren Buettner, Trevor Manz, Sebastian Lobentanzer, Mark Keller, Danila Bredikhin, Emma Dann, Martin Kim, James Cranley, Garrett Ng, Jan Engelmann, Alina Frolova, Laura Martins, Davide Cittaro, Can Ergen-Behr, Andrew McCluskey, Kazumasa Kanemaru*
* Genomic Ranges (presentation by Qi An)
* Accessing genomic assembly metadata via SQL tables (IV)
* E.g. https://gist.github.com/ivirshup/9cb994e473f9f6f1325a361195d4d7a6#file-transcript_models_duckdb-py
* Ranges in chame ([notebook](https://github.com/gtca/chame/blob/main/docs/examples/ranges.ipynb) demo by DB)
* An Qi – DKFZ Heidelberg
* Genomic range usage
* Future issues:
* Categorical bug
* [IV & DB] Should work, may be a problem on our side
* Peaks called from samples not overlapping
* [IV] Alternatives? Bins
* [DC] Use pre-existing annotations
* [MB] Alternative to macs2: genrich
* IV: Package for genome annotations based on bioconductor sqlite files
* Maren - ChIPseeker annotate chip seq peaks
* annotation source? IV: pulls from ensembl
* Translating ids is quite useful
* Would also like to work with regulatory info
* ensembl regulatory information
* Distance to genes
* Sebastian Biocypher
* Knowledge graph connecting genomic annotation information
* Adapters to existing databases
* Demo on chame
* ED: solutions for plotting?
* e.g. "genome browser views"
* DB: there's an IGV jupyter widget https://github.com/igvteam/igv-notebook
* DC: deeptools https://pygenometracks.readthedocs.io/en/latest/
* AQ: [gosling](http://gosling-lang.org/examples/)
* higlass tools
* Viewing gene information would be nice
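To make the "annotation via SQL tables" and "distance to genes" points above a little more concrete, here is a minimal sketch using Python's built-in `sqlite3`. The table layout, column names, gene IDs, and the `nearest_gene` helper are all made up for illustration; they do not reflect the schema of the Bioconductor sqlite files or the DuckDB gist linked above.

```python
import sqlite3

# In-memory database with a hypothetical, minimal gene annotation table.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE genes (gene_id TEXT, chrom TEXT, gene_start INTEGER, gene_end INTEGER)"
)
con.executemany(
    "INSERT INTO genes VALUES (?, ?, ?, ?)",
    [
        ("GENE_A", "chr1", 1000, 2000),
        ("GENE_B", "chr1", 5000, 6000),
        ("GENE_C", "chr2", 100, 900),
    ],
)


def nearest_gene(con, chrom, start, end):
    """Return (gene_id, distance) of the closest gene on the same chromosome.

    Distance is 0 for overlapping intervals (half-open coordinates assumed).
    Returns None if the chromosome has no annotated genes.
    """
    return con.execute(
        """
        SELECT gene_id,
               MAX(0, gene_start - :peak_end, :peak_start - gene_end) AS distance
        FROM genes
        WHERE chrom = :chrom
        ORDER BY distance, gene_id
        LIMIT 1
        """,
        {"chrom": chrom, "peak_start": start, "peak_end": end},
    ).fetchone()


between = nearest_gene(con, "chr1", 2500, 2600)
inside = nearest_gene(con, "chr1", 1500, 1600)
missing = nearest_gene(con, "chr3", 10, 20)
print(between, inside, missing)
```

The same query shape would carry over to other SQL engines, with the caveat that the multi-argument `MAX` used here is SQLite's scalar function and may be spelled differently elsewhere (for example `GREATEST`).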
*Gregor Sturm, Alina Frolova, An Qi, Divyanshu Srivastava, Andrew McCluskey, Philipp Angerer, Lukas Heumos, Valentin Marteau, Isaac Virshup, Adam Gayoso, Martin Kim, Mikaela Koutrouli*
* [Functional Associations using Variational Autoencoders (FAVA)](https://github.com/mikelkou/fava) (Flash talk by Mikaela Koutrouli)
* Separate IO packages (GS)
* issue for scirpy: https://github.com/scverse/scirpy/issues/385
* was also being discussed for scanpy + other modalities
* Pytorch data loaders (IV)
* scverse example datasets (GS)
* FAVA (Flash talk by Mikaela Koutrouli)
* limitation of current protein-protein interaction networks: biased source, only contains information about well-studied proteins
* --> reconstruct them from OMICS data
* naive approach: co-expression (pairwise correlation) of proteins
* Q isaac: how to get from latent space back to genes?
* Recommended data source [DepMap](https://depmap.org/portal/)
* IO package
* central package or split up by modality?
* Genomic ranges in anndata
* An Qi – Presenting next meeting
* currently using bioframe
* danila: chame ([example](https://github.com/gtca/chame/blob/main/docs/examples/ranges.ipynb))
* pytorch dataset
* map style: numpy, iter style
* AnnData field validation (abstract )
* Biggest barrier is out of core at the moment
* First example could be "one anndata on disk" -> pytorch data loader
* How important is shuffling?
* Theoretically quite important, but not many concrete examples
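The map-style vs. iter-style distinction and the shuffling question above can be sketched in plain Python. PyTorch's map-style datasets implement `__getitem__` and `__len__`, while iterable-style datasets implement `__iter__`; the classes below mirror those two protocols without importing torch. The class names are hypothetical, a list stands in for "one anndata on disk", and chunk-wise shuffling is just one possible compromise for out-of-core data, not something decided in the meeting.

```python
import random


class MapStyleDataset:
    """Random access by integer index (needs __len__ and __getitem__)."""

    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        return self.rows[idx]


class ChunkedIterDataset:
    """Sequential access (needs only __iter__), reading one contiguous chunk at a time."""

    def __init__(self, rows, chunk_size, seed=0):
        self.rows = rows
        self.chunk_size = chunk_size
        self.seed = seed

    def __iter__(self):
        rng = random.Random(self.seed)
        starts = list(range(0, len(self.rows), self.chunk_size))
        rng.shuffle(starts)  # shuffle the order in which chunks are visited
        for s in starts:
            # One contiguous slice, i.e. one cheap read from disk-backed storage.
            chunk = list(self.rows[s : s + self.chunk_size])
            rng.shuffle(chunk)  # shuffle within the chunk currently in memory
            yield from chunk


rows = list(range(10))  # stand-in for the observations of an on-disk AnnData
map_ds = MapStyleDataset(rows)
iter_ds = ChunkedIterDataset(rows, chunk_size=4)
epoch = list(iter_ds)
print(epoch)
```

Fully random access (map style) gives the best shuffling but implies many small reads; the chunked iterator trades some randomness for contiguous reads, which is where the "how important is shuffling?" question becomes practical.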
*Gregor Sturm, Giovanni Palla, Dylan Lam, Luis Omar Correa, Clarence Mah, Danila Bredikhin, Mario Acera, Lukas Heumos, Isaac Virshup, Valentine Svensson, Haarshaadri jp, Hugh Warden, Luca Marconato*
* Bentotools (Flash talk by Clarence Mah)
* State of genomic ranges (IV) (https://github.com/scverse/anndata/issues/624)
* Bentotools – https://github.com/ckmah/bento-tools
* Clarence Mah – UCSD
* RNA biology, especially organization of RNA within a cell
* New updates:
* RNAFlux – Find subcellular domains with local expression
* Spatial domains defined by clustering, in this example defined by SOM
* [GP]: Ripley statistics in squidpy assume evenly distributed points. What's available in python?
* [CM]: Currently using colocation analysis metrics quite heavily
* [IV] How has performance been going with respect to dataset-size?
* Will be a problem going forward!
* Genomic ranges
* GS: can awkward arrays represent ranges?
*Danila Bredikhin, Babu Mia, Gregor Sturm, Giovanni Palla, Jesko Wagner, Isaac Virshup*
* Testing against release candidates (isaac)
* MuData in squidpy?
* DB: not quite
* GS: Spatial Airr data, visium based
* IV: How do people want to do spatial multimodal integration?
* IV: Testing against release candidates
* GS, IV: Can we remove the red X for a test that's okay to fail
* doesn't seem like this is possible. Closest thing that could work: return success always, and automatically add a comment that states the optional tests failed.
* IV: Limitation – see if new release breaks downstream packages
* Cron job? Stops after 60 days
* GP: hosting of data with aws credits
* IV: yes!
* Distribution may need more thought, but lets see how much it costs
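The "return success always and add a comment" workaround for optional tests noted above could look roughly like the wrapper below. This is only a sketch: the `run_optional` helper and the message wording are made up, the failing command is a stand-in for a real test run against pre-releases, and how the message would actually be posted as a comment in CI is left open.

```python
import subprocess
import sys


def run_optional(cmd):
    """Run cmd without propagating failure; return (ok, message)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    ok = result.returncode == 0
    status = "passed" if ok else f"FAILED (exit code {result.returncode})"
    message = f"Optional tests against pre-releases {status}."
    return ok, message


# Stand-in for e.g. a test run in an environment set up with `pip install --pre`.
failing_cmd = [sys.executable, "-c", "import sys; sys.exit(3)"]
ok, message = run_optional(failing_cmd)

# The script itself finishes normally (exit status 0) even though the wrapped
# command failed; a CI job could post `message` as a comment instead.
print(message)
```

The trade-off is the one raised in the discussion: the check mark stays green, so the failure is only visible if someone reads the comment.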
*Florian Heyl, Isaac Virshup, Philipp Angerer, Clarence Mah, Jesko Wagner & Hugh Warden, Gregor Sturm, Giovanni Palla, Tim Treis, Kevin Lebrigand, MD Babu Mia, Anna Diamant, Andrew McCluskey, Adam Gayoso, Laurens Lehner*
* Jesko: lightning talk
* Clarence: bento-tools’ SpatialData compatibility
* Context: According to the [documentation](https://spatialdata.scverse.org/en/latest/), beta is coming soon; bento-tools is very interested in refactoring to use SpatialData instead of heavily modded AnnData to support its data structures. SpatialData meetings would be more appropriate to discuss this, but unable to attend (1AM PST).
* Isaac: scanpy GPU support – what does this API look like?
* Context: [post on scverse zulip](https://scverse.zulipchat.com/#narrow/stream/316218-repo-management/topic/sklearn.3A.20compute.20backends/near/326834041) and ["Patterns for GPU support" conversation on the Scientific Python discord](https://discord.com/channels/786703927705862175/1072596591363489792)
* Isaac (optional): Benchmarking setup for core projects, how do we run this?
* Context: [Using asv](https://asv.readthedocs.io/en/stable/), how do we call for benchmarking runs on private infrastructure? We will have a server (with GPUs) to run benchmarks + a portal server to receive requests
* Gregor (optional): Integration tests: Leveraging GitHub actions' `cron` feature to regularly test core packages against development versions of dependencies?
* Jesko talk Drug Discovery from High Content Images
* ![overview slide](https://i.imgur.com/J0bjRzK.png)
* Single cell analysis w/ high content images
* 10-100m treated/ stained cells
* Feature extraction w/ squidpy, cellprofiler
* GP: How are you working with the images? Just features in the anndata?
* Not doing as much with the images once they hit anndata
* Loading data into a db? Maybe, but not so much of a problem yet
* GP will put in contact with *fractal(?)* developers
* IV: How does cell segmentation/feature extraction work?
* currently not custom solution, relying on cellprofiler/squidpy
* CM: Bento-tools + spatial transcriptomics
* Interested in using spatial data, but haven't yet
* Particular needs:
* Spatial subsetting
* Big usage of polygons
* GS: Differences from squidpy?
* Subcellular + mrna localization
* KL: clarification of use cases
* colocalization of transcripts, proximity of transcripts to cellular compartments
* IV: How do you identify subcellular features
* GP: FISHfactor – featurizing subcellular distribution
* TT: https://www.biorxiv.org/content/10.1101/2021.11.04.467354v1
* CM: Currently looking more for known features than taking an unsupervised approach, known features like lack of transportation
* IV: GPU backends for scanpy
* TODO: link to slides
* which backends to support? RAPIDS, pytorch, jax
* What should be the priorities?
* Different models for GPU support
* Integration into scanpy
* separate package that mimics the API
* plugin system that adds functionality to scanpy
* PA: Activating GPU support should always be explicit (explicitness includes operating on an implementation-specific data structure like CuArray from RAPIDS)
* AG: Scanpy is just a vendor of other libraries for basically all algorithms.
* PA: Not everything needs to be GPU: current numba use cases could be e.g. Rust instead. This would lower the maintenance burden of multiple GPU backends as fewer functions would need it.
* GS: Could move from separate packages (explorative) to plugin system
* IV: libraries that have done good support for both cpu and gpu?
* CM: [TensorLy](http://tensorly.org/stable/index.html)
* AG: [Aesara](https://aesara.readthedocs.io/en/latest/)
* AG: what about out-of-memory data?
* IV: rapids has good integration with dask
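One way to read the "explicit via data structure" and "plugin system" points from the GPU discussion is type-based dispatch: the caller opts into a backend by passing that backend's array type, and a plugin registers its implementation for that type. The sketch below uses the standard library's `functools.singledispatch`. `FakeDeviceArray` and `normalize_total` are hypothetical stand-ins defined here for illustration; this is not scanpy's or RAPIDS' API.

```python
from functools import singledispatch


class FakeDeviceArray:
    """Hypothetical stand-in for a GPU array type provided by some backend."""

    def __init__(self, values):
        self.values = list(values)


@singledispatch
def normalize_total(x, target_sum=1.0):
    # No silent fallback: unknown input types are an error.
    raise TypeError(f"No backend registered for {type(x).__name__}")


@normalize_total.register
def _(x: list, target_sum=1.0):
    # Default CPU implementation on plain Python lists.
    total = sum(x)
    return [v * target_sum / total for v in x]


# A separate plugin package could register its implementation on import.
@normalize_total.register
def _(x: FakeDeviceArray, target_sum=1.0):
    total = sum(x.values)
    return FakeDeviceArray(v * target_sum / total for v in x.values)


cpu_result = normalize_total([1.0, 1.0, 2.0])
gpu_result = normalize_total(FakeDeviceArray([1.0, 1.0, 2.0]))
print(cpu_result, type(gpu_result).__name__)
```

With this pattern the user-facing function stays the same, the result type follows the input type, and nothing moves to a GPU unless the caller has already created a device array.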
* GS: Testing against dev versions of upstream packages
* nightly builds to catch dependencies breaking our package?
* use `pip install --pre` to catch that early?
* TODO: figure out how notification for nightly builds works?
*Valentine Svensson, Gregor Sturm, Jesko Wagner, Philipp Angerer, Danila Bredikhin, Giovanni Palla, Harald Voehringer, Noorsher Ahmed (Bento tools), Isaac Virshup*
* What do we want out of these meetings? (IV)
* Recap of progress since launch (cookiecutter, ecosystem listing, paper, ...) (GS)
* Tutorials page (https://github.com/scverse/scverse-tutorials/issues/43) (GS)
* What do people want out of this meeting
* JW: Wants to know plans for scverse, where things are headed
* DB: Feedback from the community
* Synchronous catch up on issues
* JW: lightning talks
* Working on a number of imaging problems, could talk about
* NA: Keeping up on development with
* VS: Updates on current priorities
* suggests to always start meetings with short recap of current developments
* HV: Policies for scverse projects, best practices
* IV: https://github.com/scverse/ecosystem-packages
* Recap of projects (GS)
* IV: Perturbation data
* JW: Thinks things are generally do-able
* VS: Estimating effect size, fancier linear models
* Currently using zellkonverter, talking to lme4 brms
* brms generates stan code
* GP: pyMC?
* has used this in the past
* But brms model language is great – special additions for splines, hierarchical models, spatial correlation
* Systems of models
* IV: pyDESeq2
* VS: Currently does simple models, group vs group
* GS: https://github.com/scverse/governance/pull/44 There are PRs for more complex models
* VS: Package for extracting model fits
* IV: Genetics data
* VS: Datastructures
* GP: Person at helmholtz who we should follow up
* GP: Can we just call R?
* JW: Can we call julia/ how is julia GPU support?
* IV: Stats great, GPU not
* VS: Has used MixedModels.jl, liked it
* GS: Tutorials for scverse packages
* How do we select tutorials, where does this go
* VS: suggestion, put links to tutorial sections/ use cases clearly visible on landing page
* GS: Overview page on site, then separate sphinx site
* NA: Would like to have bento-tools visible here, but also how
* NA: similar to ecosystem checklist, have tutorial checklist
* GS: see https://github.com/scverse/governance/pull/44