tags: scverse, meeting, community
# scverse community meeting notes
**scverse community meetings have an open agenda. Feel free to add any points you'd like to discuss ahead of or during the meeting!** We usually start off with a ~5min flash talk about new ecosystem packages or other recent developments in the community. If you would like to present at the meeting, please reach out to Mikaela Koutrouli via the [scverse Zulip](https://scverse.zulipchat.com/).
* [Calendar event](https://calendar.google.com/calendar/event?action=TEMPLATE&tmeid=bWFiM25mM3VuNWYzazBhZWY4bm1nbWhzMmdfMjAyMzAyMjFUMTcwMDAwWiB0b285cnZzcTB1ODI4djQ3YnQyam1hdThna0Bn&tmsrc=too9rvsq0u828v47bt2jmau8gk%40group.calendar.google.com&scp=ALL) | Repeats every 2nd Tuesday at 6pm CET
* [Zoom link](https://zoom.us/j/94079118063?pwd=bjhEaEpUUllCTUVSK0VTS2g0M3NQZz09)
* All attendees are expected to follow the [scverse code of conduct](https://scverse.org/about/code_of_conduct/)
*Severin Dicks, Isaac Virshup, Mikaela Koutrouli, Lukas Heumos, Giovanni Palla, Eljas Roellin, Maren Büttner, Cebrail, Illaria Billato, Sri Varra, Jennifer Foltz, Vincent Van Deuren, Davide Cittaro*
*Boris, Constantin, Gregor, Lukas, Anna, Tim Treis, Ori, Eljas, Babu Mia, Severin, Kveta, Zhaleh Safikhani, Francesca Drummer, Martin Kim, Taobo Hu, Phil Angerer, Jie Quan, Danila Bredikhin, Matilde Conte, Anna, Paul Johnston*
* [Tim]: Are you looking for exact results? Why are there differences?
* Differences are from underlying dependencies, e.g. scipy optimizers for pydeseq2
* Different choice in calculation of dispersion
* DESeq thresholds
* [Constantin] Are multiple covariates supported?
* [Isaac] How did you find making a statistical library like this in python?
* C++ code in deseq2 around GLMs was hard to replicate
* Hardest part was identifying exactly how DESeq does outlier detection
* Most wished there was a more scalable way to fit glms
* [Gregor] how are you testing? what about updates?
* Breaking changes have come from dependencies `glmGamPoi`
* Plan to update
* [Babu Mia] Do you want to add more / go beyond DESeq2?
* [Isaac] How does outlier detection work?
* Cook's distance
* Talk by Leah Wasser, pyopensci
* [Isaac] What are the bounds on package size?
* numpy too big
* but willing to go a little smaller than JOSS requirements
* [Phil]: this stuff is great! Especially docs
* [Gregor] How is finding reviewers?
* Can be hard for more esoteric/ not covered topics
* [Gregor] How is continuing review?
* Still in progress, but will have automated metric collection
* What happens if standards change?
* [Isaac] How is the astropy transition?
* Still proof of concept, starting with new packages
* Will probably go back later
* [Isaac] rOpenSci was an inspiration?
* Cohesive feeling in community
* Kartik of rOpenSci initially
* [Isaac] where do we go to follow up
* slack, but closed so only a few can join
* discourse less active
* [Phil] recommends https://discord.com/invite/pypa
* Isaac: where can we follow what you're up to?
* Pangeo partnership page: https://www.pyopensci.org/software-peer-review/partners/pangeo.html
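On the outlier-detection question from the pydeseq2 discussion earlier in these notes: Cook's distance measures how much a fit changes when one observation is left out. The sketch below computes it for an ordinary least-squares line in plain Python. This is a simplification chosen for illustration; DESeq2 and pydeseq2 apply the idea to per-gene GLM fits, and the toy data and the `cooks_distance` helper are made up.

```python
from statistics import mean


def cooks_distance(x, y):
    """Cook's distance of every point in a simple linear regression y ~ a + b*x."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar

    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(r ** 2 for r in residuals) / (n - p)
    # Leverage of each point (diagonal of the hat matrix).
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

    return [
        (r ** 2 / (p * mse)) * (h / (1 - h) ** 2)
        for r, h in zip(residuals, leverage)
    ]


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 12.0]  # the last observation is an outlier
distances = cooks_distance(x, y)
most_influential = max(range(len(x)), key=lambda i: distances[i])
```

Points with a large distance are candidates for flagging or replacement; the threshold used for that is a separate choice that this sketch does not cover.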
*Mikaela Koutrouli, Isaac Virshup, Gregor Sturm, Zoe Piran, Paul L Maurizio, Danila Bredikhin, Taobo Hu, Jennifer Foltz, Evan Lyall, Philipp Angerer*
* Zoe Piran: [Biolord - biological representation disentanglement](https://biolord.readthedocs.io/en/latest/)
* Mikaela – go over classifiers again
* Gregor – What were the points in the embeddings
* Jennifer – Can you use multiomics for biolord classify
* Yes for embedding, not for impute of other modality
*Geert-Jan Huizing, Mikaela Koutrouli, Gregor Sturm, Valentin Marteau, Eljas, Maria Puschhof, Zoe Piran, Lorenzo Merotto, Carson Poltorack, Jennifer Foltz, Ligan Zhu, Dominik Klein, Zhaleh, Danila*
* Geert-Jan Huizing will talk about Mowgli, a method for the integration of paired multi-omics data
### Discussion points:
* Gregor: how well does this scale out
* Isaac: any ideas for how we can better benchmark these kinds of methods
* It's hard
* Maybe downstream tasks better?
* Why doesn't the method look as great in the open problems benchmark as with Simulated?
* Basically is messier
* Isaac: NMF
* Have seen that it is stable
* Regularization helps a lot
* Zhaleh: why doesn't Seurat show up in some of the plots?
* Seurat doesn't provide an embedding space, so there's no way to measure silhouette score
* Zhaleh: Speed of methods?
* Depends on the type of data, Mowgli can be slow with a high number of features
* Carson (text question): In your NMF you showed that the more "interpretable" programs were ones that had cell-type specificity but is Mowgli able to find factors that are common to a few subsets of cell types in the overall dataset? like if there's a "cell process A" factor, and there are 4 (of 12) cell types in the dataset whose ground truth is high for cell process A, how would you distinguish that factor being meaningful vs nonmeaningful?
* Example of features showing up
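The point about Seurat and the silhouette score can be made concrete: the score is computed from pairwise distances between cells, so a method has to expose coordinates for each cell. A minimal plain-Python sketch follows, with made-up 2D points standing in for an embedding; benchmarks would normally use a library implementation rather than this helper.

```python
from math import dist
from statistics import mean


def silhouette_score(points, labels):
    """Mean silhouette over all points, using Euclidean distance in the embedding."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        own = [
            dist(p, q)
            for j, (q, other) in enumerate(zip(points, labels))
            if other == label and j != i
        ]
        if not own:  # a cluster with a single member scores 0 by convention
            scores.append(0.0)
            continue
        a = mean(own)  # mean distance to the point's own cluster
        b = min(  # mean distance to the nearest other cluster
            mean(dist(p, q) for q, lab in zip(points, labels) if lab == other)
            for other in set(labels) - {label}
        )
        scores.append((b - a) / max(a, b))
    return mean(scores)


embedding = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
well_separated = silhouette_score(embedding, [0, 0, 1, 1])
mixed_up = silhouette_score(embedding, [0, 1, 0, 1])
```

Labels that follow the structure of the embedding score close to 1, while labels that cut across it score below 0.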
*Mikaela Koutrouli, Eljas Röllin, Marco Varrone, Valentin Marteau, Tu Hu, Maria Puschhof, Maren Buettner, Juan Luis Cadavid Cardenas, Constantin Ahlmann-Eltze, Taobo Hu, Soufiane Mourragui, Lena Morrill, Lina, Jena Foltz, Martin Kim, Vittorioz, Zhaleh, Christian Feregiano*
* PauBadiaM: talking about pseudobulk and differential expression analysis w decoupler
* [Mikaela] Google form for feedback
* Questions on Talk
* [Tu Hu] May I get the slides?
* [Maren] How to integrate mixed effects
* Not sure if pydeseq2 supports that
* [Maren] Differences in abundance changing results of DE
* Yeah, it's a problem, other methods can work around it but can be computationally expensive
* [Tu Hu] Batch effect correction via MNN
* [Tu Hu] P-values for
* [Marco Varrone] Using prior knowledge for pathways
* Available through omnipath package
* [Valentin] Lines on volcano plots
* Log fold change shifted to zero
* [Constantin] Can occur due to small discrete values that show up in single cell, e.g. small counts
* [Jennifer] Custom marker genes?
* Just need to define a "source"
* [Mikaela] Scaling of decoupler?
* Python package is faster
* [Lina] Generation of pseudobulk – do you account for cell number differences, or does DESeq?
* Assume DESeq normalization handles this
* Have not explored though
* [Isaac] I have seen effects from this
* Pau: could regress number of cells? But pydeseq doesn't have that
* [Isaac] Is using just a mean to describe a pseudobulk enough? Or do we need at least a mean and variance?
* Pau: could be better but
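To make the pseudobulk questions above concrete, here is a minimal sketch in plain Python. It groups per-cell counts of a single gene by sample and cell type, and keeps the number of cells and the variance next to the usual sum. The sample names, cell types, and counts are made up, and this is not how decoupler implements pseudobulking.

```python
from collections import defaultdict
from statistics import mean, pvariance

# Toy data: (sample, cell type, count of one gene in one cell).
cells = [
    ("patient1", "T cell", 3),
    ("patient1", "T cell", 5),
    ("patient1", "B cell", 0),
    ("patient2", "T cell", 2),
    ("patient2", "T cell", 4),
    ("patient2", "T cell", 9),
    ("patient2", "B cell", 1),
]

groups = defaultdict(list)
for sample, cell_type, count in cells:
    groups[(sample, cell_type)].append(count)

pseudobulk = {
    key: {
        "n_cells": len(counts),           # could be used as a covariate
        "sum": sum(counts),               # the usual pseudobulk value
        "mean": mean(counts),
        "variance": pvariance(counts),    # information that the sum alone discards
    }
    for key, counts in groups.items()
}
```

Keeping `n_cells` per group makes it possible to check afterwards whether differences in cell number line up with the differential expression results.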
## topic ideas for future meetings
* genomic ranges follow-up
* out-of-memory/out-of-core access in AnnData
* awkward arrays in AnnData and their use-cases
*Giovanni Palla, Maren Buettner, Isaac Virshup, Alexander Malt, Clarence Mah, Cordata di vite, Emanuel Soda, Gregor Sturm, Kevin Lebrigand, Marco Varrone, Martin Fleischmann, Matt Rowe, Niklas Muller-Botticher, Rahul, Ross, Sebastian, Taobo Hu, Veronika, Paul Kiessling, Mikaela Koutrouli, Levi John Wolf, Emily Laubscher, Martin Kim, Wanqui Zhang, Thomas Doughetry, Shashawat Sahay, Danila Bredikhin, Lukas Heumos*
* SpatialData presentation
* [twitter thread](https://twitter.com/LucaMarconato2/status/1656239450131660800?s=20)
* Q [Maren] How does this interact with squidpy?
* [GP] Should integrate well due to same data model for anndata
* [Gregor Sturm] How does the format relate to OME-Zarr
* [GP] It's OME-Zarr + a few things, those are being expanded on
* [Gregor Sturm] How do we work with multiple patients?
* [Marco Varrone] How does this work with multiple modalities
* MuData not right now, but is a priority
* [Martin Fleischmann] Geopandas dev, what kinds of stuff do we need, what's the intersection
* [Clarence Mah]
* [Levi John Wolf] https://rtree.readthedocs.io/en/latest/ The geopandas implementation would be here: https://geopandas.org/en/stable/docs/reference/sindex.html
* Multilevel representations of polygon
* Vector Tiles
* [Mikaela] Cell type proportion
*Gregor Sturm, Ben Parks, Valentin Marteau, Maren Buettner, Jayaram Kancherla, Davide Cittaro, Martin Kim, Danila Bredikhin, Isaac Virshup, Babu Mia*
* Genomic ranges: possible implementations (bioframe vs. pyranges vs. biocpy GenomicRanges)
* presentation by Jayaram Kancherla on BiocPy [GenomicRanges](https://github.com/BiocPy/GenomicRanges)
* bioconductor datastructures are "backwards compatible at all cost"
* ArtifactDB: Schema based storage of bioformats
* + Metadata
* Goals of biocpy
* Align with artifactdb representations
* Interop between R and python
* Being a little more flexible on column types
* Nested dataframes
* a transposed anndata
* Q: Bioframe
* Q: [IV] Is Artifact Db available?
* Not yet
* Building on alabaster
* So a bunch of files, with metadata about how to read them in
* Q: [GS] Goals of biocpy
* A datastructure similar to bioconductor to get things into
* Q: [IV] How flexible on
Genomic ranges support in AnnData
* IV: bioframe is nice because it just operates on a pandas data frame. Also integration with open2c and seems pretty active
* BP: data should be easily accessible from other libraries, possibility to index important
* Nested list - grouped genomic ranges - More common in genetics, variants by point
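The core operation any of the candidate genomic ranges implementations has to provide is overlap detection between two sets of ranges. A minimal sketch in plain Python follows, assuming half-open, 0-based coordinates. The `Range` type and both helpers are hypothetical and do not reflect the API of bioframe, pyranges, or BiocPy.

```python
from typing import NamedTuple


class Range(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    name: str


def overlaps(a: Range, b: Range) -> bool:
    """True if two half-open ranges on the same chromosome share at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def overlap_pairs(left, right):
    """All overlapping (left.name, right.name) pairs; quadratic, fine for a toy example."""
    return [(a.name, b.name) for a in left for b in right if overlaps(a, b)]


peaks = [
    Range("chr1", 100, 200, "peak1"),
    Range("chr1", 500, 600, "peak2"),
    Range("chr2", 100, 200, "peak3"),
]
genes = [
    Range("chr1", 150, 400, "geneA"),
    Range("chr1", 600, 900, "geneB"),  # starts exactly where peak2 ends: no overlap
    Range("chr2", 50, 120, "geneC"),
]

pairs = overlap_pairs(peaks, genes)
```

Real implementations avoid the quadratic comparison by sorting or indexing the ranges, which is where the point about indexing raised above comes in.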
*Davide Cittaro, Gregor Sturm, Isaac Virshup, Lukas Heumos, Emma Dann, Adam Gayoso, Giovanni Palla, Cordate de Vite, Can Ergen-Behr*
* Flash talk by Davide Cittaro: [schist](https://schist.readthedocs.io/en/latest/tutorials.html)
* Hackathon recap
* Identifying cell groups
* Stochastic block model
* Nested formulation overcomes finding small communities in large graphs
* Similarity to CellRank
* "cell marginals" similar to transition probability matrix
* Multiomic applications
* Can use multiple graphs for multiomic integration
* Use optimal transport for calculation of feature / feature dissimilarity, example of ATAC seq
* Bootstrapping on multiple subsamples
* GPU – possible, but not yet
* [Q: IV] How to get people to use this?
* Speed biggest current barrier
* [Q: GP] Is it time or memory bound?
* Can be pretty memory efficient, but limited by scaling out
* [GP] Any optimizations/ approximations
* It's a variational inference problem
* [Q: GP] How about metacells for speed up
* A metacell is kinda like a block
* But looking at dropping the whole knn graph, using a bipartite graph of genes to cells
* [Q: IV] What is the API for interacting with the tree of
* Currently only work with one level of hierarchy
* [Q: IV] is performance slow once, or multiple times
* Throughout analysis
* Priors can really cut down
*Adam Klie, Gregor Sturm, David Laub, Valentin Marteau, Maren Buettner, Trevor Manz, Sebastian Lobentanzer, Mark Keller, Danila Bredikhin, Emma Dann, Martin Kim, James Cranley, Garrett Ng, Jan Engelmann, Alina Frolova, Laura Martins, Davide Cittaro, Can Ergen-Behr, Andrew McCluskey, Kazumasa Kanemaru*
* Genomic Ranges (presentation by Qi An)
* Accessing genomic assembly metadata via SQL tables (IV)
* E.g. https://gist.github.com/ivirshup/9cb994e473f9f6f1325a361195d4d7a6#file-transcript_models_duckdb-py
* Ranges in chame ([notebook](https://github.com/gtca/chame/blob/main/docs/examples/ranges.ipynb) demo by DB)
* An Qi – DKFZ Heidelberg
* Genomic range usage
* Future issues:
* Categorical bug
* [IV & DB] Should work, may be a problem on our side
* Peaks called from samples not overlapping
* [IV] Alternatives? Bins
* [DC] Use pre-existing annotations
* [MB] Alternative to macs2: genrich
* IV: Package for genome annotations based on bioconductor sqlite files
* Maren - ChIPseeker annotate chip seq peaks
* annotation source? II: pulls from ensembl
* Translating ids is quite useful
* Would also like to work with regulatory info
* ensembl regulatory information
* Distance to genes
* Sebastian Biocypher
* Knowledge graph connecting genomic annotation information
* Adapters to existing databases
* Demo on chame
* ED: solutions for plotting?
* e.g. "genome browser views"
* DB: there's an IGV jupyter widget https://github.com/igvteam/igv-notebook
* DC: deeptools https://pygenometracks.readthedocs.io/en/latest/
* AQ: [gosling](http://gosling-lang.org/examples/)
* higlass tools
* Viewing gene information would be nice
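The ideas above about keeping genome annotations in SQL tables and asking for the distance to genes can be sketched with Python's built-in `sqlite3`. The linked gist uses DuckDB; SQLite is swapped in here only to keep the example dependency-free. The table layout, gene symbols, coordinates, and the `nearest_gene` helper are all made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE genes ("
    "gene_id TEXT, symbol TEXT, chrom TEXT, start_pos INTEGER, end_pos INTEGER)"
)
con.executemany(
    "INSERT INTO genes VALUES (?, ?, ?, ?, ?)",
    [
        ("G1", "geneA", "chr1", 1000, 2000),
        ("G2", "geneB", "chr1", 5000, 7000),
        ("G3", "geneC", "chr2", 1500, 1800),
    ],
)


def nearest_gene(chrom, start, end):
    """Return (symbol, distance) of the closest gene on `chrom`, or None.

    Coordinates are half-open; a distance of 0 means the query overlaps the gene.
    """
    query = """
        SELECT symbol,
               MAX(0, start_pos - :q_end, :q_start - end_pos) AS distance
        FROM genes
        WHERE chrom = :chrom
        ORDER BY distance
        LIMIT 1
    """
    params = {"chrom": chrom, "q_start": start, "q_end": end}
    return con.execute(query, params).fetchone()


upstream_hit = nearest_gene("chr1", 2500, 2600)
overlapping_hit = nearest_gene("chr1", 5500, 5600)
no_hit = nearest_gene("chr3", 0, 100)
```

The same query shape should carry over to an annotation database shipped as a SQLite file, provided the column names are adapted to that file's schema.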
*Gregor Sturm, Alina Frolova, An Qi, Divyanshu Srivastava, Andrew McCluskey, Philipp Angerer, Lukas Heumos, Valentin Marteau, Isaac Virshup, Adam Gayoso, Martin Kim, Mikaela Koutrouli*
* [Functional Associations using Variational Autoencoders (FAVA)](https://github.com/mikelkou/fava) (Flash talk by Mikaela Koutrouli)
* Separate IO packages (GS)
* issue for scirpy: https://github.com/scverse/scirpy/issues/385
* was also being discussed for scanpy + other modalities
* Pytorch data loaders (IV)
* scverse example datasets (GS)
* FAVA (Flash talk by Mikaela Koutrouli)
* limitation of current protein-protein interaction networks: biased source, only contains information about well-studied proteins
* --> reconstruct them from OMICS data
* naive approach: co-expression (pairwise correlation) of proteins
* Q isaac: how to get from latent space back to genes?
* Recommended data source [DepMap](https://depmap.org/portal/)
* IO package
* central package or split up by modality?
* Genomic ranges in anndata
* An Qi – Presenting next meeting
* currently using bioframe
* danila: chame ([example](https://github.com/gtca/chame/blob/main/docs/examples/ranges.ipynb))
* pytorch dataset
* map style: numpy, iter style
* AnnData field validation (abstract )
* Biggest barrier is out of core at the moment
* First example could be "one anndata on disk" -> pytorch data loader
* How important is shuffling?
* Theoretically quite important, but not many concrete examples
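The map-style versus iterable-style distinction and the shuffling question above can be illustrated without importing PyTorch. In `torch.utils.data`, a map-style dataset implements `__getitem__` and `__len__`, while an iterable-style dataset implements `__iter__`. The two classes below follow those protocols in plain Python; their names and the chunked shuffling strategy are illustrative assumptions, not an AnnData or PyTorch API.

```python
import random


class MapStyleCells:
    """Map-style: random access by index via __len__ and __getitem__."""

    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        return self.rows[index]


class IterStyleCells:
    """Iterable-style: sequential access only, as when streaming chunks from disk."""

    def __init__(self, rows, chunk_size, seed=0):
        self.rows = rows
        self.chunk_size = chunk_size
        self.seed = seed

    def __iter__(self):
        rng = random.Random(self.seed)
        for start in range(0, len(self.rows), self.chunk_size):
            chunk = list(self.rows[start:start + self.chunk_size])
            rng.shuffle(chunk)  # shuffles only within the chunk held in memory
            yield from chunk


rows = list(range(10))  # stand-in for the rows of an on-disk matrix

# A full shuffle needs random access, which the map-style dataset provides.
map_style = MapStyleCells(rows)
order = list(range(len(map_style)))
random.Random(0).shuffle(order)
fully_shuffled = [map_style[i] for i in order]

# The iterable-style dataset can only shuffle locally.
chunk_shuffled = list(IterStyleCells(rows, chunk_size=5))
```

With out-of-core data, random access per row tends to be the expensive part, which is why the importance of a full shuffle matters for the design.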
*Gregor Sturm, Giovanni Palla, Dylan Lam, Luis Omar Correa, Clarence Mah, Danila Bredikhin, Mario Acera, Lukas Heumos, Isaac Virshup, Valentine Svensson, Haarshaadri jp, Hugh Warden, Luca Marconato*
* Bentotools (Flash talk by Clarence Mah)
* State of genomic ranges (IV) (https://github.com/scverse/anndata/issues/624)
* Bentotools – https://github.com/ckmah/bento-tools
* Clarence Mah – UCSD
* RNA biology, especially organization of RNA within a cell
* New updates:
* RNAFlux – Find subcellular domains with local expression
* Spatial domains defined by clustering, in this example defined by SOM
* [GP]: Ripley statistics in squidpy assume evenly distributed points. What's available in python?
* [CM]: Currently using colocation analysis metrics quite heavily
* [IV] How has performance been going with respect to dataset-size?
* Will be a problem going forward!
* Genomic ranges
* GS: can awkward arrays represent ranges?
*Danila Bredikhin, Babu Mia, Gregor Sturm, Giovanni Palla, Jesko Wagner, Isaac Virshup*
* Testing against release candidates (isaac)
* MuData in squidpy?
* DB: not quite
* GS: Spatial Airr data, visium based
* IV: How do people want to do spatial multimodal integration?
* IV: Testing against release candidates
* GS, IV: Can we remove the red X for a test that's okay to fail
* doesn't seem like this is possible. Closest thing that could work: return success always, and automatically add a comment that states the optional tests failed.
* IV: Limitation – see if new release breaks downstream packages
* Cron job? Stops after 60 days
* GP: hosting of data with aws credits
* IV: yes!
* Distribution may need more thought, but lets see how much it costs
*Florian Heyl, Isaac Virshup, Philipp Angerer, Clarence Mah, Jesko Wagner & Hugh Warden, Gregor Sturm, Giovanni Palla, Tim Treis, Kevin Lebrigand, MD Babu Mia, Anna Diamant, Andrew McCluskey, Adam Gayoso, Laurens Lehner*
* Jesko: lightning talk
* Clarence: bento-tools’ SpatialData compatibility
* Context: According to the [documentation](https://spatialdata.scverse.org/en/latest/), beta is coming soon; bento-tools is very interested in refactoring to use SpatialData instead of heavily modded AnnData to support its data structures. SpatialData meetings would be more appropriate to discuss this, but unable to attend (1AM PST).
* Isaac: scanpy GPU support – what does this API look like?
* Context: [post on scverse zulip](https://scverse.zulipchat.com/#narrow/stream/316218-repo-management/topic/sklearn.3A.20compute.20backends/near/326834041) and ["Patterns for GPU support" conversation on the Scientific Python discord](https://discord.com/channels/786703927705862175/1072596591363489792)
* Isaac (optional): Benchmarking setup for core projects, how do we run this?
* Context: [Using asv](https://asv.readthedocs.io/en/stable/), how do we call for benchmarking runs on private infrastructure? We will have a server (with GPUs) to run benchmarks + a portal server to receive requests
* Gregor (optional): Integration tests: Leveraging GitHub actions' `cron` feature to regularly test core packages against development versions of dependencies?
* Jesko talk Drug Discovery from High Content Images
* ![overview slide](https://i.imgur.com/J0bjRzK.png)
* Single cell analysis w/ high content images
* 10-100m treated/ stained cells
* Feature extraction w/ squidpy, cellprofiler
* GP: How are you working with the images? Just features in the anndata?
* Not doing as much with the images once they hit anndata
* Loading data into a db? Maybe, but not so much of a problem yet
* GP will put in contact with *fractal(?)* developers
* IV: How does cell segmentation/feature extraction work?
* currently not custom solution, relying on cellprofiler/squidpy
* CM: Bento-tools + spatial transcriptomics
* Interested in using spatial data, but haven't yet
* Particular needs:
* Spatial subsetting
* Big usage of polygons
* GS: Differences from squidpy?
* Subcellular + mrna localization
* KL: clarification of use cases
* colocalization of transcripts, proximity of transcripts to cellular compartments
* IV: How do you identify subcellular features
* GP: FISHfactor – featurizing subcellular distribution
* TT: https://www.biorxiv.org/content/10.1101/2021.11.04.467354v1
* CM: Currently looking more for known features than taking an unsupervised approach, known features like lack of transportation
* IV: GPU backends for scanpy
* TODO: link to slides
* which backends to support? RAPIDS, pytorch, jax
* What should be the priorities?
* Different models for GPU support
* Integration into scanpy
* separate package that mimics the API
* plugin system that adds functionality to scanpy
* PA: Activating GPU support should always be explicit (explicitness includes operating on an implementation-specific data structure like CuArray from RAPIDS)
* AG: Scanpy is just a vendor of other libraries for basically all algorithms.
* PA: Not everything needs to be GPU: current numba use cases could be e.g. Rust instead. This would lower the maintenance burden of multiple GPU backends as fewer functions would need it.
* GS: Could move from separate packages (explorative) to plugin system
* IV: libraries that have done good support for both cpu and gpu?
* CM: [TensorLy](http://tensorly.org/stable/index.html)
* AG: [Aesara](https://aesara.readthedocs.io/en/latest/)
* AG: what about out-of-memory data?
* IV: rapids has good integration with dask
* GS: Testing against dev versions of upstream packages
* nightly builds to catch dependencies breaking our package?
* use `pip install --pre` to catch that early?
* TODO: figure out how notifications for nightly builds work?
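One way to read the "explicit GPU support" and "plugin system" points from the GPU backend discussion above is type-based dispatch: the caller opts in by passing a backend-specific array type, and backends register their own implementations. The sketch below uses `functools.singledispatch` from the standard library. `CpuArray`, `GpuArray`, and `scale_to_sum` are hypothetical stand-ins, no GPU is involved, and this is not an existing scanpy API.

```python
from functools import singledispatch


class CpuArray(list):
    """Stand-in for a CPU array type."""


class GpuArray(list):
    """Stand-in for a device array type; hypothetical, no GPU is involved."""


@singledispatch
def scale_to_sum(data, target_sum=1.0):
    # Fallback for types without a registered backend.
    raise TypeError(f"No backend registered for {type(data).__name__}")


@scale_to_sum.register
def _(data: CpuArray, target_sum=1.0):
    total = sum(data)
    return CpuArray(x * target_sum / total for x in data)


@scale_to_sum.register
def _(data: GpuArray, target_sum=1.0):
    # A real plugin would call into a GPU library here.
    total = sum(data)
    return GpuArray(x * target_sum / total for x in data)


cpu_result = scale_to_sum(CpuArray([1.0, 3.0]))
gpu_result = scale_to_sum(GpuArray([1.0, 3.0]))
```

Because the result type follows the input type, data never moves to another backend implicitly, and a separate package can add its own registration without changes to the core function.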
*Valentine Svensson, Gregor Sturm, Jesko Wagner, Philipp Angerer, Danila Bredikhin, Giovanni Palla, Harald Voehringer, Noorsher Ahmed (Bento tools), Isaac Virshup*
* What do we want out of these meetings? (IV)
* Recap of progress since launch (cookiecutter, ecosystem listing, paper, ...) (GS)
* Tutorials page (https://github.com/scverse/scverse-tutorials/issues/43) (GS)
* What do people want out of this meeting
* JW: Wants to know plans for scverse, where things are headed
* DB: Feedback from the community
* Synchronous catch up on issues
* JW: lightning talks
* Working on a number of imaging problems, could talk about
* NA: Keeping up on development with
* VS: Updates on current priorities
* suggests to always start meetings with short recap of current developments
* HV: Policies for scverse projects, best practices
* IV: https://github.com/scverse/ecosystem-packages
* Recap of projects (GS)
* IV: Perturbation data
* JW: Thinks things are generally do-able
* VS: Estimating effect size, fancier linear models
* Currently using zellkonverter, talking to lme4, brms
* brms generates stan code
* GP: pyMC?
* has used this in the past
* But brms model language is great – special additions for splines, hierarchical models, spatial correlation
* Systems of models
* IV: pyDESeq2
* VS: Currently does simple models, group vs group
* GS: There are PRs for more complex models
* VS: Package for extracting model fits
* IV: Genetics data
* VS: Datastructures
* GP: Person at Helmholtz who we should follow up with
* GP: Can we just call R?
* JW: Can we call julia/ how is julia GPU support?
* IV: Stats great, GPU not
* VS: Has used MixedModels.jl, liked it
* GS: Tutorials for scverse packages
* How do we select tutorials, where does this go
* VS: suggestion, put links to tutorial sections/ use cases clearly visible on landing page
* GS: Overview page on site, then separate sphinx site
* NA: Would like to have bento-tools visible here, but also how
* NA: similar to ecosystem checklist, have tutorial checklist
* GS: see https://github.com/scverse/governance/pull/44