scverse community meeting notes

---
tags: scverse, meeting, community
---

# scverse community meeting notes

**scverse community meetings have an open agenda. Feel free to add any points you'd like to discuss ahead of or during the meeting!**

We usually start off with a ~5min flash talk about new ecosystem packages or other recent developments in the community. If you would like to present at the meeting, please reach out to Mikaela Koutrouli via the [scverse Zulip](https://scverse.zulipchat.com/).

* [Calendar event](https://calendar.google.com/calendar/event?action=TEMPLATE&tmeid=bWFiM25mM3VuNWYzazBhZWY4bm1nbWhzMmdfMjAyMzAyMjFUMTcwMDAwWiB0b285cnZzcTB1ODI4djQ3YnQyam1hdThna0Bn&tmsrc=too9rvsq0u828v47bt2jmau8gk%40group.calendar.google.com&scp=ALL) | Repeats every 2nd Tuesday at 6pm CET
* The Zoom link changes and will be posted in [Zulip](https://scverse.zulipchat.com/#narrow/stream/315185-general/topic/scverse.20community.20meetings) ~~[Zoom link](https://zoom.us/j/94079118063?pwd=bjhEaEpUUllCTUVSK0VTS2g0M3NQZz09)~~
* All attendees are expected to follow the [scverse code of conduct](https://scverse.org/about/code_of_conduct/)

## 2024-04-16

### Attendees

*Mikaela Koutrouli, Dongze He, Hong Jiang, Igor Martayan, Javier Marchena Hurtado, Jennifer Foltz, Mariela Cortes Lopez, Martin Kim, Paul Kiessling, Sikander Hayat, Lukas Heumos*

### Agenda

* scverse conference advertisement
* Dongze He will talk about [simpleaf](https://github.com/DongzeHE/simpleaf), a Rust framework to make using alevin-fry even simpler.

### Notes

RECOMB conference is taking place in pa

* Q [Javier] Differences with cellranger
    * Small differences, but for the actual counts it uses less memory and runs faster. They tried their best to avoid caveats due to alignment. As accurate as STARsolo. Simple count matrix generation. The differences are in performance (time and memory).
* In your accuracy comparison, why do you compare to STARsolo and not cellranger?
    * STARsolo is the faster implementation of cellranger; cellranger internally uses STAR.
* Q [Paul] Problem: chemistry free from 10x, but barcodes are 26 bases low.
    * 10x version v3 is 12 bases less. Either the wrong version or a few bases are missing. Chemistries are only for the barcoding, not the actual mapping.
* Feature requests -> 10x Flex
* Q [Mikaela] Next steps?
    * Genome-based aligner STARsolo. They are working on making STARsolo faster. There are a lot of unsolved problems in the processing -> ambiguous counts (spliced or unspliced). People are now working on resolving the ambiguity -> combine the spliced and unspliced matrices into a uniform representation of the data. The STAR group is improving the maximization algorithm. Improve the bulk algorithm. Modalities for the same cell -> make sure the algorithms agree. ISMB: solve the ambiguity in single cells.

## 2024-04-16

### Attendees

*Mikaela Koutrouli, Doron Haviv, Tim Treis, Alex Leung, Dana Pikulska, Hong Jiang, Jennifer Foltz, Martin Kim, Morgane Fierville, Quentin Blampey, Selman Ozleyen, Phillip Weiler, Sebastian Vanuytven, Sören Becker, Wojciech Lason*

### Agenda

* scverse conference advertisement
* Doron Haviv talks about [ENVI](https://github.com/dpeerlab/ENVI)

### Notes

* Q [Tim Treis] You showed a time vs sample graph -> is 1 "sample" on the x-axis 1 cell?
    * The number of k-nearest neighbors does not seem to affect the results, but it should still be tuned manually.
* Q [Alex Leung] Can this only work if the scRNA experiment and the spatial experiment are complementary? Could we infer / impute data if I only have a spatial experiment and want to use this with human atlas data?
    * The datasets are not exactly complementary but representative of the same context. It looks the same. Not sure if the atlas is very different, but they haven't tried. It doesn't have to be 1:1.
* Q [Dana Pikulska] Could you please elaborate a bit on how the imputation correlation was calculated?
    * Computer vision inspired: MSSIM --> looks at gene expression at multiple different levels of the data and gives another similarity score.
    * Both metrics gave similar results, but ENVI was better.
* Q [Dana Pikulska] has been testing and has issues with RAM.
    * You need to add a batch key because otherwise all the data seem to be from one sample. You should run it individually for each sample.
    * He will add this hyperparameter.
* Q [Dana Pikulska] Spearman correlation was also bad. Do you think that it has to do with the tissue?
    * Mouse brain and mouse embryo intestine (?). The reason for using brain data is that it is easy to image.
* Q [Dana Pikulska] Some gene expression is very low (in less than 10% of the cells in a cell type), but after ENVI the expression is not low. Why?
    * When we run this we try to make it as broad as possible. The only preprocessing was removing MT genes and filtering doublets; otherwise the datasets are used as they are. That probably leads to some bias even though they are trying to avoid it, but imputation depends a lot on the quality of the dataset. ENVI doesn't assume that the data are log-transformed. When the ENVI score is high, the harmony score is high etc. --> some factors affect the performance.
* Q [Dana Pikulska] Does the number of counts play a role?
    * Not a specific number. One of the datasets (brain) is very small, old and sparse. We used a different model distribution.
* Q [] Why is it important to infer the genes not in your panel? I thought with 1k+-plex panels that are starting to become available, could you not just work on the intersection of your single cell assay and spatial assay genes?
    * I think the imputation is outdated. If you are super interested in a specific expression, just add it to your panel. The novelty of
* Q [Anonymous] How big of a focus does ENVI put on selling the imputation aspect? There are quite a few more methods which you could compare against that also use VI and perform reasonably well, like MEFISTO
    * Lots of methods are doing that. Lots of work. There are lots of methods we could compare to, but it depends on datasets, etc. and you cannot test everything.
      No big focus on selling the imputation.
* Q [Selman Ozleyen] Do you plan on sharing the code you used to get the imputation correlation results in the paper? It would also be good to see how the datasets were processed to reproduce the results.
    * Sure.

## 2024-03-19

### Attendees

*Mikaela Koutrouli, Constantin Ahlmann-Eltze, Dmitrii Severinov, Severin Dicks, Raphael Heilig, Alex Fernandes, Mariia Minaeva, Jennifer Foltz, Valentine Svensson, Javier Marchena Hurtado, Wojciech Lason, Vivien Goepp, Anna Vathrakokoili Pournara, Can Ergen, Aidin Foroutan, Christoph Kuppe, Mariela Cortes Lopez, Paul Kiessling*

### Agenda

* Constantin Ahlmann-Eltze talks about [LEMUR](https://github.com/const-ae/lemur)

### Notes

* [] Poll: scikit-learn objects vs AnnData with extra fields vs subclass of AnnData --> Winner: AnnData with extra fields
* [Dmitrii Severinov] Low dimensionality: how? And how fast/scalable is LEMUR? 20k cells? 100k?
    * The latent space is a linear subspace, which captures variation for each condition (each row is a condition). E.g. for glioblastoma, it fits a separate PCA with 20-50 dims to the control conditions and the treated conditions --> linear. Thus, it is very scalable. It uses SVD functions; there is a pull request for GPUs to make it even more scalable. Zebrafish has 700k cells --> runs on a cluster, quite fast. PCA fits. The most challenging part is the differential expression predictions because they are dense matrices. With millions of cells, that becomes the problem.
* [Can] Can you correct for covariates? Additional covariates that you care about.
    * Yes, it follows the idea of linear models. You can just include the extra covariates in the obs field. And for the differential expression test you compare the two levels you care about.
* [Can] When you do confirmation using pseudobulk, what about the p-value?
    * The R package has an argument for the minimum neighborhood size, to avoid the effect that at some point there will be so few cells that it won't be significant. It is not so common that you have super strong differential expression values. Not yet seen clusters with 50 cells.
      Interesting though to check further.
* [Mikaela] VAEs? Non-linear.
    * Better predictions because of non-linearity. Conceptually, what does it mean if you have a subspace? How do I then do the mapping? CPA and scGen lose a bit of the interpretability. Little tuning for my model. Much better performance than CPA.
* [Dmitrii Severinov] scRNA-seq data from several timepoints and several cell types. Experimental protocol (...) zebrafish example used splintz(???). You can do it at the pseudobulk level, pyDESeq also. Gradients between different cell types.

## 2024-03-05

### Attendees

*Emma Dann, MD Babu Mia, Sarah Castro, Olympia Hardy, Hong Jiang, Arda Karaoglu, Jin Wong, Jennifer Foltz, Francesca Drummer, Daniel Dimitrov, Jens Mayer, Anna Schaar, Natacha Comandante, jiri, Jacopo Munaretto, Olli Dufva, Tianyi Lu, Ori Moskovitz, Vivien Goepp, Rui Sun, Dimitrios Kyriakis, Peter Li, Wojciech Lason, Carla Ares, Lucie Marie Hasse*

### Agenda

* Robert Petryszak talks about [CellphoneDB](https://github.com/ventolab/CellphoneDB) and [CellphoneDB Viz](https://github.com/ventolab/cellphonedbviz)

### Notes

* [Hong Jiang] Can we consider not just mean expression in cts, but weight by relative levels of expression of ligand and receptor in the same cell?
* [Hong Jiang] Are there ways to append custom l-r pairs to the existing database?
    * No
* [Emma] You mentioned cell type abundance imbalances are an issue for statistical testing. What's the solution? Simply downsampling?
    * No unified solution yet; warnings could be incorporated in analysis functions.
* [Sarah Castro] Can CellphoneDB be run considering replicates? I.e. which cell comes from which sample?
    * Currently no. One workaround is to aggregate data by cell type and sample if the number of samples is large.
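
The workaround mentioned in the last answer (aggregating by cell type and sample) could look roughly like the sketch below. It is only an illustration in plain pandas, not CellphoneDB's own interface: the function name, the `sample` and `cell_type` column names, the `min_cells` cutoff and the choice of the mean as the aggregate are all assumptions that the notes do not specify.

```python
import pandas as pd


def aggregate_by_sample_and_cell_type(expr, obs, min_cells=1):
    """Average expression over all cells sharing a (sample, cell type) pair.

    expr: DataFrame, cells x genes (assumed layout).
    obs:  DataFrame indexed like expr, with "sample" and "cell_type"
          columns (assumed column names).
    """
    keys = [obs["sample"], obs["cell_type"]]
    grouped = expr.groupby(keys, observed=True)
    means = grouped.mean()    # one row per (sample, cell type) pair
    n_cells = grouped.size()  # number of cells behind each row
    # Drop groups with too few cells (hypothetical cutoff).
    keep = (n_cells >= min_cells).to_numpy()
    return means[keep], n_cells[keep]


# Toy data: five cells, two genes, two samples.
expr = pd.DataFrame(
    {"geneA": [1.0, 3.0, 0.0, 2.0, 4.0], "geneB": [0.0, 2.0, 4.0, 2.0, 6.0]},
    index=["c1", "c2", "c3", "c4", "c5"],
)
obs = pd.DataFrame(
    {
        "sample": ["s1", "s1", "s1", "s2", "s2"],
        "cell_type": ["T", "T", "B", "T", "T"],
    },
    index=expr.index,
)
means, n_cells = aggregate_by_sample_and_cell_type(expr, obs, min_cells=2)
print(means)
```

Each row of `means` then stands for one sample and cell type, so sample membership is preserved in the aggregated table. Whether a mean, a sum of raw counts or something else is the right aggregate depends on the downstream method.
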
## 2024-02-20

### Attendees

*Phil Angerer, Mikaela Koutrouli, Nick Peterson, sbhatara, Mark Keller, Louis El Khoury, Jan Otoničar, Daniel Dimitrov, Shirley Dixit, rico, Vitalii Kleshchevnikov, Jennifer Foltz, Jeffrey Pullin, Roshan, Natacha Comandante-Lou, IH Lin, Little Stamp, Liang Ding, Sen Guo, Tim Treis, Martin Kim, Dimitris V*

### Agenda

* Vitalii Kleshchevnikov talks about [cell2location](https://github.com/BayraktarLab/cell2location)

### Notes

* [Natacha Comandante-Lou] In constructing the snRNAseq reference, which of the following is recommended:
    - pool all cells from all samples together, then train each tissue slide with this same reference, or
    - use the matching sample to create a reference for each sample, then train each tissue slide with its corresponding reference?
    * Vitalii: Depends on the size of your data, or the biological question. In case you have thousands of cells, better pool all cells from all samples together, then train each tissue slide with this same reference -> computationally
    * Vitalii: If you have different patients, then separate them; otherwise just pool them. You get more power in your analysis.
* [Mikaela Koutrouli]: What if you don't have the image of the spatial data (e.g. slide-seq)?
    - Vitalii: It's tricky if you don't have the anatomical area. You can rely on similar tissue cartoons like the existing ones; otherwise you have to use landmark cell types. We find it beneficial to have histology images and not to rely on clustering by NMF or Leiden clustering. We should try having the tissue image.
* [Mikaela Koutrouli]: What are you planning to use the CCI for?
    - Vitalii: You have RNA abundance but you don't know the intermediate steps. You get distance functions between cells that have receptors and cells that have signals. We need models that predict protein abundances around cells that have receptors from the distribution of signals, and then a model that predicts occupancy.
## 2024-02-06

### Attendees

*Mikaela Koutrouli, Isaac Virshup, Daniel Dimitrov, Emma Dann, Max, Zijian Fang, Christian, Nathan, Valentin Marteau, Mariia Minaeva, MD Babu Mia, Daniel Lopez, Tim Treis, Inacio Medeiros, Neha Daga, Jennifer Foltz, Sara, Natacha Comandante-Lou, Likasz Boryn, Khalique Newaz, Lukas Heumos, Lukas Mahieu, Paul Maurizio, Vasilis Konstantakos, Hong Jiang, Martin Kim, Niko Fleischer, Bastienne Rehor, Marco Varrone, Merel Kuijs, Jim Koumadorakis, Haoye Yang, Daneile Bottazzi, Mark Keller, Saad Khan, Mohammed Zidane, Calogero Carlino, Darius*

### Agenda

* Daniel Dimitrov will discuss multi-condition cell-cell communication using [LIANA+](https://github.com/saezlab/liana-py)

### Notes

* [Tim Treis] How are the factors being determined? MEFISTO?
    * Just the NMG step
* [Hong Jiang] How do we set the number of factors?
    * There's a heuristic for choosing the number of factors
    * Follow up: is it worth using a specific set of features?
        * Can use highly variable features
* [Natacha Comandante]
* [Emma Dann] How does LIANA model multiple factors, testing the difference between source and target cells vs changes in factors?

## 2023-12-12

### Attendees

*Mikaela Koutrouli, Isaac Virshup, Felix Fischer, Anna Schaar, Jennifer Foltz, Valentine Svensson, Davide Cittaro, Can Ergen-Behr, Taobo Hu, Patrick Hanel, Paul Maurizio, Marco Varrone, Vincent Van Deuren, Jeremie Kalfon*

### Agenda

scTab – Felix Fischer

### Notes

* Q [Can Ergen-Behr]: Do sampling strategies affect the output, specifically on seen cell type determination?
    * Cell types are actually based on given labels from cellxgene, so there shouldn't be an effect on detection
* Q [Can]: Does model fitting need a fancy GPU?
    * Should be okay with a smaller GPU
* Q [Marco] Have you looked at fine tuning for spatial?
    * No, but have significant doubts
* [Davide] Is there a minimum per cell type per
* [Valentine] Does the gene list need to be standardized?
    * Yes

## 2023-10-03

### Attendees

*Severin Dicks, Isaac Virshup, Mikaela Koutrouli, Lukas Heumos, Giovanni Palla, Eljas Roellin, Maren Büttner, Cebrail, Illaria Billato, Sri Varra, Jennifer Foltz, Vincent Van Deuren, Davide Cittaro*

### Agenda

rapids-singlecell

### Notes

## 2023-09-05

### Attendees

*Boris, Constantin, Gregor, Lukas, Anna, Tim Treis, Ori, Eljas, Babu Mia, Severin, Kveta, Zhaleh Safikhani, Francesca Drummer, Martin Kim, Taobo Hu, Phil Angerer, Jie Quan, Danila Bredikhin, Matilde Conte, Anna, Paul Johnston*

### Agenda

* https://github.com/owkin/PyDESeq2

### Notes

#### Questions

* [Tim]: Are you looking for exact results? Why are there differences?
    * Differences are from underlying dependencies, e.g. scipy optimizers for pydeseq2
    * Different choice in calculation of dispersion
    * DESeq thresholds
* [Constantin] Are multiple covariates supported?
* [Isaac] How did you find making a statistical library like this in python?
    * C++ code in DESeq2 around GLMs was hard to replicate
    * Hardest part was identifying exactly how DESeq does outlier detection
    * Most wished there was a more scalable way to fit GLMs
* [Gregor] How are you testing? What about updates?
    * Breaking changes have come from dependencies `glmGamPoi`
    * Plan to update
* [Babu Mia] Do you want to add more / go beyond DESeq2?
* [Isaac] How does outlier detection work?
    * Cook's distance
    * Only

## 2023-07-25

### Attendees

### Agenda

* Talk by Leah Wasser, pyopensci

### Notes

* [Isaac] What are the bounds on package size?
    * numpy too big
    * but willing to go a little smaller than JOSS requirements
* [Phil]: this stuff is great! Especially docs
* [Gregor] How is finding reviewers?
    * Can be hard for more esoteric / not covered topics
    * https://docs.google.com/forms/d/e/1FAIpQLSeVf-L_1-jYeO84OvEE8UemEoCmIiD5ddP_aO8S90vb7srADQ/viewform
* [Gregor] How is continuing review?
    * Still in progress, but will have automated metric collection
    * What happens if standards change?
    * https://github.com/scverse/ecosystem-packages#checklist-for-adding-packages
* [Isaac] How is the astropy transition?
    * Still proof of concept, starting with new packages
    * Will probably go back later
* [Isaac] rOpenSci was an inspiration?
    * Cohesive feeling in community
    * Kartik of ropensci initially
* [Isaac] Where do we go to follow up?
    * slack, but closed so only a few can join
    * discourse less active
    * [Phil] recommends https://discord.com/invite/pypa
* Isaac: Where can we follow what you're up to?
    * https://github.com/pyOpenSci/pyopensci.github.io/pull/207
    * https://github.com/pyOpenSci/software-peer-review/issues/226
    * Pangeo partnership page: https://www.pyopensci.org/software-peer-review/partners/pangeo.html

## 2023-07-11

### Attendees

*Mikaela Koutrouli, Isaac Virshup, Gregor Sturm, Zoe Piran, Paul L Maurizio, Danila Bredikhin, Taobo Hu, Jennifer Foltz, Evan Lyall, Philipp Angerer*

### Agenda

* Zoe Piran: [Biolord - biological representation disentanglement](https://biolord.readthedocs.io/en/latest/)

### Notes

#### Questions:

* Mikaela – go over classifiers again
* Gregor – What were the points in the embeddings
* Jennifer – Can you use multiomics for biolord classify
    * Yes for embedding, not for imputing the other modality

## 2023-06-27

### Attendees

*Geert-Jan Huizing, Mikaela Koutrouli, Gregor Sturm, Valentin Marteau, Eljas, Maria Puschhof, Zoe Piran, Lorenzo Merotto, Carson Poltorack, Jennifer Foltz, Ligan Zhu, Dominik Klein, Zhaleh, Danila*

### Agenda

* Geert-Jan Huizing will talk about Mowgli, a method for the integration of paired multi-omics data

### Discussion points:

### Notes

Questions

* Gregor: how well does this scale out
* Isaac: any ideas for how we can
better benchmark these kinds of methods
    * It's hard
    * Maybe downstream tasks better?
* Zhaleh
    * Why doesn't the method look as great in the open problems benchmark as with simulated data?
        * Basically it is messier
* Isaac: NMF
    * Have seen that it is stable
    * Regularization helps a lot
* Zhaleh: Why doesn't Seurat show up in some of the plots?
    * Seurat doesn't provide an embedding space, so there's no way to measure the silhouette score
* Zhaleh: Speed of methods?
    * Depends on the type of data, Mowgli can be slow with a high number of features
* Carson (text question): In your NMF you showed that the more "interpretable" programs were ones that had cell-type specificity, but is Mowgli able to find factors that are common to a few subsets of cell types in the overall dataset? Like if there's a "cell process A" factor, and there are 4 (of 12) cell types in the dataset whose ground truth is high for cell process A, how would you distinguish that factor being meaningful vs nonmeaningful?
    * Example of features showing up

## 2023-06-13

### Attendees (23)

*Mikaela Koutrouli, Eljas Röllin, Marco Varrone, Valentin Marteau, Tu Hu, Maria Puschhof, Maren Buettner, Juan Luis Cadavid Cardenas, Constantin Ahlmann-Eltze, Taobo Hu, Soufiane Mourragui, Lena Morrill, Lina, Jena Foltz, Martin Kim, Vittorioz, Zhaleh, Christian Feregiano*

### Agenda

* PauBadiaM: talking about pseudobulk and differential expression analysis w/ decoupler

### Notes

* [Mikaela] Google form for feedback
* Questions on Talk
    * [Tu Hu] May I get the slides?
    * [Maren] How to integrate mixed effects
        * Not sure if pydeseq2 supports that
    * [Maren] Differences in abundance changing results of DE
        * Yeah, it's a problem, other methods can work around it but can be computationally expensive
    * [Tu Hu] Batch effect correction via MNN
    * [Tu Hu] P-values for
    * [Marco Varrone] Using prior knowledge for pathways
        * Available through omnipath package
    * [Valentin] Lines on volcano plots
        * Log fold change shifted to zero
        * [Constantin] Can occur due to small discrete values that show up in single cell, e.g. small counts
    * [Jennifer] Custom marker genes?
        * Just need to define a "source"
    * [Mikaela] Scaling of decoupler?
        * Python package is faster
    * [Lina] Generation of pseudobulk – do you account for cell number differences, or DESeq?
        * Assume DESeq normalization handles this
        * Have not explored though
        * [Isaac] I have seen effects from this
        * Pau: could regress number of cells? But pydeseq doesn't have that
    * [Isaac] Is using just a mean to describe a pseudobulk enough?
      Or do we need at least a mean and a variance?
        * Pau: could be better but

## topic ideas for future meetings

* genomic ranges follow-up
* out-of-memory/out-of-core access in AnnData
* awkward arrays in AnnData and their use-cases
* spatialdata

## 2023-05-30

### Attendees

*Giovanni Palla, Maren Buettner, Isaac Virshup, Alexander Malt, Clarence Mah, Cordata di vite, Emanuel Soda, Gregor Sturm, Kevin Lebrigand, Marco Varrone, Martin Fleischmann, Matt Rowe, Niklas Muller-Botticher, Rahul, Ross, Sebastian, Taobo Hu, Veronika, Paul Kiessling, Mikaela Koutrouli, Levi John Wolf, Emily Laubscher, Martin Kim, Wanqui Zhang, Thomas Doughetry, Shashawat Sahay, Danila Bredikhin, Lukas Heumos*

### Agenda

* SpatialData presentation
    * [twitter thread](https://twitter.com/LucaMarconato2/status/1656239450131660800?s=20)
    * [docs](https://spatialdata.scverse.org/en/latest/)
    * [preprint](https://www.biorxiv.org/content/10.1101/2023.05.05.539647v1)

### Notes

* Q [Maren] How does this interact with squidpy?
    * [GP] Should integrate well due to same data model for anndata
* [Gregor Sturm] How does the format relate to OME-Zarr
    * [GP] It's OME-Zarr + a few things, those are being expanded on
* [Gregor Sturm] How do we work with multiple patients/
* [Marco Varrone] How does this work with multiple modalities
    * MuData not right now, but is a priority
* [Martin Fleischmann] Geopandas dev, what kinds of stuff do we need, what's the intersection
    * [Clarence Mah]
    * [Levi John Wolf] https://rtree.readthedocs.io/en/latest/ The geopandas implementation would be here: https://geopandas.org/en/stable/docs/reference/sindex.html
    * Multilevel representations of polygon
        * https://github.com/mattijn/topojson
        * https://github.com/felt/tippecanoe
    * Vector Tiles
    * https://github.com/zarr-developers/geozarr-spec
* [Mikaela] Cell type proportion

## 2023-05-16

### Attendees

*Gregor Sturm, Ben Parks, Valentin Marteau, Maren Buettner, Jayaram Kancherla, Davide Cittaro, Martin Kim, Danila Bredikhin, Isaac Virshup, Babu Mia*

### Agenda

* Genomic ranges: possible implementations (bioframe vs. pyranges vs. biocpy GenomicRanges)
    * presentation by Jayaram Kancherla on BiocPy [GenomicRanges](https://github.com/BiocPy/GenomicRanges)

### Notes

* Biocpy
    * bioconductor datastructures are "backwards compatible at all cost"
    * ArtifactDB: Schema based storage of bioformats
        * + Metadata
    * Goals of biocpy
        * Align with artifactdb representations
        * Interop between R and python
    * BiocFrame
        * Being a little more flexible on column types
        * Nested dataframes
    * GenomicRanges
    * RangeSummarizedExperiment
        * a transposed anndata
    * There's also a javascript version of Bioconductor datastructures
* Q: Bioframe
* Q: [IV] Is ArtifactDB available?
    * Not yet
    * Building on alabaster
    * So a bunch of files, with metadata about how to read them in
* Q: [GS] Goals of biocpy
    * A datastructure similar to bioconductor to get things into
* Q: [IV] How flexible on genomic ranges support in AnnData
    * IV: bioframe is nice because it just operates on a pandas data frame. Also integration with open2c and seems pretty active
    * BP: data should be easily accessible from other libraries, possibility to index important
    * Nested list - grouped genomic ranges - more common in genetics, variants by point

## 2023-05-02

### Attendees

*Davide Cittaro, Gregor Sturm, Isaac Virshup, Lukas Heumos, Emma Dann, Adam Gayoso, Giovanni Palla, Cordate de Vite, Can Ergen-Behr*

### Agenda

* Flash talk by Davide Cittaro: [schist](https://schist.readthedocs.io/en/latest/tutorials.html)
* Hackathon recap

### Notes

#### Presentation

* Identifying cell groups
    * Stochastic block model
    * Nested formulation overcomes the difficulty of finding small communities in large graphs
    * https://graph-tool.skewed.de
* Similarity to CellRank
    * "cell marginals" similar to transition probability matrix
* Multiomic applications
    * Can use multiple graphs for multiomic integration
    * Use optimal transport for calculation of feature / feature dissimilarity, example of ATAC seq
* Limitations
    * Speed
    * Bootstrapping on multiple subsamples
    * GPU – possible, but not yet
* [Q: IV] How to get people to use this?
    * Speed biggest current barrier
* [Q: GP] Is it time or memory bound?
    * Can be pretty memory efficient, but limited by scaling out
* [GP] Any optimizations / approximations
    * It's a variational inference problem
* [Q: GP] How about metacells for speed up
    * A metacell is kinda like a block
    * But looking at dropping the whole knn graph, using a bipartite graph of genes to cells
* [Q: IV] What is the API for interacting with the tree of
    * Currently only work with one level of hierarchy
* [Q: IV] Is performance slow once, or multiple times
    * Throughout analysis
    * Priors can really cut down

## 2023-04-18

### Attendees

*Adam Klie, Gregor Sturm, David Laub, Valentin Marteau, Maren Buettner, Trevor Manz, Sebastian Lobentanzer, Mark Keller, Danila Bredikhin, Emma Dann, Martin Kim, James Cranley, Garrett Ng, Jan Engelmann, Alina Frolova, Laura Martins, Davide Cittaro, Can Ergen-Behr, Andrew McCluskey, Kazumasa Kanemaru*

### Agenda

* Genomic Ranges (presentation by Qi An)
* Accessing genomic assembly metadata via SQL tables (IV)
    * E.g. https://gist.github.com/ivirshup/9cb994e473f9f6f1325a361195d4d7a6#file-transcript_models_duckdb-py
* Ranges in chame ([notebook](https://github.com/gtca/chame/blob/main/docs/examples/ranges.ipynb) demo by DB)

### Notes

* An Qi – DKFZ Heidelberg
    * Genomic range usage
    * Future issues:
        * Categorical bug
            * [IV & DB] Should work, may be a problem on our side
        * Peaks called from samples not overlapping
            * [IV] Alternatives? Bins
            * [DC] Use pre-existing annotations
            * [MB] Alternative to macs2: genrich
        * DMR
* IV: Package for genome annotations based on bioconductor sqlite files
    * Maren - ChIPseeker annotate chip seq peaks
    * annotation source?
      II: pulls from ensembl
* Emma
    * Translating ids is quite useful
    * Would also like to work with regulatory info
        * ensembl regulatory information
    * Distance to genes
* Sebastian – BioCypher
    * Knowledge graph connecting genomic annotation information
    * [github.com/biocypher/biocypher](https://github.com/biocypher/biocypher)
    * https://github.com/orgs/saezlab/projects/5
    * Adapters to existing databases
        * https://biocypher.org/adapters.html#biocypher-meta-graph
* DB:
    * Demo on chame
    * https://github.com/gtca/chame/
* ED: solutions for plotting?
    * e.g. "genome browser views"
    * DB: there's an IGV jupyter widget https://github.com/igvteam/igv-notebook
    * DC: deeptools https://pygenometracks.readthedocs.io/en/latest/
    * AQ: [gosling](http://gosling-lang.org/examples/)
    * TM:
        * higlass tools
        * https://github.com/higlass/higlass-python
* Laura
    * Viewing gene information would be nice

## 2023-04-04

### Attendees

*Gregor Sturm, Alina Frolova, An Qi, Divyanshu Srivastava, Andrew McCluskey, Philipp Angerer, Lukas Heumos, Valentin Marteau, Isaac Virshup, Adam Gayoso, Martin Kim, Mikaela Koutrouli*

### Agenda

* [Functional Associations using Variational Autoencoders (FAVA)](https://github.com/mikelkou/fava) (Flash talk by Mikaela Koutrouli)
* Separate IO packages (GS)
    * issue for scirpy: https://github.com/scverse/scirpy/issues/385
    * was also being discussed for scanpy + other modalities
* Pytorch data loaders (IV)
* scverse example datasets (GS)

### Notes

* FAVA (Flash talk by Mikaela Koutrouli)
    * limitation of current protein-protein interaction networks: biased source, only contains information about well-studied proteins
    * --> reconstruct them from omics data
    * naive approach: co-expression (pairwise correlation) of proteins
    * discussion
        * Q isaac: how to get from latent space back to genes?
        * Recommended data source [DepMap](https://depmap.org/portal/)
* IO package
    * central package or split up by modality?
* Genomic ranges in anndata
    * An Qi – Presenting next meeting
    * currently using bioframe
    * danila: chame ([example](https://github.com/gtca/chame/blob/main/docs/examples/ranges.ipynb))
* pytorch dataset
    * map style: numpy, iter style
    * AnnData field validation (abstract)
    * Biggest barrier is out of core at the moment
    * First example could be "one anndata on disk" -> pytorch data loader
    * How important is shuffling?
        * Theoretically quite important, but not many concrete examples

## 2023-03-21

### Attendees

*Gregor Sturm, Giovanni Palla, Dylan Lam, Luis Omar Correa, Clarence Mah, Danila Bredikhin, Mario Acera, Lukas Heumos, Isaac Virshup, Valentine Svensson, Haarshaadri jp, Hugh Warden, Luca Marconato*

### Agenda

* Bentotools (Flash talk by Clarence Mah)
* State of genomic ranges (IV) (https://github.com/scverse/anndata/issues/624)

### Notes

* Bentotools – https://github.com/ckmah/bento-tools
    * Clarence Mah – UCSD
    * RNA biology, especially organization of RNA within a cell
    * New updates:
        * RNAFlux – Find subcellular domains with local expression
    * Qs:
        * Spatial domains defined by clustering, in this example defined by SOM
        * [GP]: Ripley statistics in squidpy assume evenly distributed points. What's available in python?
            * [CM]: Currently using colocation analysis metrics quite heavily
        * [IV] How has performance been going with respect to dataset size?
            * Will be a problem going forward!
* Genomic ranges
    * GS: can awkward arrays represent ranges?

## 2023-03-07

### Attendees

*Danila Bredikhin, Babu Mia, Gregor Sturm, Giovanni Palla, Jesko Wagner, Isaac Virshup*

### Agenda

* Testing against release candidates (isaac)

### Notes

* MuData in squidpy?
    * DB: not quite
    * GS: Spatial AIRR data, visium based
        * https://nbviewer.org/github/soph-liu/Slide-TCR-seq/blob/main/Slide-TCR-seq%20Analysis%20and%20Figures.ipynb
    * IV: How do people want to do spatial multimodal integration?
* IV: Testing against release candidates
    * GS, IV: Can we remove the red X for a test that's okay to fail?
        * Doesn't seem like this is possible. Closest thing that could work: always return success, and automatically add a comment that states the optional tests failed.
    * IV: Limitation – see if new release breaks downstream packages
        * Cron job? Stops after 60 days
* GP: hosting of data with AWS credits
    * IV: yes!
    * Distribution may need more thought, but let's see how much it costs

## 2023-02-21

### Attendees

*Florian Heyl, Isaac Virshup, Philipp Angerer, Clarence Mah, Jesko Wagner & Hugh Warden, Gregor Sturm, Giovanni Palla, Tim Treis, Kevin Lebrigand, MD Babu Mia, Anna Diamant, Andrew McCluskey, Adam Gayoso, Laurens Lehner*

### Agenda

* Jesko: lightning talk
* Clarence: bento-tools’ SpatialData compatibility
    * Context: According to the [documentation](https://spatialdata.scverse.org/en/latest/), beta is coming soon; bento-tools is very interested in refactoring to use SpatialData instead of heavily modded AnnData to support its data structures. SpatialData meetings would be more appropriate to discuss this, but unable to attend (1AM PST).
* Isaac: scanpy GPU support – what does this API look like?
    * Context: [post on scverse zulip](https://scverse.zulipchat.com/#narrow/stream/316218-repo-management/topic/sklearn.3A.20compute.20backends/near/326834041) and ["Patterns for GPU support" conversation on the Scientific Python discord](https://discord.com/channels/786703927705862175/1072596591363489792)
* Isaac (optional): Benchmarking setup for core projects, how do we run this?
    * Context: [Using asv](https://asv.readthedocs.io/en/stable/), how do we call for benchmarking runs on private infrastructure? We will have a server (with GPUs) to run benchmarks + a portal server to receive requests
* Gregor (optional): Integration tests: Leveraging GitHub actions' `cron` feature to regularly test core packages against development versions of dependencies?

### Notes

* Intros
* Jesko talk: Drug Discovery from High Content Images
    * ![overview slide](https://i.imgur.com/J0bjRzK.png)
    * Single cell analysis w/ high content images
    * 10-100m treated / stained cells
    * Feature extraction w/ squidpy, cellprofiler
    * scmorph
    * Qs:
        * GP: How are you working with the images? Just features in the anndata?
            * Not doing as much with the images once they hit anndata
            * Loading data into a db? Maybe, but not so much of a problem yet
            * GP will put in contact with *fractal(?)* developers
        * IV: How does cell segmentation/feature extraction work?
            * currently not a custom solution, relying on cellprofiler/squidpy
* CM: Bento-tools + spatial transcriptomics
    * https://bento-tools.readthedocs.io/en/latest/
    * Interested in using spatial data, but haven't yet
    * Particular needs:
        * Spatial subsetting
        * Big usage of polygons
    * GS: Differences from squidpy?
        * Subcellular + mRNA localization
    * KL: clarification of use cases
        * colocalization of transcripts, proximity of transcripts to cellular compartments
    * IV: How do you identify subcellular features
        * GP: FISHfactor – featurizing subcellular distribution
        * TT: https://www.biorxiv.org/content/10.1101/2021.11.04.467354v1
        * CM: Currently more looking for known features than an unsupervised approach, known features like lack of transportation
* IV: GPU backends for scanpy
    * TODO: link to slides
    * which backends to support? RAPIDS, pytorch, jax
    * What should be the priorities?
    * Different models for GPU support
        * Integration into scanpy
        * separate package that mimics the API
        * plugin system that adds functionality to scanpy
    * Questions/Discussion
        * PA: Activating GPU support should always be explicit (explicitness includes operating on an implementation-specific data structure like CuArray from RAPIDS)
        * AG: Scanpy is just a vendor of other libraries for basically all algorithms.
        * PA: Not everything needs to be GPU: current numba use cases could be e.g. Rust instead.
This would lower the maintenance burden of multiple GPU backends as fewer functions would need it.
        * GS: Could move from separate packages (explorative) to plugin system
        * IV: libraries that have done good support for both cpu and gpu?
            * CM: [TensorLy](http://tensorly.org/stable/index.html)
            * AG: [Aesara](https://aesara.readthedocs.io/en/latest/)
        * AG: what about out-of-memory data?
            * IV: rapids has good integration with dask
* GS: Testing against dev versions of upstream packages
    * nightly builds to catch dependencies breaking our package?
    * use `pip install --pre` to catch that early?
    * TODO: figure out how notifications for nightly builds work?

## 2023-02-07

### Attendees

*Valentine Svensson, Gregor Sturm, Jesko Wagner, Philipp Angerer, Danila Bredikhin, Giovanni Palla, Harald Voehringer, Noorsher Ahmed (Bento tools), Isaac Virshup*

### Agenda

* What do we want out of these meetings? (IV)
* Recap of progress since launch (cookiecutter, ecosystem listing, paper, ...) (GS)
* Tutorials page (https://github.com/scverse/scverse-tutorials/issues/43) (GS)

### Notes

* Intros
* What do people want out of this meeting
    * JW: Wants to know plans for scverse, where things are headed
    * DB: Feedback from the community
        * Synchronous catch up on issues
    * JW: lightning talks
        * Working on a number of imaging problems, could talk about
    * NA: Keeping up on development with
    * VS: Updates on current priorities
        * suggests to always start meetings with short recap of current developments
    * HV: Policies for scverse projects, best practices
        * IV: https://github.com/scverse/ecosystem-packages
* Recap of projects (GS)
* IV: Perturbation data
    * JW: Thinks things are generally do-able
    * VS: Estimating effect size, fancier linear models
        * Currently using zellkonverter, talking to lme4, brms
            * brms generates Stan code
        * GP: pyMC?
            * has used this in the past
            * But brms model language is great – special additions for splines, hierarchical models, spatial correlation
        * Systems of models
    * IV: pyDESeq2
        * VS: Currently does simple models, group vs group
        * GS: https://github.com/scverse/governance/pull/44 There are PRs for more complex models
        * VS: Package for extracting model fits
* IV: Genetics data
    * VS: Data structures
    * GP: Person at Helmholtz who we should follow up with
* GP: Can we just call R?
    * Interop?
    * JW: Can we call Julia / how is Julia GPU support?
        * IV: Stats great, GPU not
        * VS: Has used MixedModels.jl, liked it
* GS: Tutorials for scverse packages
    * How do we select tutorials, where does this go
    * https://github.com/scverse/scverse-tutorials/issues/43
    * VS: suggestion, put links to tutorial sections/use cases clearly visible on landing page
    * GS: Overview page on site, then separate sphinx site
    * NA: Would like to have bento-tools visible here, but also how
    * NA: similar to ecosystem checklist, have tutorial checklist
        * GS: see https://github.com/scverse/governance/pull/44
