Prep for k-mer meeting
===

## CTB meeting 04/11/24

- Metadata is terrible
- Interpretation of metadata is terrible and annoying
- People want to be able to use private databases

Search works... now what? What are we still missing? Where do we need to go?

Branchwater web issue tracker: subselect samples you’re interested in

How to organize/subselect the metadata
-> Would be nice if we could agree on a common identifier scheme

Use cases perspective:
Here are 5 use cases that people want

- Restrict samples they search
- Federated-style search including private samples
- Augment metadata with their own information (curated set of samples they care about)

We differentiate from Pebblescout b/c they can’t do private (functionality)

~20mins?

## Notes from mtg w/Luiz re talk: (4/23/24)

- Maintaining flexibility for metadata aggregation
    - Store metadata separately
- STABLE IDENTIFIERS: would be neat to have a unifying identifier scheme
    - E.g. first space-separated token? Can always hash these for efficient lookups
    - Cross-tool compatibility based on stable identifiers
    - Concept of linking contigs -> identifiers
        - Make it easier to use the graph across tools.
        - This doesn’t mean they need to be hashed the same way
    - List of stable identifiers of datasets used.
        - NCBI Assembly Datasets = stable. Tax not stable
        - v. useful for linking across tools
    - Look at rsrs https://github.com/COMBINE-lab/rsrs?tab=readme-ov-file + seqcol https://seqcol.readthedocs.io/en/latest/
- Federated databases
    - Useful that we can combine. Also good for cross-tool usage
    - Search diff places with the same thing
    - Flexibility! Protocol
        - Any db with stable identifiers
    - Socks protocol - https://gitlab.ub.uni-bielefeld.de/gi/socks
- RocksDB -> design decisions to not reinvent k-mer management
    - (Main point - take advantage of existing tools)
    - Maintenance is hard, don’t make it harder
- Using all the reference genomes
    - With all the k-mers, we (mostly) can’t use all the datasets (yet!)
        - Maybe one day you will have that solution, but afaik it’s not here yet
    - So, how to search ALL the datasets? -> FracMinHash
- FracMinHash
    - You can do A LOT with subsampling the data
    - You can reuse all the theory, color-based indexes, etc.; it all works with subsampling
    - Diff axis when thinking about performance.
- Use case: allowing private sequencing collections
    - Building tools that enable working publicly or privately.
    - People doing data analysis can’t (e.g. human) or don’t want to share their data prior to analysis
    - A lot of data can’t be public (for good reasons)
- Subselecting databases
    - Picklists, etc.
    - Flexibility on query and search
- Open development & maintenance helps drive sourmash directions
    - Doing maintenance drives innovation
    - B/c useful, used, flexible: we can adapt to diff use cases, etc.
    - E.g. sylph
- Software considerations:
    - Easy installation + documentation is critical for use
        - Conda/bioconda, etc.
    - Tests + CI essential

Guiding philosophy

1. Maintain flexibility on query and search
    1. Only query what you want to query by allowing flexible selection on identifiers and metadata
    2. Reuse things that database folks already developed (SQL-style queries)

- 5min - explain + show sourmash / branchwater
- 10mins - lessons learned

Go with an identifier for a dataset that is not going to change. This is not taxonomy: tax gets reevaluated and changed!
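
Possible toy example for the talk: a rough sketch of how stable identifiers, FracMinHash subsampling, and picklist-style subselection fit together. This is not sourmash or branchwater code. The function names, the SHA-256 stand-in hash, the tiny `ksize`, and the `GCA_...` accessions are all made up for illustration, and reverse-complement (canonical) k-mers are ignored.

```python
import hashlib
from typing import Dict, List, Optional, Set, Tuple

MAX_HASH = 2**64  # k-mer hashes are 64-bit integers in [0, MAX_HASH)


def stable_id(name: str) -> str:
    """Use the first space-separated token of a record name as its identifier."""
    return name.split(" ", 1)[0]


def kmer_hash(kmer: str) -> int:
    """Illustrative 64-bit hash (first 8 bytes of SHA-256); a stand-in only."""
    return int.from_bytes(hashlib.sha256(kmer.encode()).digest()[:8], "big")


def frac_minhash(seq: str, ksize: int = 21, scaled: int = 1000) -> Set[int]:
    """Keep only k-mer hashes below MAX_HASH / scaled (about 1/scaled of them)."""
    threshold = MAX_HASH // scaled
    hashes = (kmer_hash(seq[i:i + ksize]) for i in range(len(seq) - ksize + 1))
    return {h for h in hashes if h < threshold}


def containment(query: Set[int], subject: Set[int]) -> float:
    """Fraction of the query sketch that is also present in the subject sketch."""
    return len(query & subject) / len(query) if query else 0.0


def search(
    query: Set[int],
    database: Dict[str, Set[int]],
    picklist: Optional[Set[str]] = None,
) -> List[Tuple[str, float]]:
    """Rank datasets by containment, optionally restricted to picklist identifiers."""
    results = [
        (ident, containment(query, sketch))
        for ident, sketch in database.items()
        if picklist is None or ident in picklist
    ]
    return sorted(results, key=lambda r: r[1], reverse=True)


# Hypothetical records: the names and sequences are invented for this sketch.
records = {
    "GCA_000000001.1 hypothetical genome A": "ACGTACGTTAGCCGATAGCTAGGCTA",
    "GCA_000000002.1 hypothetical genome B": "TTTTGGGGCCCCAAAATTTTGGGGCC",
}

# scaled=1 keeps every k-mer so the toy data is not subsampled away;
# real collections would use a much larger scaled value.
database = {
    stable_id(name): frac_minhash(seq, ksize=5, scaled=1)
    for name, seq in records.items()
}

query = frac_minhash("ACGTACGTTAGCCGATAGC", ksize=5, scaled=1)
hits = search(query, database, picklist={"GCA_000000001.1"})
```

The database is keyed by identifier only, so metadata (including taxonomy, which changes) can live in a separate table and be joined on that key. The picklist is just a set of identifiers, so it could come from a private curated list or a SQL-style metadata query.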