---
tags: sourmash, roadmap, branchwater, 2023
---

2023 sourmash roadmap ideas
===

Here are the fundamental changes and advancements we have in mind for the sourmash software. Many of these require significant developer time and may change based on new research directions or observed need. We encourage community and user direction requests via our issue tracker!

## Multithreading of all utilities

Status: in progress

## Improved binary storage

Sourmash has benefitted greatly from storing sketches in standard JSON format (and as gzipped JSON files).

## Improved indexing methods

## Allow plug-ins for k-mer/sketch types

## Documentation Improvements

Status: Started (describe ctb doc work)

- can we use plug-in framework to test out new k-mer types and/or sketch types?

## Taxonomic Profiling and Classification; Functional Profiling

- need to revamp testing --> test framework for any given set of ranks / taxonomic class
- ICTV taxonomic ranks
- LIN taxonomic framework
- KEGG functional profiling
- recipe: contig-level taxonomic classification (multigather --> tax)

## Performance Monitoring

While we currently have an automated microbenchmark framework, we will implement macrobenchmarks that report the resources required for search and classification for each new sourmash release and database.

## Priorities/Conceptual Directions:

- higher res sketches
- better binary storage
- parallelism
- on-disk indexed data structures

## Database roadmap:

- path to ~monthly updates to current dbs
    - add database generation/update information
- path to new databases
    - what higher res dbs? (virus)
    - protein dbs --> available
- SRA db update: workflow + plan
- better docs for building new db with taxonomy
- Pooch(?) database registry to allow easy/robust db download
- update/document wort so we can support?
- path to adding protein sketches, higher res sketches (virus), etc?
- path to adding plant/animal euk sketches (w/filters if needed)
    - filtered sketch benchmarks, etc (filter unique k-mers)

## Software Advancements/Extensions

- parallelize all the things (branchwater/mastiff)
- Larger benchmarks to run on each release (CAMI?)
    - benchmark workflow!
- Revisit documentation structure, add FAQ, etc
- Web/desktop rs app (yew+tauri?) w/associated viz
    - allow web search of any provided sourmash databases, allow tuning, etc.
    - Seems feasible to template & then tailor for custom dbs. Plus, it's just something I (NTP) want to try in Rust :). Tauri desktop app could be really neat for accessibility.
- Metadata integration/exploration tool
    - challenge: metadata is a mess. Relevant: https://academic.oup.com/bioinformatics/article/39/1/btac667/6971839
- Enable newer hashing/sketching types?
    - maybe just start with an example/plugin for how to add other hashing, sketch types, k-mer types. E.g. split k-mer; jam-rs https://github.com/sourmash-bio/sourmash/issues/2710
- index plugin: sketch from FASTAs --> enables gather directly from FASTAs
    - + docs for how to build a plugin for a new database type

## Smaller Software/docs updates

- keep tax file in zip file / add assoc utils
- ICTV taxonomy support
- Docs: recommend collections and databases as default sig storage (bc manifests!), instead of individual files. Directory also ok if manifest.
- Upgrade LCA databases to use newer tax functions / build Rust LCA database. This would also extend taxonomic ranks, allow NCBI taxids, etc.
- sketch directly from web link instead of local FASTA --> sketch fromfile setup to allow multithreading?
    - but maybe issues with download limits...
- plugin: neighbor-joining tree viz
- plugin: kspider clustering from mastiff db?
- plugin: taxonomic benchmarking comparison
    - given ground truth + sourmash, estimate mismatches at each level --> F1 score, etc.
    - or can we use/adapt CAMI tools for this?
    - would be useful to automate into a tax benchmarking workflow.
- k-mer counting utilities for mastiff databases
- taxonomy-based k-mer counting (e.g. unique to rank level, or specific taxon at rank, etc)
- ...

## to do:

- new landing page with big-picture vision, mission, features.
    - what sourmash offers
    - etc

## Manuscripts

- Branchwater / Mastiff manuscript
    - describe software, benchmark?
- Manuscript on taxonomy and LCA utilities
    - brief description + use cases
    - 2-3 pages?
- Protein manuscript --> finish last section w/pyo3_branchwater utils; preprint and submit