sourmash + LINs meeting, 03/13/2023

---
tags: sourmash, LIN, lin, lins, taxonomy, sourmash taxonomy, sankey, alluvial
---

sourmash + LINs meeting, 03/13/2023
===

## Feedback/concerns for sourmash + LINs:

#### Columns are ~good, and options were understandable

#### If no matches, gather does not produce any file as a result.
- Can it produce a file?
    - we could add this; see https://github.com/sourmash-bio/sourmash/issues/2357

#### Threshold option not immediately obvious
- Can we add a suggestion to try lowering the threshold if there are no matches?
- And/or maybe add `--threshold-bp` suggestion to prefetch/gather help message? e.g. maybe add:
    > "If you don't get any matches, you can try lowering the match threshold using `--threshold-bp`"

#### Gather's greedy selection is an issue when we're working with highly similar genomes w/ low abundance.
- What can we do about it?
    - At minimum, notify about equivalent matches from prefetch step?
    - other ideas: optionally do gather assignments at species/lingroup level if genomes are too similar?
- Can we add a "confidence threshold" column, which gives you some idea of how confident the call of this particular genome/lineage is?
    - not at present. Note to self & @ctb - think about YACHT confidence info?

#### Will `sourmashconsumR` work for LINs?
- not out of the box, ranks are hard-coded into each function for now. Modification may be possible.
- Python code options:
    - I have code to produce a sankey diagram via plotly (modified from ctb original code). Could modify for LINs.
        - [code](https://github.com/bluegenes/2021-benchmarking-dev/blob/main/figs/sourmash_sankey.py); example output:
          ![](https://hackmd.io/_uploads/HJIrdXpJ3.png)
        - The notebooks are too big to render on github, but you can see them/ run them by following the instructions [here](https://github.com/bluegenes/2021-benchmarking-dev)
    - plus side: plotly is interactive, and may be a good option for genomeRxiv website viz.
    - minus side: not as pretty/ needs work to be pretty + still flexible over the full range of datasets that it could be used with

### Tessa's suggestions for Parul:

1. Try using `--threshold-bp 0` when you don't get any matches
2. Try using higher resolution (lower scaled) databases. I would think `scaled=100` would solve most of the issues, but I went a little crazy and made a `scaled=5` version you can download [here](https://osf.io/ybt75/download). You'll need to re-sketch your samples at the higher resolution (lower scaled). However, you can do this at, say `scaled=5` and then use those (and this database) with higher scaled by adding the `--scaled xx` parameter to `prefetch` or `gather` (sourmash will downsample on the fly).
    - Scaled is primarily a tradeoff between size /memory /search time and specificity. I'd recommend trying to use the highest scaled that gives you good results. Definitely look at `scaled=100` and `scaled=10` as better options for long-term use (instead of `scaled=5` :).

### Reza/Parul: Testing genome classification

If you get a chance, it'd be great to try out the LIN genome classification too. It's in a separate PR.

Testing instructions:

Download an environment file that points to this branch:
```
curl -JLO https://raw.githubusercontent.com/bluegenes/2023-demo-sourmash-LIN/main/genome-lin.yml
```

Create a virtual environment using this file:
```
conda env create -f genome-lin.yml
```

Activate that environment, which is named `tax-genome-lins`:
```
conda activate tax-genome-lins
```

make sure `--lins` is in the `--help` for `sourmash tax genome`:
```
sourmash tax genome --help
```

## Command to run

The command to run is this one:
```
sourmash tax genome -g $gather_csv -t $taxonomy_csv \
    --lins --lingroup $lingroups_csv
```

Note that the gather file used here should be the result of running sourmash gather using a genome query, since `sourmash tax genome` tries to annotate the file to the best lineage produced by `sourmash gather`.
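
Since gather may not write any file when there are no matches (see the feedback above), it could help to check the gather CSV before calling `sourmash tax genome`. The following is a minimal sketch, not part of sourmash: the helper names `count_gather_matches` and `build_tax_genome_cmd` and the example file names are hypothetical, and only the flags from the command above are used.

```python
import csv
from pathlib import Path


def count_gather_matches(gather_csv):
    """Return the number of result rows in a gather CSV, or 0 if the file is missing."""
    path = Path(gather_csv)
    if not path.is_file():
        # As noted above, gather may not write a file when nothing matches.
        return 0
    with path.open(newline="") as fh:
        # DictReader consumes the header row, so only result rows are counted.
        return sum(1 for _ in csv.DictReader(fh))


def build_tax_genome_cmd(gather_csv, taxonomy_csv, lingroups_csv):
    """Build the argument list for the `sourmash tax genome` call shown above.

    Returns None when there are no gather matches to annotate.
    """
    if count_gather_matches(gather_csv) == 0:
        return None
    return [
        "sourmash", "tax", "genome",
        "-g", str(gather_csv),
        "-t", str(taxonomy_csv),
        "--lins", "--lingroup", str(lingroups_csv),
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = build_tax_genome_cmd("sample.gather.csv", "lins.taxonomy.csv", "lingroups.csv")
    if cmd is None:
        print("No gather matches; consider re-running gather with a lower --threshold-bp.")
    else:
        print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` inside the `tax-genome-lins` environment. Building the command as a list rather than a shell string avoids quoting problems with file paths.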