---
tags: sourmash, LIN, lin, lins, taxonomy, sourmash taxonomy, sankey, alluvial
---
sourmash + LINs meeting, 03/13/2023
===
## Feedback/concerns for sourmash + LINs:
#### Columns are ~good, and options were understandable
#### If there are no matches, `gather` does not produce an output file.
- Can it produce a file?
- we could add this; see https://github.com/sourmash-bio/sourmash/issues/2357
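
Until `gather` writes a file in the no-match case, a downstream script can treat a missing output file as "no matches". The following is a minimal sketch of that workaround, not part of sourmash: `load_gather_results` is a hypothetical helper, and the file names and CSV contents are made up for the demo.

```python
import csv
import tempfile
from pathlib import Path


def load_gather_results(path):
    """Return gather rows as a list of dicts; a missing file means no matches."""
    path = Path(path)
    if not path.exists():
        # Hypothetical message; the real gather command prints its own output.
        print(f"No gather output at {path}: no matches were found. "
              "Consider lowering the match threshold with --threshold-bp.")
        return []
    with path.open(newline="") as fp:
        return list(csv.DictReader(fp))


# Demo in a temporary directory with made-up file names and contents.
with tempfile.TemporaryDirectory() as tmp:
    # Case 1: gather found nothing, so the file was never written.
    missing = load_gather_results(Path(tmp) / "sample.gather.csv")

    # Case 2: a tiny stand-in for a gather CSV with one match.
    present_path = Path(tmp) / "other.gather.csv"
    present_path.write_text("name,f_unique_to_query\ngenomeA,0.25\n")
    present = load_gather_results(present_path)
```

Returning an empty list keeps the "no matches" case on the same code path as a normal result, so later steps do not need a separate check for the missing file.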
#### Threshold option not immediately obvious
- Can we add a suggestion to try lowering the threshold if there are no matches?
- And/or maybe add `--threshold-bp` suggestion to prefetch/gather help message? e.g. maybe add:
> "If you don't get any matches, you can try lowering the match threshold using `--threshold-bp`"
#### Gather's greedy selection is an issue when we're working with highly similar genomes w/ low abundance.
- What can we do about it?
- At minimum, notify about equivalent matches from prefetch step?
- other ideas: optionally do gather assignments at species/lingroup level if genomes are too similar?
- Can we add a "confidence threshold" column, which gives you some idea of how confident the call of this particular genome/lineage is?
- not at present. Note to self & @ctb - think about YACHT confidence info?
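
To make the greedy-selection concern concrete, here is a toy sketch of a greedy min-set-cover over plain Python sets. It is an illustration only, not sourmash's implementation: the hash values, genome names, and the `greedy_gather` helper are all made up. It also shows one way the "notify about equivalent matches" idea could look.

```python
# Toy illustration of greedy min-set-cover with near-identical references.
# All hash values and genome names are invented for this example.

def greedy_gather(query, references, threshold=1):
    """Repeatedly pick the reference with the largest overlap with the
    remaining query hashes, and record references that tie for best."""
    remaining = set(query)
    results = []
    while remaining:
        overlaps = {name: len(remaining & hashes)
                    for name, hashes in references.items()}
        # Sort names first so ties are broken deterministically (alphabetically).
        best_name = max(sorted(overlaps), key=overlaps.get)
        best = overlaps[best_name]
        if best < threshold:
            break
        tied = sorted(name for name, n in overlaps.items()
                      if n == best and name != best_name)
        results.append((best_name, best, tied))
        # Hashes claimed by the winner are no longer available to other genomes.
        remaining -= references[best_name]
    return results


query = {1, 2, 3, 4, 10, 11}
references = {
    "strainA": {1, 2, 3, 4, 5},
    "strainB": {1, 2, 3, 4, 6},  # covers the same query hashes as strainA
    "other": {10, 11, 12},
}

results = greedy_gather(query, references)
for name, overlap, tied in results:
    print(name, overlap, "equivalent matches:", tied)
```

In this toy case `strainB` explains the query exactly as well as `strainA`, but only `strainA` is reported as a match because it claims the shared hashes first. Surfacing the tied names is one possible way to flag that the genome-level call is ambiguous.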
#### Will `sourmashconsumR` work for LINs?
- Not out of the box; ranks are hard-coded into each function for now. Modification may be possible.
- Python code options:
- I have code to produce a sankey diagram via plotly (modified from ctb original code). Could modify for LINs.
- [code](https://github.com/bluegenes/2021-benchmarking-dev/blob/main/figs/sourmash_sankey.py); example output: [figure: sankey diagram, image not available]

- The notebooks are too big to render on GitHub, but you can view or run them by following the instructions [here](https://github.com/bluegenes/2021-benchmarking-dev)
- plus side: plotly is interactive, and may be a good option for genomeRxiv website viz.
- minus side: not as pretty; needs work to look good while staying flexible across the full range of datasets it could be used with
### Tessa's suggestions for Parul:
1. Try using `--threshold-bp 0` when you don't get any matches
2. Try using higher-resolution (lower `scaled`) databases. I would think `scaled=100` would solve most of the issues, but I went a little crazy and made a `scaled=5` version you can download [here](https://osf.io/ybt75/download). You'll need to re-sketch your samples at the higher resolution (lower `scaled`). However, you can sketch once at, say, `scaled=5` and then use those sketches (and this database) at a higher `scaled` value by adding the `--scaled xx` parameter to `prefetch` or `gather` (sourmash will downsample on the fly).
- `scaled` is primarily a tradeoff between size/memory/search time and specificity. I'd recommend using the highest `scaled` value that gives you good results. Definitely look at `scaled=100` and `scaled=10` as better options for long-term use (instead of `scaled=5` :).
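
Why sketching once at a low `scaled` value is enough can be shown with a toy model: a sketch keeps only hashes below roughly `2**64 / scaled`, so a `scaled=100` sketch is a subset of the `scaled=5` sketch of the same data. The code below is a sketch of that idea under stated assumptions. It does not call sourmash; `toy_hash` (SHA-256 based) stands in for sourmash's MurmurHash, and the "k-mers" are placeholder strings.

```python
import hashlib

MAX_HASH = 2**64  # size of the 64-bit hash space


def toy_hash(kmer):
    # Stand-in 64-bit hash; sourmash itself uses MurmurHash, not SHA-256.
    return int.from_bytes(hashlib.sha256(kmer.encode()).digest()[:8], "big")


def sketch(kmers, scaled):
    """Keep hashes below MAX_HASH / scaled (about 1/scaled of all k-mers)."""
    cutoff = MAX_HASH // scaled
    return {h for h in map(toy_hash, kmers) if h < cutoff}


def downsample(hashes, new_scaled):
    """Drop hashes above the cutoff for a higher (coarser) scaled value."""
    cutoff = MAX_HASH // new_scaled
    return {h for h in hashes if h < cutoff}


# Placeholder "k-mers"; real sketches are built from sequence k-mers.
kmers = [f"kmer{i}" for i in range(20000)]

fine = sketch(kmers, scaled=5)      # high resolution, larger sketch
coarse = sketch(kmers, scaled=100)  # lower resolution, smaller sketch

# Downsampling the fine sketch gives exactly the coarse sketch.
same = downsample(fine, 100) == coarse
print(len(fine), len(coarse), same)
```

The reverse is not possible: a `scaled=100` sketch has already discarded the hashes a `scaled=5` sketch would need, which is why samples have to be re-sketched to move to a higher resolution.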
### Reza/Parul: Testing genome classification
If you get a chance, it'd be great to try out the LIN genome classification too. It's in a separate PR. Testing instructions:
Download an environment file that points to this branch:
```
curl -JLO https://raw.githubusercontent.com/bluegenes/2023-demo-sourmash-LIN/main/genome-lin.yml
```
Create a conda environment from this file:
```
conda env create -f genome-lin.yml
```
Activate that environment, which is named `tax-genome-lins`:
```
conda activate tax-genome-lins
```
Make sure `--lins` appears in the `--help` output for `sourmash tax genome`:
```
sourmash tax genome --help
```
#### Command to run
The command to run is this one:
```
sourmash tax genome -g $gather_csv -t $taxonomy_csv \
--lins --lingroup $lingroups_csv
```
Note that the gather file used here should come from running `sourmash gather` with a genome query, since `sourmash tax genome` tries to assign the query to the best lineage supported by the `sourmash gather` results.