Tax changes
===
## Feb 20 Update
Remaining to do:
### LIN:
- [x] use and test `--LIN-position` with `tax metagenome`
- [x] handle `human` output for `--LIN-taxonomy` and no specified rank
- [x] report at lowest rank to mirror 'species' default
- [x] Change LINLineageInfo for easier "empty" initialization
- [ ] LINgroups ordering: use build_tree + find_lca
- [ ] better: build tree with just the LINgroups?
- do I need to go up the dictionary tree? use ctb code but store the whole LineageInfo?? not sure...
- [ ] `tax genome`
- [ ] use `--LIN-taxonomy` + test
- [ ] use `--LINgroups` + test
- [ ] use `--LIN-position` + test
- [ ] `tax grep`?
- [ ] `tax annotate`?
- [ ] cleanup: move some `BaseLineageInfo` methods --> `RankLineageInfo`??
- [ ] **decide 0-indexing vs. 1-indexing LIN position!!!**
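
To make the indexing question concrete: the same `--LIN-position` value selects a different prefix under each convention. A minimal sketch, not the sourmash implementation; `lin_prefix` and its `one_based` flag are hypothetical names, and LINs are assumed to be `;`-separated strings:

```python
def lin_prefix(lin, position, one_based=False):
    """Return the LIN prefix that ends at `position` (inclusive).

    Hypothetical helper for illustration only; assumes a ';'-separated LIN.
    """
    parts = lin.split(";")
    # Translate the user-facing position into "how many positions to keep".
    n_keep = position if one_based else position + 1
    if not 1 <= n_keep <= len(parts):
        raise ValueError(
            f"position {position} out of range for LIN with {len(parts)} positions"
        )
    return ";".join(parts[:n_keep])


lin = "0;0;1;2;0"
zero_based = lin_prefix(lin, 2)                 # keeps positions 0, 1, 2
one_based = lin_prefix(lin, 2, one_based=True)  # keeps positions 1, 2
```

Whichever convention is chosen, validating the range in one place (as above) keeps off-by-one errors out of the summarization code.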
### CAMI:
- [ ] make output fn work
- [ ] test output
- [ ] check against current CAMI format (this is based on Luiz's PR last year)
### Taxonomy Lab Meeting
- What is `sourmash taxonomy`, and how is it useful for folks? (What does it do?)
- What are the limitations?
- Why add `LIN` and `CAMI` functionality?
- demo? idk idk
## previous to-do
### CAMI
example output:
```
##CAMI Submission for Taxonomic Profiling
@Version:0.9.1
@SampleID:SAMPLEID
@Ranks:superkingdom|phylum|class|order|family|genus|species|strain
@@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE
2 superkingdom 2 Bacteria 98.81211
2157 superkingdom 2157 Archaea 1.18789
1239 phylum 2|1239 Bacteria|Firmicutes 59.75801
1224 phylum 2|1224 Bacteria|Proteobacteria 18.94674
28890 phylum 2157|28890 Archaea|Euryarchaeotes 1.18789
91061 class 2|1239|91061 Bacteria|Firmicutes|Bacilli 59.75801
28211 class 2|1224|28211 Bacteria|Proteobacteria|Alphaproteobacteria 18.94674
183925 class 2157|28890|183925 Archaea|Euryarchaeotes|Methanobacteria 1.18789
1385 order 2|1239|91061|1385 Bacteria|Firmicutes|Bacilli|Bacillales 59.75801
356 order 2|1224|28211|356 Bacteria|Proteobacteria|Alphaproteobacteria|Rhizobacteria 10.52311
204455 order 2|1224|28211|204455 Bacteria|Proteobacteria|Alphaproteobacteria|Rhodobacterales 8.42263
2158 order 2157|28890|183925|2158 Archaea|Euryarchaeotes|Methanobacteria|Methanobacteriales 1.18789
```
:::info
To output, we need:
- to have the NCBI `taxids` for all ranks per reference genome (**"taxpath"**).
- to modify lineageDB to store taxid for each rank (use new LineagePair?)
- For LINs (below), also need to enable LIN storage instead of/in addition to tax ranks
:::
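
Once a taxid is available for every rank, one profile row could be assembled roughly like this. This is a sketch only: `LineagePairWithTaxid` and `cami_row` are hypothetical names (not existing sourmash classes or functions), and the columns are assumed to be tab-separated in the order given by the `@@TAXID` header above.

```python
from collections import namedtuple

# Hypothetical stand-in for a lineage entry that also carries an NCBI taxid.
LineagePairWithTaxid = namedtuple("LineagePairWithTaxid", ["rank", "name", "taxid"])


def cami_row(lineage, percentage):
    """Format one row: TAXID, RANK, TAXPATH, TAXPATHSN, PERCENTAGE."""
    lowest = lineage[-1]
    taxpath = "|".join(str(lp.taxid) for lp in lineage)
    taxpathsn = "|".join(lp.name for lp in lineage)
    fields = [str(lowest.taxid), lowest.rank, taxpath, taxpathsn, f"{percentage:.5f}"]
    return "\t".join(fields)


lineage = [
    LineagePairWithTaxid("superkingdom", "Bacteria", 2),
    LineagePairWithTaxid("phylum", "Firmicutes", 1239),
]
row = cami_row(lineage, 59.75801)
```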
@ctb wrote code for lowest taxid, e.g. [2022-assembly-summary-to-lineages](https://github.com/ctb/2022-assembly-summary-to-lineages), output:
```
accession,taxid,superkingdom,phylum,class,order,family,genus,species,strain
AAAC01000001,191218,Bacteria,Firmicutes,Bacilli,Bacillales,Bacillaceae,Bacillus,Bacillus anthracis,
AABL01000001,73239,Eukaryota,Apicomplexa,Aconoidasida,Haemosporida,Plasmodiidae,Plasmodium,Plasmodium yoelii,
AABT01000001,285217,Eukaryota,Ascomycota,Eurotiomycetes,Eurotiales,Aspergillaceae,Aspergillus,Aspergillus terreus,
AABF01000001,209882,Bacteria,Fusobacteria,Fusobacteriia,Fusobacteriales,Fusobacteriaceae,Fusobacterium,Fusobacterium nucleatum,
```
For **taxpath**, can we modify [ncbi_taxdump_utils.py](https://github.com/ctb/2022-assembly-summary-to-lineages/blob/main/ncbi_taxdump_utils.py) to obtain the "taxpath" and store it as another column in the lineages file?
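
One way this could work is to walk parent pointers from the lowest taxid up to the root, keeping the taxid found at each wanted rank. The sketch below does not use the `ncbi_taxdump_utils.py` API; `build_taxpath`, the `parents` and `ranks` dictionaries, and the toy entries are assumptions standing in for whatever the taxdump loader provides.

```python
WANT_RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def build_taxpath(taxid, parents, ranks):
    """Return a '|'-joined taxid path from the top rank down to `taxid`.

    `parents` maps taxid -> parent taxid, `ranks` maps taxid -> rank name.
    Both are hypothetical inputs for this sketch.
    """
    found = {}
    seen = set()
    # Walk upward until we run out of parents or hit a self-parented root.
    while taxid is not None and taxid not in seen:
        seen.add(taxid)
        rank = ranks.get(taxid)
        if rank in WANT_RANKS:
            found[rank] = taxid
        taxid = parents.get(taxid)
    # Missing intermediate ranks stay as empty entries; trailing ones are dropped.
    entries = [str(found.get(rank, "")) for rank in WANT_RANKS]
    while entries and entries[-1] == "":
        entries.pop()
    return "|".join(entries)


# Toy subset of a taxonomy, for illustration only.
parents = {1385: 91061, 91061: 1239, 1239: 2, 2: 131567, 131567: 1, 1: 1}
ranks = {1385: "order", 91061: "class", 1239: "phylum", 2: "superkingdom",
         131567: "no rank", 1: "no rank"}
taxpath = build_taxpath(1385, parents, ranks)
```

The resulting string could then be written out as an extra `taxpath` column next to `taxid`.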
Other tax manipulation:
- Luiz's [gather_to_opal.py](https://github.com/luizirber/2020-cami/blob/master/scripts/gather_to_opal.py#L158-L213)
- Taxonkit
- taxonkit [gtdb-taxdump](https://github.com/shenwei356/gtdb-taxdump) creates new taxids??
### LIN-based summarization
Instead of using named taxonomic ranks for summarization, we can use LINs (Life Identification Numbers).
- WIP [LINSLineageInfo draft PR](https://github.com/sourmash-bio/sourmash/pull/2449)
- **Need:** modify LineageDB to store taxonomic ranks
lineageDB
load_LINS
- check for LIN taxonomy
- summarize and do lins
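
As a rough illustration of what summarizing by LIN could look like, the sketch below sums matched fractions over a shared LIN prefix instead of a named rank. It is not the code in the draft PR; `summarize_by_lin_prefix`, the `(lin, fraction)` input shape, and the `;`-separated LIN strings are all assumptions.

```python
from collections import defaultdict


def summarize_by_lin_prefix(matches, n_positions):
    """Sum fractions for all matches sharing the first `n_positions` of their LIN.

    `matches` is an iterable of (lin, fraction) pairs (hypothetical input shape).
    Returns (prefix, total_fraction) pairs, largest total first.
    """
    totals = defaultdict(float)
    for lin, fraction in matches:
        prefix = ";".join(lin.split(";")[:n_positions])
        totals[prefix] += fraction
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


matches = [("0;0;1;2", 0.5), ("0;0;1;3", 0.25), ("0;1;0;0", 0.125)]
summary = summarize_by_lin_prefix(matches, 3)
```

Here the first two matches collapse into the `0;0;1` prefix, which plays the role a shared genus or species would play in rank-based summarization.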
### To do:
**CAMI:**
build-ncbi-lineages
- [x] obtaining reference taxpaths: Modify https://github.com/ctb/2022-assembly-summary-to-lineages/blob/main/ncbi_taxdump_utils.py#L198 to store the taxpath as well (just use/return separate dictionary)
- Updated workflow here: https://github.com/sourmash-bio/build-ncbi-lineages
`sourmash LineageDB` PR
- [x] store taxids/taxpath in `LineageDB`
- [x] add taxpaths to a test lineage csv
- [x] test out taxpath summarization, etc
`CAMI` PR
- [ ] build CAMI output method for `SummarizedGatherResult`
**LINS:**
`LINS` PR
- add new load function, loadLINS
- `LINLineageInfo` + associated tests