Tax changes
===

## Feb 20 Update

Remaining to do:

### LIN:
- [x] use and test `--LIN-position` with `tax metagenome`
- [x] handle `human` output for `--LIN-taxonomy` and no specified rank
- [x] report at lowest rank to mirror 'species' default
- [x] Change LINLineageInfo for easier "empty" initialization
- [ ] LINgroups ordering: use build_tree + find_lca
    - [ ] better: build tree with just the LINgroups?
        - do I need to go up the dictionary tree? use ctb code but store the whole LineageInfo?? not sure...
- [ ] `tax genome`
    - [ ] use `--LIN-taxonomy` + test
    - [ ] use `--LINgroups` + test
    - [ ] use `--LIN-position` + test
- [ ] `tax grep`?
- [ ] `tax annotate`?
- [ ] cleanup: move some `BaseLineageInfo` methods --> `RankLineageInfo`??
- [ ] **decide 0-indexing vs. 1-indexing LIN position!!!**

### CAMI:
- [ ] make output fn work
- [ ] test output
- [ ] check against current CAMI format (this is based on Luiz's PR last year)

### Taxonomy Lab Meeting
- What is `sourmash taxonomy` how is it useful for folks? (what does it do?)
- What are the limitations?
- Why add `LIN` and `CAMI` functionality?
- demo?
idk idk

## previous to-do

### CAMI example output:
```
##CAMI Submission for Taxonomic Profiling
@Version:0.9.1
@SampleID:SAMPLEID
@Ranks:superkingdom|phylum|class|order|family|genus|species|strain

@@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE
2 superkingdom 2 Bacteria 98.81211
2157 superkingdom 2157 Archaea 1.18789
1239 phylum 2|1239 Bacteria|Firmicutes 59.75801
1224 phylum 2|1224 Bacteria|Proteobacteria 18.94674
28890 phylum 2157|28890 Archaea|Euryarchaeotes 1.18789
91061 class 2|1239|91061 Bacteria|Firmicutes|Bacilli 59.75801
28211 class 2|1224|28211 Bacteria|Proteobacteria|Alphaproteobacteria 18.94674
183925 class 2157|28890|183925 Archaea|Euryarchaeotes|Methanobacteria 1.18789
1385 order 2|1239|91061|1385 Bacteria|Firmicutes|Bacilli|Bacillales 59.75801
356 order 2|1224|28211|356 Bacteria|Proteobacteria|Alphaproteobacteria|Rhizobacteria 10.52311
204455 order 2|1224|28211|204455 Bacteria|Proteobacteria|Alphaproteobacteria|Rhodobacterales 8.42263
2158 order 2157|28890|183925|2158 Archaea|Euryarchaeotes|Methanobacteria|Methanobacteriales 1.18789
```

:::info
To output, we need:
- to have the NCBI `taxids` for all ranks per reference genome (**"taxpath"**).
    - to modify lineageDB to store taxid for each rank (use new LineagePair?)
- For LINS, below, also need to enable LINS storage instead of/in addition to tax ranks
:::

@ctb wrote code for lowest taxid, e.g.
[2022-assembly-summary-to-lineages](https://github.com/ctb/2022-assembly-summary-to-lineages), output:
```
accession,taxid,superkingdom,phylum,class,order,family,genus,species,strain
AAAC01000001,191218,Bacteria,Firmicutes,Bacilli,Bacillales,Bacillaceae,Bacillus,Bacillus anthracis,
AABL01000001,73239,Eukaryota,Apicomplexa,Aconoidasida,Haemosporida,Plasmodiidae,Plasmodium,Plasmodium yoelii,
AABT01000001,285217,Eukaryota,Ascomycota,Eurotiomycetes,Eurotiales,Aspergillaceae,Aspergillus,Aspergillus terreus,
AABF01000001,209882,Bacteria,Fusobacteria,Fusobacteriia,Fusobacteriales,Fusobacteriaceae,Fusobacterium,Fusobacterium nucleatum,
```

For **taxpath**, can we modify [ncbi_taxdump_utils.py](https://github.com/ctb/2022-assembly-summary-to-lineages/blob/main/ncbi_taxdump_utils.py) to obtain and store "taxpath" as another column in the lineages file?

Other tax manipulation:
- Luiz's [gather_to_opal.py](https://github.com/luizirber/2020-cami/blob/master/scripts/gather_to_opal.py#L158-L213)
- Taxonkit
    - taxonkit [gtdb-taxdump](https://github.com/shenwei356/gtdb-taxdump) creates new taxids??

### LIN-based summarization

Instead of using named taxonomic ranks for summarization, we can use LINS.
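
Rough sketch of what "summarize at a LIN position" could look like: group matches by the LIN prefix up to a given position and sum their fractions. Everything here is an assumption for illustration, not the sourmash implementation: the helper names `lin_prefix` and `summarize_at_lin_position` are hypothetical, LINs are assumed to be `;`-separated strings, the position is treated as 0-indexed (still undecided, see the to-do above), and the match data is made up.

```python
from collections import defaultdict


def lin_prefix(lin, position):
    """Return the LIN truncated after `position` (0-indexed, inclusive).

    Hypothetical helper; assumes LINs are ';'-separated strings.
    """
    parts = lin.split(";")
    if not 0 <= position < len(parts):
        raise ValueError(
            f"position {position} out of range for LIN of length {len(parts)}"
        )
    return ";".join(parts[: position + 1])


def summarize_at_lin_position(matches, position):
    """Sum match fractions that share the same LIN prefix.

    `matches` is an iterable of (lin, fraction) pairs.
    """
    totals = defaultdict(float)
    for lin, fraction in matches:
        totals[lin_prefix(lin, position)] += fraction
    return dict(totals)


# made-up example data, not real gather output
matches = [
    ("0;0;0;1", 0.25),
    ("0;0;0;2", 0.25),
    ("0;1;0;0", 0.5),
]
summary = summarize_at_lin_position(matches, position=1)
print(summary)  # {'0;0': 0.5, '0;1': 0.5}
```

With 1-indexing instead, `position=2` would give the same grouping; the only change would be the slice in `lin_prefix`.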
- WIP [LINSLineageInfo draft PR](https://github.com/sourmash-bio/sourmash/pull/2449)
- **Need:** modify LineageDB to store taxonomic ranks

lineageDB load_LINS
- check for LIN taxonomy
- summarize and do lins

### To do:

**CAMI:**

build-ncbi-lineages
- [x] obtaining reference taxpaths: Modify https://github.com/ctb/2022-assembly-summary-to-lineages/blob/main/ncbi_taxdump_utils.py#L198 to store the taxpath as well (just use/return separate dictionary)
    - Updated workflow here: https://github.com/sourmash-bio/build-ncbi-lineages

`sourmash LineageDB` PR
- [x] store taxids/taxpath in `LineageDB`
- [x] add taxpaths to a test lineage csv
- [x] test out taxpath summarization, etc

`CAMI` PR
- [ ] build CAMI output method for `SummarizedGatherResult`

**LINS:**

`LINS` PR
- add new load function, loadLINS
- `LINLineageInfo` + associated tests
-