In getting closer to publication, I split these CoV-IRT microbial subgroup related programs into their own conda package. The custom programs on this page that start with bit- will be replaced by versions that just start with cov- that are included in that conda install. That package should be installed with conda as shown on that page, and the bit install instructions below should be ignored.
Creating and installing in a new conda environment:
Takes about 2 days for the centrifuge-build command as run below.
This is ~22GB compressed and 36GB decompressed/unpacked:
Unpacking/decompressing:
Downloading one of our samples (CRR125950):
Running classification:
Running one without setting -k 1:
The CRR125950-our-filtered-centrifuge-out.tsv file holds the read-level results:
The CRR125950-our-filtered-centrifuge-report.tsv file holds the taxon-level summary results:
And the "Abundance" column in that file sums to 1:
NOTE
- this seems to consider only what's classified in this file
- I'm not entirely sure how it's getting the relative abundances – maybe incorporating read-length and genome-size, but looking at a run with all 100-bp reads didn't make disentangling it any easier
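The sum-to-1 check on the "Abundance" column can be sketched in Python as below. This is a minimal sketch, assuming the report is tab-separated with the header names shown in the report tables on this page (name, taxID, taxRank, genomeSize, numReads, numUniqueReads, abundance). The function name and the three example rows are made up for illustration and are not real results.

```python
import csv
import io


def sum_abundance(report_text):
    """Sum the 'abundance' column of a centrifuge report given as TSV text."""
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    return sum(float(row["abundance"]) for row in reader)


# Made-up three-row report with the same columns as the real file (not real data)
example = (
    "name\ttaxID\ttaxRank\tgenomeSize\tnumReads\tnumUniqueReads\tabundance\n"
    "taxon A\t1\tspecies\t2000000\t600\t600\t0.75\n"
    "taxon B\t2\tspecies\t4000000\t400\t400\t0.25\n"
    "taxon C\t3\tgenus\t3000000\t50\t50\t0.0\n"
)

print(sum_abundance(example))  # prints 1.0
```

For the real file, the same function could be given the contents of CRR125950-our-filtered-centrifuge-report.tsv read as text. Note that taxon C has reads but contributes nothing to the sum, which is the "only what's classified in this file" behaviour described in the note above.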
That file also only has the genus and species; we can add the rest with one of my helper programs:
So the table now looks like this:
Can make a kraken-style report (defined here), which has percentages of reads covered by each taxon (the percentages do not sum to 100 because they are reported for all ranks):
Running on one with -k 1:
Column 3 does sum to the total number of reads in the sample (2,333,954):
So we could turn that into a proportion that includes unclassified, but we wouldn't have the same rank for everything, so would need to extend out further.
* E.g. in the above example, we'd have 465 reads assigned to Eukaryota, and those reads wouldn't have anything at the genus or species level if we were trying to summarize at those levels.
* Note to self: could take all rows where column 3 != 0; pull the read count and taxid; get full lineage info; then use the counts on any rank we want to summarize (this would just have "unclassified" for any read that didn't have one at that rank, which would be true at that rank, but maybe not a higher rank)
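The note-to-self above could be sketched roughly as follows. This assumes the usual six tab-separated kraken-style report columns (percent, clade reads, reads assigned directly, rank code, taxid, name). The `lineages` dictionary is a hypothetical stand-in for whatever lineage lookup is used (e.g. the helper program mentioned earlier), and the example rows are made up for illustration.

```python
from collections import Counter


def direct_counts(kreport_lines):
    """Return {taxid: reads assigned directly} for rows where column 3 != 0."""
    counts = {}
    for line in kreport_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        direct = int(fields[2])
        if direct != 0:
            counts[fields[4].strip()] = direct
    return counts


def summarize_at_rank(counts, lineages, rank):
    """Sum direct read counts by the name each taxid has at `rank`.

    Taxids with no entry at that rank are pooled as "unclassified".
    """
    totals = Counter()
    for taxid, n in counts.items():
        name = lineages.get(taxid, {}).get(rank, "unclassified")
        totals[name] += n
    return totals


# Made-up kreport rows (not real results)
kreport = [
    "  4.00\t40\t40\tU\t0\tunclassified",
    "  6.00\t60\t60\tD\t2759\tEukaryota",
    " 90.00\t900\t300\tG\t1301\tStreptococcus",
    " 60.00\t600\t600\tS\t1308\tStreptococcus thermophilus",
]

# Hypothetical lineage lookup: taxid -> {rank: name}
lineages = {
    "2759": {"domain": "Eukaryota"},
    "1301": {"domain": "Bacteria", "genus": "Streptococcus"},
    "1308": {"domain": "Bacteria", "genus": "Streptococcus",
             "species": "Streptococcus thermophilus"},
}

counts = direct_counts(kreport)
genus_totals = summarize_at_rank(counts, lineages, "genus")
total_reads = sum(counts.values())
for name, n in genus_totals.items():
    print(name, n, n / total_reads)
```

In this toy example, Streptococcus gets 900 of the 1,000 reads at the genus level, and the 60 Eukaryota reads join the 40 truly unclassified reads under "unclassified", since they have nothing at that rank.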
Running on one with no -k 1:
I'm kinda digging the idea behind the --no-lca option (see here), seeing what that looks like:
Running on one with -k 1:
That one doesn't do anything different (meaning its output was the same as CRR125950-our-filtered-centrifuge-kreport.tsv). I assume this is because it was run with -k 1, so there were no assignments to split up into fractions (each read was forced to go to just one).
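The reasoning about fractions can be illustrated with a small sketch. This is only my reading of "split up into fractions" (each read shares one count equally among the taxa it was assigned to), not the actual centrifuge-kreport code. The function name, read IDs, and inputs are made up for illustration.

```python
from collections import defaultdict


def fractional_counts(assignments):
    """Split one count per read equally among the taxids it was assigned to.

    `assignments` is an iterable of (read_id, taxid) pairs, one per reported hit.
    """
    hits = defaultdict(list)
    for read_id, taxid in assignments:
        hits[read_id].append(taxid)

    totals = defaultdict(float)
    for taxids in hits.values():
        share = 1.0 / len(taxids)
        for taxid in taxids:
            totals[taxid] += share
    return dict(totals)


# One hit per read (as with -k 1): every share is a whole count
single = [("read1", "1308"), ("read2", "1314")]
print(fractional_counts(single))

# read1 has two hits, so it gives half a count to each taxid
multi = [("read1", "1308"), ("read1", "1314"), ("read2", "1308")]
print(fractional_counts(multi))
```

With one hit per read, the fractional counts are just the whole-read counts, which would be consistent with the two reports coming out identical.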
Running on one with no -k 1:
This comes up a little lower than the total number of reads in the sample (2,333,954):
I don't know what to use.
Filtering to only keep reads of length 100:
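A minimal Python sketch of this kind of length filter, assuming an uncompressed FASTQ with exactly four lines per record. The function name and the two example reads are hypothetical.

```python
import io


def filter_fastq_by_length(handle_in, handle_out, length=100):
    """Copy only records whose sequence is exactly `length` bases.

    Returns the number of records kept. Assumes four lines per record.
    """
    kept = 0
    while True:
        record = [handle_in.readline() for _ in range(4)]
        if not record[0]:
            break  # end of file
        if len(record[1].strip()) == length:
            handle_out.writelines(record)
            kept += 1
    return kept


# Two made-up reads: one of 100 bases, one of 50 bases
fastq_in = io.StringIO(
    "@read1\n" + "A" * 100 + "\n+\n" + "I" * 100 + "\n"
    "@read2\n" + "A" * 50 + "\n+\n" + "I" * 50 + "\n"
)
fastq_out = io.StringIO()
n_kept = filter_fastq_by_length(fastq_in, fastq_out, length=100)
print(n_kept)  # prints 1
```

For real files, the two `io.StringIO` objects would be replaced with opened file handles (or `gzip.open(..., "rt")` handles if the reads are compressed).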
Running classification on same-length reads with -k 1:
Grabbing this one:
And I think the right info file, at the species level, from here:
Running centrifuge with -k 1 set:
Running centrifuge with no -k set:
Result files can be downloaded with the following:
-k 1 set
Of the 600,000 reads, 25,691 were left unclassified:
Peeking at the report tab, sorted by relative abundance (these top 20 hold > 99% of the estimated abundance):
name | taxID | taxRank | genomeSize | numReads | numUniqueReads | abundance |
---|---|---|---|---|---|---|
Streptococcus thermophilus | 1308 | species | 1844731 | 49994 | 49994 | 0.143208 |
Streptococcus pyogenes | 1314 | species | 1944914 | 45724 | 45724 | 0.124289 |
Streptococcus pneumoniae | 1313 | species | 2114821 | 47055 | 47055 | 0.119754 |
Streptococcus mutans | 1309 | species | 2024183 | 46686 | 46686 | 0.110201 |
Streptococcus sanguinis | 1305 | species | 2388435 | 49893 | 49893 | 0.109841 |
Veillonella dispar | 39778 | species | 2116915 | 45427 | 45427 | 0.107906 |
Haemophilus influenzae | 727 | species | 2278803 | 48421 | 48421 | 0.0980891 |
Streptococcus equi subsp. zooepidemicus MGCS10565 | 552526 | strain | 2024171 | 145 | 145 | 0.0784375 |
Streptococcus salivarius | 1304 | species | 2213879 | 40471 | 40471 | 0.0740379 |
Streptococcus dysgalactiae | 1334 | species | 7257305 | 12007 | 12007 | 0.00710079 |
Streptococcus pneumoniae CGSP14 | 516950 | strain | 2209198 | 2618 | 2618 | 0.00635523 |
Streptococcus pyogenes MGAS10270 | 370552 | strain | 1928252 | 2250 | 2250 | 0.00587681 |
Streptococcus salivarius CCHSS3 | 1048332 | strain | 2217184 | 2844 | 2844 | 0.00469715 |
Haemophilus parainfluenzae T3T1 | 862965 | strain | 2086875 | 14188 | 14188 | 0.00360981 |
Haemophilus parainfluenzae | 729 | species | 2086875 | 27051 | 27051 | 0.00240993 |
Neisseria subflava | 28449 | species | 4517530 | 11978 | 11978 | 0.00108579 |
Streptococcus mutans GS-5 | 1198676 | strain | 2027088 | 359 | 359 | 0.000748662 |
Streptococcus salivarius JIM8777 | 347253 | strain | 2210574 | 683 | 683 | 0.000614505 |
Streptococcus sp. NCTC 11567 | 2583584 | species | 2147716 | 236 | 236 | 0.000529961 |
Streptococcus sp. FDAARGOS_192 | 1839799 | species | 2435494 | 488 | 488 | 0.000275644 |
Could take all with an abundance assignment, convert their taxids to lineages, combine at species- or genus-level, then will have to standard ranks…
Peeking at the report tab again, this time sorted by number of reads assigned:
name | taxID | taxRank | genomeSize | numReads | numUniqueReads | abundance |
---|---|---|---|---|---|---|
Streptococcus thermophilus | 1308 | species | 1844731 | 49994 | 49994 | 0.143208 |
Streptococcus sanguinis | 1305 | species | 2388435 | 49893 | 49893 | 0.109841 |
Haemophilus influenzae | 727 | species | 2278803 | 48421 | 48421 | 0.0980891 |
Streptococcus pneumoniae | 1313 | species | 2114821 | 47055 | 47055 | 0.119754 |
Streptococcus mutans | 1309 | species | 2024183 | 46686 | 46686 | 0.110201 |
Streptococcus | 1301 | genus | 2449574 | 46675 | 46675 | 0.0 |
Streptococcus pyogenes | 1314 | species | 1944914 | 45724 | 45724 | 0.124289 |
Veillonella dispar | 39778 | species | 2116915 | 45427 | 45427 | 0.107906 |
Streptococcus salivarius | 1304 | species | 2213879 | 40471 | 40471 | 0.0740379 |
Streptococcus equi | 1336 | species | 5577323 | 36424 | 36424 | 0.0 |
Haemophilus parainfluenzae | 729 | species | 2086875 | 27051 | 27051 | 0.00240993 |
Haemophilus parainfluenzae T3T1 | 862965 | strain | 2086875 | 14188 | 14188 | 0.00360981 |
Streptococcus dysgalactiae | 1334 | species | 7257305 | 12007 | 12007 | 0.00710079 |
Neisseria subflava | 28449 | species | 4517530 | 11978 | 11978 | 0.00108579 |
Streptococcus equi subsp. zooepidemicus | 40041 | subspecies | 19334008 | 6114 | 6114 | 0 |
Neisseria mucosa | 488 | species | 5008700 | 5479 | 5479 | 0.0 |
Neisseria | 482 | genus | 2223758 | 4680 | 4680 | 0.0 |
Veillonella | 29465 | genus | 2132142 | 4427 | 4427 | 0.0 |
Neisseria flavescens | 484 | species | 2231882 | 3101 | 3101 | 0.0 |
This adds to the complication of understanding their "abundance" column, as anything that has a lower rank present will have a 0 for abundance (e.g. the Streptococcus genus, or Streptococcus equi: S. equi has its abundance assigned lower down, to a specific strain that has few counts but gets ~8% of the population):
name | taxID | taxRank | genomeSize | numReads | numUniqueReads | abundance |
---|---|---|---|---|---|---|
Streptococcus equinus | 1335 | species | 13680225 | 310 | 310 | 5.26736e-05 |
Streptococcus equi | 1336 | species | 5577323 | 36424 | 36424 | 0.0 |
Streptococcus equi subsp. zooepidemicus | 40041 | subspecies | 19334008 | 6114 | 6114 | 0 |
Streptococcus equi subsp. zooepidemicus MGCS10565 | 552526 | strain | 2024171 | 145 | 145 | 0.0784375 |
Streptococcus equi subsp. equi 4047 | 553482 | strain | 2253793 | 640 | 640 | 0 |
Streptococcus equi subsp. zooepidemicus CY | 1403449 | strain | 2107382 | 85 | 85 | 0.0 |
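One way to get around the zeros at higher ranks is the idea floated earlier: take every row with a non-zero abundance, convert its taxid to a lineage, and sum at the species level. A minimal sketch using the taxids and abundances from the table above. The `species_of` lookup is a hypothetical stand-in for a real taxid-to-lineage conversion, filled in by hand here from the taxon names.

```python
from collections import defaultdict


def abundance_by_species(rows, species_of):
    """Sum non-zero abundances by the species each taxid belongs to."""
    totals = defaultdict(float)
    for taxid, abundance in rows:
        if abundance > 0:
            totals[species_of.get(taxid, "unclassified")] += abundance
    return dict(totals)


# (taxID, abundance) pairs copied from the table above
rows = [
    ("1335", 5.26736e-05),
    ("1336", 0.0),
    ("40041", 0.0),
    ("552526", 0.0784375),
    ("553482", 0.0),
    ("1403449", 0.0),
]

# Hypothetical hand-made lookup: taxid -> species name
species_of = {
    "1335": "Streptococcus equinus",
    "1336": "Streptococcus equi",
    "40041": "Streptococcus equi",
    "552526": "Streptococcus equi",
    "553482": "Streptococcus equi",
    "1403449": "Streptococcus equi",
}

print(abundance_by_species(rows, species_of))
```

Summed this way, S. equi carries the 0.0784375 that the report places on the single MGCS10565 strain. This only relocates the abundance values to a standard rank; it does not explain how they were estimated in the first place.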
-k 1 set
Of the 600,000 reads, 25,691 were left unclassified (same):
Peeking at the report tab, sorted by number of reads assigned:
name | taxID | taxRank | genomeSize | numReads | numUniqueReads | abundance |
---|---|---|---|---|---|---|
Streptococcus thermophilus | 1308 | species | 1844731 | 49994 | 49994 | 0.143208 |
Streptococcus sanguinis | 1305 | species | 2388435 | 49893 | 49893 | 0.109841 |
Haemophilus influenzae | 727 | species | 2278803 | 48421 | 48421 | 0.0980891 |
Streptococcus pneumoniae | 1313 | species | 2114821 | 47055 | 47055 | 0.119754 |
Streptococcus mutans | 1309 | species | 2024183 | 46686 | 46686 | 0.110201 |
Streptococcus | 1301 | genus | 2449574 | 46675 | 46675 | 0.0 |
Streptococcus pyogenes | 1314 | species | 1944914 | 45724 | 45724 | 0.124289 |
Veillonella dispar | 39778 | species | 2116915 | 45427 | 45427 | 0.107906 |
Streptococcus salivarius | 1304 | species | 2213879 | 40471 | 40471 | 0.0740379 |
Streptococcus equi | 1336 | species | 5577323 | 36424 | 36424 | 0.0 |
Haemophilus parainfluenzae | 729 | species | 2086875 | 27051 | 27051 | 0.00240993 |
Haemophilus parainfluenzae T3T1 | 862965 | strain | 2086875 | 14188 | 14188 | 0.00360981 |
Streptococcus dysgalactiae | 1334 | species | 7257305 | 12007 | 12007 | 0.00710079 |
Neisseria subflava | 28449 | species | 4517530 | 11978 | 11978 | 0.00108579 |
Streptococcus equi subsp. zooepidemicus | 40041 | subspecies | 19334008 | 6114 | 6114 | 0 |
Neisseria mucosa | 488 | species | 5008700 | 5479 | 5479 | 0.0 |
Neisseria | 482 | genus | 2223758 | 4680 | 4680 | 0.0 |
Veillonella | 29465 | genus | 2132142 | 4427 | 4427 | 0.0 |
Neisseria flavescens | 484 | species | 2231882 | 3101 | 3101 | 0.0 |
This adds to the complication of understanding their "abundance" column, as anything that has a lower rank present will have a 0 for abundance (e.g. the Streptococcus genus, or Streptococcus equi: S. equi has its abundance assigned lower down, to a specific strain that has few counts but gets ~8% of the population):
name | taxID | taxRank | genomeSize | numReads | numUniqueReads | abundance |
---|---|---|---|---|---|---|
Streptococcus equinus | 1335 | species | 13680225 | 310 | 310 | 5.26736e-05 |
Streptococcus equi | 1336 | species | 5577323 | 36424 | 36424 | 0.0 |
Streptococcus equi subsp. zooepidemicus | 40041 | subspecies | 19334008 | 6114 | 6114 | 0 |
Streptococcus equi subsp. zooepidemicus MGCS10565 | 552526 | strain | 2024171 | 145 | 145 | 0.0784375 |
Streptococcus equi subsp. equi 4047 | 553482 | strain | 2253793 | 640 | 640 | 0 |
Streptococcus equi subsp. zooepidemicus CY | 1403449 | strain | 2107382 | 85 | 85 | 0.0 |