---
tags: cov-irt
---
# Centrifuge reference database build
---
# UPDATE
> **In getting closer to publication, I split these CoV-IRT microbial subgroup related programs into their [own conda package](https://github.com/AstrobioMike/CoV-IRT-Micro). The custom programs on this page that start with `bit-` will be replaced by versions that just start with `cov-` that are included in that conda install. That should be installed with conda as shown on [that page](https://github.com/AstrobioMike/CoV-IRT-Micro), and the install `bit` instructions below should be ignored.**
---
[toc]
## Installing centrifuge
Creating and installing in a new conda environment:
```bash
# blast is included for dustmasker
conda create -y -n centrifuge -c conda-forge -c bioconda -c defaults centrifuge=1.0.4_beta blast=2.9.0
conda activate centrifuge
```
## Building on 12-Sept-2020
Takes about 2 days for the `centrifuge-build` command as run below.
```bash
# this step took a minute
centrifuge-download -P 50 -o taxonomy taxonomy
# this step took about 30 minutes
centrifuge-download -P 50 -o library -m -d "archaea,bacteria,viral,fungi" refseq > seqid2taxid.map
# adding human (this took about 15 minutes)
centrifuge-download -o library -m -d "vertebrate_mammalian" -a "Chromosome" -t 9606 -c 'reference genome' refseq >> seqid2taxid.map
cat library/archaea/*.fna > input-sequences.fna
cat library/vertebrate_mammalian/*.fna >> input-sequences.fna
cat library/viral/*.fna >> input-sequences.fna
cat library/fungi/*.fna >> input-sequences.fna
cat library/bacteria/*.fna >> input-sequences.fna
# can get rid of these now to recover some space unless wanted
# but typically too many to do it with one command
rm library/archaea/*.fna
rm library/vertebrate_mammalian/*.fna
rm library/viral/*.fna
rm library/fungi/*.fna
rm library/bacteria/*.fna
centrifuge-build -p 50 --bmax 1342177280 --conversion-table seqid2taxid.map \
--taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp \
input-sequences.fna centrifuge-ref-db
```
## Database download
This is ~22GB compressed and 36GB decompressed/unpacked:
```bash
curl -L -o centrifuge-ref-db.tar.gz https://ndownloader.figshare.com/files/24725264
```
Unpacking/decompressing:
```bash
tar -xzvf centrifuge-ref-db.tar.gz
```
## Example run
Downloading one of our samples (CRR125950):
```bash
curl -L -o CRR125950_filtered_1_reads_trim25_1.fq.gz https://osf.io/hg7p4/download
```
Running classification:
```bash
centrifuge -x 12-Sept-2020-centrifuge-ref-db/centrifuge-ref-db -U CRR125950_filtered_1_reads_trim25_1.fq.gz \
-S CRR125950-our-filtered-centrifuge-out.tsv \
--report-file CRR125950-our-filtered-centrifuge-report.tsv \
-k 1 -p 20
```
Running one without setting `-k 1`:
```bash
centrifuge -x 12-Sept-2020-centrifuge-ref-db/centrifuge-ref-db -U CRR125950_filtered_1_reads_trim25_1.fq.gz \
-S CRR125950-our-filtered-centrifuge-out-no-k1.tsv \
--report-file CRR125950-our-filtered-centrifuge-report-no-k1.tsv \
-p 20
```
The `CRR125950-our-filtered-centrifuge-out.tsv` file holds the read-level results:
```bash
head CRR125950-our-filtered-centrifuge-out.tsv | column -ts $'\t' | sed 's/^/# /'
# readID seqID taxID score 2ndBestScore hitLength queryLength numMatches
# A00159:145:H75T2DMXX:1:1101:7735:13792 NC_000011.10 9606 81 0 24 37 1
# A00159:145:H75T2DMXX:1:1101:10755:13573 unclassified 0 0 0 0 162 1
# A00159:145:H75T2DMXX:1:1101:31069:14215 unclassified 0 0 0 0 160 1
# A00159:145:H75T2DMXX:1:1101:3097:13808 unclassified 0 0 0 0 157 1
# A00159:145:H75T2DMXX:1:1101:10158:12383 unclassified 0 0 0 0 150 1
# A00159:145:H75T2DMXX:1:1101:11216:13557 unclassified 0 0 0 0 30 1
# A00159:145:H75T2DMXX:1:1101:1054:14622 unclassified 0 0 0 0 175 1
# A00159:145:H75T2DMXX:1:1101:9986:14559 unclassified 0 0 0 0 170 1
# A00159:145:H75T2DMXX:1:1101:22688:14074 NC_012920.1 9606 121 0 26 26 1
```
The `CRR125950-our-filtered-centrifuge-report.tsv` file holds the taxon-level summary results:
```bash
head CRR125950-our-filtered-centrifuge-report.tsv | column -ts $'\t' | sed 's/^/# /'
# name taxID taxRank genomeSize numReads numUniqueReads abundance
# root 1 no rank 0 6902 6902 0.0
# Bacteria 2 superkingdom 0 642 642 0.0
# Buchnera aphidicola 9 species 626933 15 15 0.0
# Shewanella 22 genus 5140018 1 1 0.0
# Shewanella putrefaciens 24 species 4749735 1 1 0.0
# Myxococcaceae 31 family 9636120 1 1 0.0
# Myxococcus xanthus 34 species 9139763 2 2 0.0
# Corallococcus macrosporus 35 species 8973512 1 1 0.0
# Cystobacter fuscus 43 species 12349744 2 2 0.0
```
And the "Abundance" column in that file sums to 1:
```bash
awk -F $'\t' ' { sum += $7 } END { print sum } ' CRR125950-our-filtered-centrifuge-report.tsv | sed 's/^/# /'
# 1
```
> NOTE
> * this seems to consider only what's classified in this file
> * I'm not entirely sure how it's getting the relative abundances – maybe incorporating read-length and genome-size, but looking at a run with all 100-bp reads didn't make disentangling it any easier
That file also only has the genus and species, we can add the rest with one of my helper programs:
### Adding full lineage info
```bash
# getting taxids
cut -f 2 CRR125950-our-filtered-centrifuge-report.tsv | tail -n +2 > taxids.tmp
# getting lineage info (bit 1.8.10+)
bit-get-lineage-from-taxids -i taxids.tmp -o cent-lineages.tmp
# combining with rel. abundances
paste CRR125950-our-filtered-centrifuge-report.tsv <( cut -f 2- cent-lineages.tmp ) | sed 's/specific_name/species/' > CRR125950-centrifuge-report-with-full-lineages.tsv
rm taxids.tmp cent-lineages.tmp
```
So the table now looks like this:
```bash
head CRR125950-centrifuge-report-with-full-lineages.tsv | column -ts $'\t' | sed 's/^/# /'
# name taxID taxRank genomeSize numReads numUniqueReads abundance domain phylum class order family genus species
# root 1 no rank 0 6902 6902 0.0 NA NA NA NA NA NA NA
# Bacteria 2 superkingdom 0 642 642 0.0 Bacteria NA NA NA NA NA NA
# Buchnera aphidicola 9 species 626933 15 15 0.0 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Erwiniaceae Buchnera Buchnera aphidicola
# Shewanella 22 genus 5140018 1 1 0.0 Bacteria Proteobacteria Gammaproteobacteria Alteromonadales Shewanellaceae Shewanella NA
# Shewanella putrefaciens 24 species 4749735 1 1 0.0 Bacteria Proteobacteria Gammaproteobacteria Alteromonadales Shewanellaceae Shewanella Shewanella putrefaciens
# Myxococcaceae 31 family 9636120 1 1 0.0 Bacteria Proteobacteria Deltaproteobacteria Myxococcales Myxococcaceae NA NA
# Myxococcus xanthus 34 species 9139763 2 2 0.0 Bacteria Proteobacteria Deltaproteobacteria Myxococcales Myxococcaceae Myxococcus Myxococcus xanthus
# Corallococcus macrosporus 35 species 8973512 1 1 0.0 Bacteria Proteobacteria Deltaproteobacteria Myxococcales Myxococcaceae Corallococcus Corallococcus macrosporus
# Cystobacter fuscus 43 species 12349744 2 2 0.0 Bacteria Proteobacteria Deltaproteobacteria Myxococcales Archangiaceae Cystobacter Cystobacter fuscus
```
---
### kraken-style report
Can make a kraken-style report (defined [here](https://github.com/DerrickWood/kraken2/wiki/Manual#sample-report-output-format) which has percentages of reads covered by each taxon (the percentages do not sum to 100 because they are reported for all ranks):
Running on one with `-k 1`:
```bash
centrifuge-kreport -x 12-Sept-2020-centrifuge-ref-db/centrifuge-ref-db CRR125950-our-filtered-centrifuge-out.tsv > CRR125950-our-filtered-centrifuge-kreport.tsv
```
```bash
head CRR125950-our-filtered-centrifuge-kreport.tsv | sed 's/^/# /'
# 79.29 1850506 1850506 U 0 unclassified
# 20.71 483448 6902 - 1 root
# 20.39 475810 0 - 131567 cellular organisms
# 18.64 435010 465 D 2759 Eukaryota
# 18.62 434545 0 - 33154 Opisthokonta
# 18.49 431617 0 K 33208 Metazoa
# 18.49 431617 0 - 6072 Eumetazoa
# 18.49 431617 0 - 33213 Bilateria
# 18.49 431617 0 - 33511 Deuterostomia
# 18.49 431617 0 P 7711 Chordata
```
Column 3 does sum to the total number of reads in the sample (2,333,954):
```bash
awk ' { sum += $3 } END { print sum } ' CRR125950-our-filtered-centrifuge-kreport.tsv | sed 's/^/# /'
# 2333954
```
So we could turn that into a proportion that includes unclassified, but we wouldn't have the same rank for everything, so would need to extend out further.
* E.g. in the above example, we'd have 465 reads assigned to Eukaryota, and those reads wouldn't have anything at the genus or species level if we were trying to summarize at those levels.
* Note to self: could take all rows where column 3 != 0; pull the read count and taxid; get full lineage info; then use the counts on any rank we want to summarize (this would just have "unclassified" for any read that didn't have one at that rank, which would be true at that rank, but maybe not a higher rank)
Running on one with no `-k 1`:
```bash
centrifuge-kreport -x 12-Sept-2020-centrifuge-ref-db/centrifuge-ref-db CRR125950-our-filtered-centrifuge-out-no-k1.tsv > CRR125950-our-filtered-centrifuge-kreport-no-k1.tsv
```
```bash
head CRR125950-our-filtered-centrifuge-kreport-no-k1.tsv | sed 's/^/# /'
# 79.29 1850506 1850506 U 0 unclassified
# 20.71 483448 5420 - 1 root
# 20.45 477297 1621 - 131567 cellular organisms
# 18.64 434946 0 D 2759 Eukaryota
# 18.64 434946 512 - 33154 Opisthokonta
# 18.49 431546 0 K 33208 Metazoa
# 18.49 431546 0 - 6072 Eumetazoa
# 18.49 431546 0 - 33213 Bilateria
# 18.49 431546 0 - 33511 Deuterostomia
# 18.49 431546 0 P 7711 Chordata
```
### Reporting fractions of read assignments instead of LCA
I'm kinda digging the idea behind the `--no-lca` option (see [here](https://github.com/DaehwanKimLab/centrifuge/issues/158#issuecomment-611768504)), seeing what that looks like:
**Running on one with `-k 1`:**
```bash
centrifuge-kreport -x 12-Sept-2020-centrifuge-ref-db/centrifuge-ref-db --no-lca \
CRR125950-our-filtered-centrifuge-out.tsv \
> CRR125950-centrifuge-reformatted-no-lca.tsv
```
That one doesn't do anything different (meaning its output was the same as `CRR125950-our-filtered-centrifuge-kreport.tsv`). I assume this is because it was run with `-k 1`, so there were no assignments to split up into fractions (each read was forced to go to just one).
**Running on one with no `-k 1`:**
```bash
centrifuge-kreport -x 12-Sept-2020-centrifuge-ref-db/centrifuge-ref-db --no-lca \
CRR125950-our-filtered-centrifuge-out-no-k1.tsv \
> CRR125950-centrifuge-reformatted-no-lca-no-k1.tsv
```
```bash
head CRR125950-centrifuge-reformatted-no-lca-no-k1.tsv | sed 's/^/# /'
# 79.29 1850512 1850512 U 0 unclassified
# 20.71 483441 0 - 1 root
# 20.58 480396 0 - 131567 cellular organisms
# 18.80 438853 0 D 2759 Eukaryota
# 18.80 438853 0 - 33154 Opisthokonta
# 18.66 435402 0 K 33208 Metazoa
# 18.66 435402 0 - 6072 Eumetazoa
# 18.66 435402 0 - 33213 Bilateria
# 18.66 435402 0 - 33511 Deuterostomia
# 18.66 435402 0 P 7711 Chordata
```
This comes up a little lower than the total number of reads in the sample (2,333,954):
```bash
awk ' { sum += $3 } END { print sum } ' CRR125950-centrifuge-reformatted-no-lca-no-k1.tsv | sed 's/^/# /'
# 2332719
```
I don't know what to use.
## Running example with all equal read sizes
```bash
conda install -c conda-forge -c bioconda -c defaults fastp
```
Filtering to only keep reads of length 100:
```bash
fastp -i CRR125950_filtered_1_reads_trim25_1.fq.gz -o CRR125950_filtered_1_reads_trim25_1-len-filt.fq.gz --max_len1 100 -l 100 -t 4
```
**Running classification on same-length reads with `-k 1`:**
```bash
centrifuge -x 12-Sept-2020-centrifuge-ref-db/centrifuge-ref-db -U CRR125950_filtered_1_reads_trim25_1-len-filt.fq.gz \
-S CRR125950-our-filtered-centrifuge-out-len-filt.tsv \
--report-file CRR125950-our-filtered-centrifuge-report-len-filt.tsv \
-k 1 -p 20
```
## Running simulated sample from Krista
Grabbing [this one](https://osf.io/r46cm/):
```bash
curl -L -o UnAmbiguouslyMapped_ds.buccal_trim2.fq.gz https://osf.io/r46cm/download
```
And I think the right info file, at the species level, from [here](https://osf.io/m2d8y/):
```bash
curl -L -o Species_Buccal_Truth.csv https://osf.io/m2d8y/download
```
**Running centrifuge with `-k 1` set:**
```bash
centrifuge -x 12-Sept-2020-centrifuge-ref-db/centrifuge-ref-db -U UnAmbiguouslyMapped_ds.buccal_trim2.fq.gz \
-S UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-out.tsv \
--report-file UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-report.tsv \
-k 1 -p 20
```
**Running centrifuge with no `-k` set:**
```bash
centrifuge -x 12-Sept-2020-centrifuge-ref-db/centrifuge-ref-db -U UnAmbiguouslyMapped_ds.buccal_trim2.fq.gz \
-S UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-out-no-k1.tsv \
--report-file UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-report-no-k1.tsv \
-p 20
# and making a kreport for this one
centrifuge-kreport -x 12-Sept-2020-centrifuge-ref-db/centrifuge-ref-db --no-lca UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-out-no-k1.tsv > UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-kreport-no-k1.tsv
```
### Results
Result files can be downloaded with the following:
```bash
curl -L -o UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-report.tsv https://ndownloader.figshare.com/files/24732191
curl -L -o UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-out.tsv https://ndownloader.figshare.com/files/24732194
```
#### With `-k 1` set
Of the 600,000 reads, 25,691 were left unclassified:
```bash
grep -c -w "unclassified" UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-out.tsv | sed 's/^/# /'
# 25691
```
Peeking at the report tab, sorted by relative abundance (these top 20 hold > 99% of the estimated abundance):
```bash
cat <( head -n 1 UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-report.tsv ) <( sort -t $'\t' -grk 7 UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-report.tsv | head -n 20 ) | sed 's/^/|/' | sed "s/\\t/ | /g" | sed 's/$/ |/'
```
|name | taxID | taxRank | genomeSize | numReads | numUniqueReads | abundance |
|-|-|-|-|-|-|-|
|**Streptococcus thermophilus** | 1308 | species | 1844731 | 49994 | 49994 | 0.143208 |
|**Streptococcus pyogenes** | 1314 | species | 1944914 | 45724 | 45724 | 0.124289 |
|**Streptococcus pneumoniae** | 1313 | species | 2114821 | 47055 | 47055 | 0.119754 |
|**Streptococcus mutans** | 1309 | species | 2024183 | 46686 | 46686 | 0.110201 |
|**Streptococcus sanguinis** | 1305 | species | 2388435 | 49893 | 49893 | 0.109841 |
|**Veillonella dispar** | 39778 | species | 2116915 | 45427 | 45427 | 0.107906 |
|**Haemophilus influenzae** | 727 | species | 2278803 | 48421 | 48421 | 0.0980891 |
|**Streptococcus equi** subsp. zooepidemicus MGCS10565 | 552526 | strain | 2024171 | 145 | 145 | 0.0784375 |
|**Streptococcus salivarius** | 1304 | species | 2213879 | 40471 | 40471 | 0.0740379 |
|**Streptococcus dysgalactiae** | 1334 | species | 7257305 | 12007 | 12007 | 0.00710079 |
|Streptococcus pneumoniae CGSP14 | 516950 | strain | 2209198 | 2618 | 2618 | 0.00635523 |
|**Streptococcus pyogenes** MGAS10270 | 370552 | strain | 1928252 | 2250 | 2250 | 0.00587681 |
|Streptococcus salivarius CCHSS3 | 1048332 | strain | 2217184 | 2844 | 2844 | 0.00469715 |
|**Haemophilus parainfluenzae** T3T1 | 862965 | strain | 2086875 | 14188 | 14188 | 0.00360981 |
|Haemophilus parainfluenzae | 729 | species | 2086875 | 27051 | 27051 | 0.00240993 |
|Neisseria subflava | 28449 | species | 4517530 | 11978 | 11978 | 0.00108579 |
|Streptococcus mutans GS-5 | 1198676 | strain | 2027088 | 359 | 359 | 0.000748662 |
|Streptococcus salivarius JIM8777 | 347253 | strain | 2210574 | 683 | 683 | 0.000614505 |
|Streptococcus sp. NCTC 11567 | 2583584 | species | 2147716 | 236 | 236 | 0.000529961 |
|Streptococcus sp. FDAARGOS_192 | 1839799 | species | 2435494 | 488 | 488 | 0.000275644 |
Could take all with an abundance assignment, convert their taxids to lineages, combine at species- or genus-level, then will have to standard ranks...
|name | taxID | taxRank | genomeSize | numReads | numUniqueReads | abundance |
|-|-|-|-|-|-|-|
|Streptococcus thermophilus | 1308 | species | 1844731 | 49994 | 49994 | 0.143208 |
|Streptococcus sanguinis | 1305 | species | 2388435 | 49893 | 49893 | 0.109841 |
|Haemophilus influenzae | 727 | species | 2278803 | 48421 | 48421 | 0.0980891 |
|Streptococcus pneumoniae | 1313 | species | 2114821 | 47055 | 47055 | 0.119754 |
|Streptococcus mutans | 1309 | species | 2024183 | 46686 | 46686 | 0.110201 |
|Streptococcus | 1301 | genus | 2449574 | 46675 | 46675 | 0.0 |
|Streptococcus pyogenes | 1314 | species | 1944914 | 45724 | 45724 | 0.124289 |
|Veillonella dispar | 39778 | species | 2116915 | 45427 | 45427 | 0.107906 |
|Streptococcus salivarius | 1304 | species | 2213879 | 40471 | 40471 | 0.0740379 |
|Streptococcus equi | 1336 | species | 5577323 | 36424 | 36424 | 0.0 |
|Haemophilus parainfluenzae | 729 | species | 2086875 | 27051 | 27051 | 0.00240993 |
|Haemophilus parainfluenzae T3T1 | 862965 | strain | 2086875 | 14188 | 14188 | 0.00360981 |
|Streptococcus dysgalactiae | 1334 | species | 7257305 | 12007 | 12007 | 0.00710079 |
|Neisseria subflava | 28449 | species | 4517530 | 11978 | 11978 | 0.00108579 |
|Streptococcus equi subsp. zooepidemicus | 40041 | subspecies | 19334008 | 6114 | 6114 | 0 |
|Neisseria mucosa | 488 | species | 5008700 | 5479 | 5479 | 0.0 |
|Neisseria | 482 | genus | 2223758 | 4680 | 4680 | 0.0 |
|Veillonella | 29465 | genus | 2132142 | 4427 | 4427 | 0.0 |
|Neisseria flavescens | 484 | species | 2231882 | 3101 | 3101 | 0.0 |
This adds to the complication of understanding their "abundance" column, as anything that has a lower rank present will have a 0 for abundance (e.g. Steptococcus genus, or Streptococcus equi – S. equi, has abundance lower down assigned to a specific strain that has few counts, but gets 8% of the population):
```bash
cat <( head -n 1 UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-report.tsv ) <( grep "Streptococcus equi" UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-report.tsv ) | sed 's/^/|/' | sed "s/\\t/ | /g" | sed 's/$/ |/'
```
|name | taxID | taxRank | genomeSize | numReads | numUniqueReads | abundance |
|-|-|-|-|-|-|-|
|Streptococcus equinus | 1335 | species | 13680225 | 310 | 310 | 5.26736e-05 |
|Streptococcus equi | 1336 | species | 5577323 | 36424 | 36424 | 0.0 |
|Streptococcus equi subsp. zooepidemicus | 40041 | subspecies | 19334008 | 6114 | 6114 | 0 |
|Streptococcus equi subsp. zooepidemicus MGCS10565 | 552526 | strain | 2024171 | 145 | 145 | 0.0784375 |
|Streptococcus equi subsp. equi 4047 | 553482 | strain | 2253793 | 640 | 640 | 0 |
|Streptococcus equi subsp. zooepidemicus CY | 1403449 | strain | 2107382 | 85 | 85 | 0.0 |
#### Without `-k 1` set
Of the 600,000 reads, 25,691 were left unclassified (same):
```bash
grep -c -w "unclassified" UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-out-no-k1.tsv | sed 's/^/# /'
# 25691
```
Peeking at the report tab, sorted by number of reads assigned:
```bash
cat <( head -n 1 UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-report-no-k1.tsv ) <( sort -t $'\t' -nrk6 UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-report-no-k1.tsv ) | head -n 20 | sed 's/^/|/' | sed "s/\\t/ | /g" | sed 's/$/ |/'
```
|name | taxID | taxRank | genomeSize | numReads | numUniqueReads | abundance |
|-|-|-|-|-|-|-|
|Streptococcus thermophilus | 1308 | species | 1844731 | 49994 | 49994 | 0.143208 |
|Streptococcus sanguinis | 1305 | species | 2388435 | 49893 | 49893 | 0.109841 |
|Haemophilus influenzae | 727 | species | 2278803 | 48421 | 48421 | 0.0980891 |
|Streptococcus pneumoniae | 1313 | species | 2114821 | 47055 | 47055 | 0.119754 |
|Streptococcus mutans | 1309 | species | 2024183 | 46686 | 46686 | 0.110201 |
|Streptococcus | 1301 | genus | 2449574 | 46675 | 46675 | 0.0 |
|Streptococcus pyogenes | 1314 | species | 1944914 | 45724 | 45724 | 0.124289 |
|Veillonella dispar | 39778 | species | 2116915 | 45427 | 45427 | 0.107906 |
|Streptococcus salivarius | 1304 | species | 2213879 | 40471 | 40471 | 0.0740379 |
|Streptococcus equi | 1336 | species | 5577323 | 36424 | 36424 | 0.0 |
|Haemophilus parainfluenzae | 729 | species | 2086875 | 27051 | 27051 | 0.00240993 |
|Haemophilus parainfluenzae T3T1 | 862965 | strain | 2086875 | 14188 | 14188 | 0.00360981 |
|Streptococcus dysgalactiae | 1334 | species | 7257305 | 12007 | 12007 | 0.00710079 |
|Neisseria subflava | 28449 | species | 4517530 | 11978 | 11978 | 0.00108579 |
|Streptococcus equi subsp. zooepidemicus | 40041 | subspecies | 19334008 | 6114 | 6114 | 0 |
|Neisseria mucosa | 488 | species | 5008700 | 5479 | 5479 | 0.0 |
|Neisseria | 482 | genus | 2223758 | 4680 | 4680 | 0.0 |
|Veillonella | 29465 | genus | 2132142 | 4427 | 4427 | 0.0 |
|Neisseria flavescens | 484 | species | 2231882 | 3101 | 3101 | 0.0 |
This adds to the complication of understanding their "abundance" column, as anything that has a lower rank present will have a 0 for abundance (e.g. Steptococcus genus, or Streptococcus equi – S. equi, has abundance lower down assigned to a specific strain that has few counts, but gets 8% of the population):
```bash
cat <( head -n 1 UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-report.tsv ) <( grep "Streptococcus equi" UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-report.tsv ) | sed 's/^/|/' | sed "s/\\t/ | /g" | sed 's/$/ |/'
```
|name | taxID | taxRank | genomeSize | numReads | numUniqueReads | abundance |
|-|-|-|-|-|-|-|
|Streptococcus equinus | 1335 | species | 13680225 | 310 | 310 | 5.26736e-05 |
|Streptococcus equi | 1336 | species | 5577323 | 36424 | 36424 | 0.0 |
|Streptococcus equi subsp. zooepidemicus | 40041 | subspecies | 19334008 | 6114 | 6114 | 0 |
|Streptococcus equi subsp. zooepidemicus MGCS10565 | 552526 | strain | 2024171 | 145 | 145 | 0.0784375 |
|Streptococcus equi subsp. equi 4047 | 553482 | strain | 2253793 | 640 | 640 | 0 |
|Streptococcus equi subsp. zooepidemicus CY | 1403449 | strain | 2107382 | 85 | 85 | 0.0 |