Try   HackMD

Centrifuge reference database build


UPDATE

In getting closer to publication, I split these CoV-IRT microbial subgroup related programs into their own conda package. The custom programs on this page that start with bit- will be replaced by versions that just start with cov- that are included in that conda install. That should be installed with conda as shown on that page, and the install bit instructions below should be ignored.


Installing centrifuge

Creating and installing in a new conda environment:

  # blast is included for dustmasker
conda create -y -n centrifuge -c conda-forge -c bioconda -c defaults centrifuge=1.0.4_beta blast=2.9.0

conda activate centrifuge

Building on 12-Sept-2020

Takes about 2 days for the centrifuge-build command as run below.

  # this step took a minute
centrifuge-download -P 50 -o taxonomy taxonomy
  # this step took about 30 minutes
centrifuge-download -P 50 -o library -m -d "archaea,bacteria,viral,fungi" refseq > seqid2taxid.map

  # adding human (this took about 15 minutes)
centrifuge-download -o library -m -d "vertebrate_mammalian" -a "Chromosome" -t 9606 -c 'reference genome' refseq >> seqid2taxid.map

cat library/archaea/*.fna > input-sequences.fna
cat library/vertebrate_mammalian/*.fna >> input-sequences.fna
cat library/viral/*.fna >> input-sequences.fna
cat library/fungi/*.fna >> input-sequences.fna
cat library/bacteria/*.fna >> input-sequences.fna

# can get rid of these now to recover some space unless wanted
  # but typically too many to do it with one command
rm library/archaea/*.fna
rm library/vertebrate_mammalian/*.fna
rm library/viral/*.fna
rm library/fungi/*.fna
rm library/bacteria/*.fna

centrifuge-build -p 50 --bmax 1342177280 --conversion-table seqid2taxid.map \
                 --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp \
                 input-sequences.fna centrifuge-ref-db

Database download

This is ~22GB compressed and 36GB decompressed/unpacked:

curl -L -o centrifuge-ref-db.tar.gz https://ndownloader.figshare.com/files/24725264

Unpacking/decompressing:

tar -xzvf centrifuge-ref-db.tar.gz

Example run

Downloading one of our samples (CRR125950):

curl -L -o CRR125950_filtered_1_reads_trim25_1.fq.gz https://osf.io/hg7p4/download

Running classification:

centrifuge -x 12-Sept-2020-centrifuge-ref-db/centrifuge-ref-db -U CRR125950_filtered_1_reads_trim25_1.fq.gz \
           -S CRR125950-our-filtered-centrifuge-out.tsv \
           --report-file CRR125950-our-filtered-centrifuge-report.tsv \
           -k 1 -p 20

Running one without setting -k 1:

centrifuge -x 12-Sept-2020-centrifuge-ref-db/centrifuge-ref-db -U CRR125950_filtered_1_reads_trim25_1.fq.gz \
           -S CRR125950-our-filtered-centrifuge-out-no-k1.tsv \
           --report-file CRR125950-our-filtered-centrifuge-report-no-k1.tsv \
           -p 20

The CRR125950-our-filtered-centrifuge-out.tsv file holds the read-level results:

head CRR125950-our-filtered-centrifuge-out.tsv | column -ts $'\t' | sed 's/^/# /'
# readID                                   seqID         taxID  score  2ndBestScore  hitLength  queryLength  numMatches
# A00159:145:H75T2DMXX:1:1101:7735:13792   NC_000011.10  9606   81     0             24         37           1
# A00159:145:H75T2DMXX:1:1101:10755:13573  unclassified  0      0      0             0          162          1
# A00159:145:H75T2DMXX:1:1101:31069:14215  unclassified  0      0      0             0          160          1
# A00159:145:H75T2DMXX:1:1101:3097:13808   unclassified  0      0      0             0          157          1
# A00159:145:H75T2DMXX:1:1101:10158:12383  unclassified  0      0      0             0          150          1
# A00159:145:H75T2DMXX:1:1101:11216:13557  unclassified  0      0      0             0          30           1
# A00159:145:H75T2DMXX:1:1101:1054:14622   unclassified  0      0      0             0          175          1
# A00159:145:H75T2DMXX:1:1101:9986:14559   unclassified  0      0      0             0          170          1
# A00159:145:H75T2DMXX:1:1101:22688:14074  NC_012920.1   9606   121    0             26         26           1

The CRR125950-our-filtered-centrifuge-report.tsv file holds the taxon-level summary results:

head CRR125950-our-filtered-centrifuge-report.tsv | column -ts $'\t' | sed 's/^/# /'
# name                       taxID  taxRank       genomeSize  numReads  numUniqueReads  abundance
# root                       1      no rank       0           6902      6902            0.0
# Bacteria                   2      superkingdom  0           642       642             0.0
# Buchnera aphidicola        9      species       626933      15        15              0.0
# Shewanella                 22     genus         5140018     1         1               0.0
# Shewanella putrefaciens    24     species       4749735     1         1               0.0
# Myxococcaceae              31     family        9636120     1         1               0.0
# Myxococcus xanthus         34     species       9139763     2         2               0.0
# Corallococcus macrosporus  35     species       8973512     1         1               0.0
# Cystobacter fuscus         43     species       12349744    2         2               0.0

And the "Abundance" column in that file sums to 1:

awk -F $'\t' ' { sum += $7 } END { print sum } ' CRR125950-our-filtered-centrifuge-report.tsv | sed 's/^/# /'
# 1

NOTE

  • this seems to consider only what's classified in this file
  • I'm not entirely sure how it's getting the relative abundances – maybe incorporating read-length and genome-size, but looking at a run with all 100-bp reads didn't make disentangling it any easier

That file also only has the genus and species, we can add the rest with one of my helper programs:

Adding full lineage info

    # getting taxids
cut -f 2 CRR125950-our-filtered-centrifuge-report.tsv | tail -n +2 > taxids.tmp

    # getting lineage info (bit 1.8.10+)
bit-get-lineage-from-taxids -i taxids.tmp -o cent-lineages.tmp

    # combining with rel. abundances
paste CRR125950-our-filtered-centrifuge-report.tsv <( cut -f 2- cent-lineages.tmp ) | sed 's/specific_name/species/' > CRR125950-centrifuge-report-with-full-lineages.tsv

rm taxids.tmp cent-lineages.tmp

So the table now looks like this:

head CRR125950-centrifuge-report-with-full-lineages.tsv | column -ts $'\t' | sed 's/^/# /'
# name                       taxID  taxRank       genomeSize  numReads  numUniqueReads  abundance  domain    phylum          class                order             family          genus          species
# root                       1      no rank       0           6902      6902            0.0        NA        NA              NA                   NA                NA              NA             NA
# Bacteria                   2      superkingdom  0           642       642             0.0        Bacteria  NA              NA                   NA                NA              NA             NA
# Buchnera aphidicola        9      species       626933      15        15              0.0        Bacteria  Proteobacteria  Gammaproteobacteria  Enterobacterales  Erwiniaceae     Buchnera       Buchnera aphidicola
# Shewanella                 22     genus         5140018     1         1               0.0        Bacteria  Proteobacteria  Gammaproteobacteria  Alteromonadales   Shewanellaceae  Shewanella     NA
# Shewanella putrefaciens    24     species       4749735     1         1               0.0        Bacteria  Proteobacteria  Gammaproteobacteria  Alteromonadales   Shewanellaceae  Shewanella     Shewanella putrefaciens
# Myxococcaceae              31     family        9636120     1         1               0.0        Bacteria  Proteobacteria  Deltaproteobacteria  Myxococcales      Myxococcaceae   NA             NA
# Myxococcus xanthus         34     species       9139763     2         2               0.0        Bacteria  Proteobacteria  Deltaproteobacteria  Myxococcales      Myxococcaceae   Myxococcus     Myxococcus xanthus
# Corallococcus macrosporus  35     species       8973512     1         1               0.0        Bacteria  Proteobacteria  Deltaproteobacteria  Myxococcales      Myxococcaceae   Corallococcus  Corallococcus macrosporus
# Cystobacter fuscus         43     species       12349744    2         2               0.0        Bacteria  Proteobacteria  Deltaproteobacteria  Myxococcales      Archangiaceae   Cystobacter    Cystobacter fuscus

kraken-style report

Can make a kraken-style report (defined here which has percentages of reads covered by each taxon (the percentages do not sum to 100 because they are reported for all ranks):

Running on one with -k 1:

centrifuge-kreport -x 12-Sept-2020-centrifuge-ref-db/centrifuge-ref-db CRR125950-our-filtered-centrifuge-out.tsv > CRR125950-our-filtered-centrifuge-kreport.tsv
head CRR125950-our-filtered-centrifuge-kreport.tsv | sed 's/^/# /'
#  79.29        1850506 1850506 U       0       unclassified
#  20.71        483448  6902    -       1       root
#  20.39        475810  0       -       131567    cellular organisms
#  18.64        435010  465     D       2759        Eukaryota
#  18.62        434545  0       -       33154         Opisthokonta
#  18.49        431617  0       K       33208           Metazoa
#  18.49        431617  0       -       6072              Eumetazoa
#  18.49        431617  0       -       33213               Bilateria
#  18.49        431617  0       -       33511                 Deuterostomia
#  18.49        431617  0       P       7711                    Chordata

Column 3 does sum to the total number of reads in the sample (2,333,954):

awk ' { sum += $3 } END { print sum } ' CRR125950-our-filtered-centrifuge-kreport.tsv | sed 's/^/# /'
# 2333954

So we could turn that into a proportion that includes unclassified, but we wouldn't have the same rank for everything, so would need to extend out further.
* E.g. in the above example, we'd have 465 reads assigned to Eukaryota, and those reads wouldn't have anything at the genus or species level if we were trying to summarize at those levels.
* Note to self: could take all rows where column 3 != 0; pull the read count and taxid; get full lineage info; then use the counts on any rank we want to summarize (this would just have "unclassified" for any read that didn't have one at that rank, which would be true at that rank, but maybe not a higher rank)

Running on one with no -k 1:

centrifuge-kreport -x 12-Sept-2020-centrifuge-ref-db/centrifuge-ref-db CRR125950-our-filtered-centrifuge-out-no-k1.tsv > CRR125950-our-filtered-centrifuge-kreport-no-k1.tsv
head CRR125950-our-filtered-centrifuge-kreport-no-k1.tsv | sed 's/^/# /'
#  79.29        1850506 1850506 U       0       unclassified
#  20.71        483448  5420    -       1       root
#  20.45        477297  1621    -       131567    cellular organisms
#  18.64        434946  0       D       2759        Eukaryota
#  18.64        434946  512     -       33154         Opisthokonta
#  18.49        431546  0       K       33208           Metazoa
#  18.49        431546  0       -       6072              Eumetazoa
#  18.49        431546  0       -       33213               Bilateria
#  18.49        431546  0       -       33511                 Deuterostomia
#  18.49        431546  0       P       7711                    Chordata

Reporting fractions of read assignments instead of LCA

I'm kinda digging the idea behind the --no-lca option (see here), seeing what that looks like:

Running on one with -k 1:

centrifuge-kreport -x 12-Sept-2020-centrifuge-ref-db/centrifuge-ref-db --no-lca \
                   CRR125950-our-filtered-centrifuge-out.tsv \
                   > CRR125950-centrifuge-reformatted-no-lca.tsv

That one doesn't do anything different (meaning its output was the same as CRR125950-our-filtered-centrifuge-kreport.tsv). I assume this is because it was run with -k 1, so there were no assignments to split up into fractions (each read was forced to go to just one).

Running on one with no -k 1:

centrifuge-kreport -x 12-Sept-2020-centrifuge-ref-db/centrifuge-ref-db --no-lca \
                   CRR125950-our-filtered-centrifuge-out-no-k1.tsv \
                   > CRR125950-centrifuge-reformatted-no-lca-no-k1.tsv
head CRR125950-centrifuge-reformatted-no-lca-no-k1.tsv | sed 's/^/# /'
#  79.29	1850512	1850512	U	0	unclassified
#  20.71	483441	0	-	1	root
#  20.58	480396	0	-	131567	  cellular organisms
#  18.80	438853	0	D	2759	    Eukaryota
#  18.80	438853	0	-	33154	      Opisthokonta
#  18.66	435402	0	K	33208	        Metazoa
#  18.66	435402	0	-	6072	          Eumetazoa
#  18.66	435402	0	-	33213	            Bilateria
#  18.66	435402	0	-	33511	              Deuterostomia
#  18.66	435402	0	P	7711	                Chordata

This comes up a little lower than the total number of reads in the sample (2,333,954):

awk ' { sum += $3 } END { print sum } ' CRR125950-centrifuge-reformatted-no-lca-no-k1.tsv | sed 's/^/# /'
# 2332719

I don't know what to use.

Running example with all equal read sizes

conda install -c conda-forge -c bioconda -c defaults fastp

Filtering to only keep reads of length 100:

fastp -i CRR125950_filtered_1_reads_trim25_1.fq.gz -o CRR125950_filtered_1_reads_trim25_1-len-filt.fq.gz --max_len1 100 -l 100 -t 4

Running classification on same-length reads with -k 1:

centrifuge -x 12-Sept-2020-centrifuge-ref-db/centrifuge-ref-db -U CRR125950_filtered_1_reads_trim25_1-len-filt.fq.gz \
           -S CRR125950-our-filtered-centrifuge-out-len-filt.tsv \
           --report-file CRR125950-our-filtered-centrifuge-report-len-filt.tsv \
           -k 1 -p 20

Running simulated sample from Krista

Grabbing this one:

curl -L -o UnAmbiguouslyMapped_ds.buccal_trim2.fq.gz https://osf.io/r46cm/download

And I think the right info file, at the species level, from here:

curl -L -o Species_Buccal_Truth.csv https://osf.io/m2d8y/download

Running centrifuge with -k 1 set:

centrifuge -x 12-Sept-2020-centrifuge-ref-db/centrifuge-ref-db -U UnAmbiguouslyMapped_ds.buccal_trim2.fq.gz \
           -S UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-out.tsv \
           --report-file UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-report.tsv \
           -k 1 -p 20

Running centrifuge with no -k set:

centrifuge -x 12-Sept-2020-centrifuge-ref-db/centrifuge-ref-db -U UnAmbiguouslyMapped_ds.buccal_trim2.fq.gz \
           -S UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-out-no-k1.tsv \
           --report-file UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-report-no-k1.tsv \
           -p 20

  # and making a kreport for this one
centrifuge-kreport -x 12-Sept-2020-centrifuge-ref-db/centrifuge-ref-db --no-lca UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-out-no-k1.tsv > UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-kreport-no-k1.tsv

Results

Result files can be downloaded with the following:

curl -L -o UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-report.tsv https://ndownloader.figshare.com/files/24732191

curl -L -o UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-out.tsv https://ndownloader.figshare.com/files/24732194

With -k 1 set

Of the 600,000 reads, 25,691 were left unclassified:

grep -c -w "unclassified" UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-out.tsv | sed 's/^/# /'
# 25691

Peeking at the report tab, sorted by relative abundance (these top 20 hold > 99% of the estimated abundance):

cat <( head -n 1 UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-report.tsv ) <( sort -t $'\t' -grk 7 UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-report.tsv | head -n 20 ) | sed 's/^/|/' | sed "s/\\t/ | /g" | sed 's/$/ |/'
name taxID taxRank genomeSize numReads numUniqueReads abundance
Streptococcus thermophilus 1308 species 1844731 49994 49994 0.143208
Streptococcus pyogenes 1314 species 1944914 45724 45724 0.124289
Streptococcus pneumoniae 1313 species 2114821 47055 47055 0.119754
Streptococcus mutans 1309 species 2024183 46686 46686 0.110201
Streptococcus sanguinis 1305 species 2388435 49893 49893 0.109841
Veillonella dispar 39778 species 2116915 45427 45427 0.107906
Haemophilus influenzae 727 species 2278803 48421 48421 0.0980891
Streptococcus equi subsp. zooepidemicus MGCS10565 552526 strain 2024171 145 145 0.0784375
Streptococcus salivarius 1304 species 2213879 40471 40471 0.0740379
Streptococcus dysgalactiae 1334 species 7257305 12007 12007 0.00710079
Streptococcus pneumoniae CGSP14 516950 strain 2209198 2618 2618 0.00635523
Streptococcus pyogenes MGAS10270 370552 strain 1928252 2250 2250 0.00587681
Streptococcus salivarius CCHSS3 1048332 strain 2217184 2844 2844 0.00469715
Haemophilus parainfluenzae T3T1 862965 strain 2086875 14188 14188 0.00360981
Haemophilus parainfluenzae 729 species 2086875 27051 27051 0.00240993
Neisseria subflava 28449 species 4517530 11978 11978 0.00108579
Streptococcus mutans GS-5 1198676 strain 2027088 359 359 0.000748662
Streptococcus salivarius JIM8777 347253 strain 2210574 683 683 0.000614505
Streptococcus sp. NCTC 11567 2583584 species 2147716 236 236 0.000529961
Streptococcus sp. FDAARGOS_192 1839799 species 2435494 488 488 0.000275644

Could take all with an abundance assignment, convert their taxids to lineages, combine at species- or genus-level, then will have to standard ranks

name taxID taxRank genomeSize numReads numUniqueReads abundance
Streptococcus thermophilus 1308 species 1844731 49994 49994 0.143208
Streptococcus sanguinis 1305 species 2388435 49893 49893 0.109841
Haemophilus influenzae 727 species 2278803 48421 48421 0.0980891
Streptococcus pneumoniae 1313 species 2114821 47055 47055 0.119754
Streptococcus mutans 1309 species 2024183 46686 46686 0.110201
Streptococcus 1301 genus 2449574 46675 46675 0.0
Streptococcus pyogenes 1314 species 1944914 45724 45724 0.124289
Veillonella dispar 39778 species 2116915 45427 45427 0.107906
Streptococcus salivarius 1304 species 2213879 40471 40471 0.0740379
Streptococcus equi 1336 species 5577323 36424 36424 0.0
Haemophilus parainfluenzae 729 species 2086875 27051 27051 0.00240993
Haemophilus parainfluenzae T3T1 862965 strain 2086875 14188 14188 0.00360981
Streptococcus dysgalactiae 1334 species 7257305 12007 12007 0.00710079
Neisseria subflava 28449 species 4517530 11978 11978 0.00108579
Streptococcus equi subsp. zooepidemicus 40041 subspecies 19334008 6114 6114 0
Neisseria mucosa 488 species 5008700 5479 5479 0.0
Neisseria 482 genus 2223758 4680 4680 0.0
Veillonella 29465 genus 2132142 4427 4427 0.0
Neisseria flavescens 484 species 2231882 3101 3101 0.0

This adds to the complication of understanding their "abundance" column, as anything that has a lower rank present will have a 0 for abundance (e.g. Steptococcus genus, or Streptococcus equi – S. equi, has abundance lower down assigned to a specific strain that has few counts, but gets 8% of the population):

cat <( head -n 1 UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-report.tsv ) <( grep "Streptococcus equi" UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-report.tsv ) | sed 's/^/|/' | sed "s/\\t/ | /g" | sed 's/$/ |/'
name taxID taxRank genomeSize numReads numUniqueReads abundance
Streptococcus equinus 1335 species 13680225 310 310 5.26736e-05
Streptococcus equi 1336 species 5577323 36424 36424 0.0
Streptococcus equi subsp. zooepidemicus 40041 subspecies 19334008 6114 6114 0
Streptococcus equi subsp. zooepidemicus MGCS10565 552526 strain 2024171 145 145 0.0784375
Streptococcus equi subsp. equi 4047 553482 strain 2253793 640 640 0
Streptococcus equi subsp. zooepidemicus CY 1403449 strain 2107382 85 85 0.0

Without -k 1 set

Of the 600,000 reads, 25,691 were left unclassified (same):

grep -c -w "unclassified" UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-out-no-k1.tsv | sed 's/^/# /'
# 25691

Peeking at the report tab, sorted by number of reads assigned:

cat <( head -n 1 UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-report-no-k1.tsv ) <( sort -t $'\t' -nrk6 UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-report-no-k1.tsv ) | head -n 20 | sed 's/^/|/' | sed "s/\\t/ | /g" | sed 's/$/ |/'
name taxID taxRank genomeSize numReads numUniqueReads abundance
Streptococcus thermophilus 1308 species 1844731 49994 49994 0.143208
Streptococcus sanguinis 1305 species 2388435 49893 49893 0.109841
Haemophilus influenzae 727 species 2278803 48421 48421 0.0980891
Streptococcus pneumoniae 1313 species 2114821 47055 47055 0.119754
Streptococcus mutans 1309 species 2024183 46686 46686 0.110201
Streptococcus 1301 genus 2449574 46675 46675 0.0
Streptococcus pyogenes 1314 species 1944914 45724 45724 0.124289
Veillonella dispar 39778 species 2116915 45427 45427 0.107906
Streptococcus salivarius 1304 species 2213879 40471 40471 0.0740379
Streptococcus equi 1336 species 5577323 36424 36424 0.0
Haemophilus parainfluenzae 729 species 2086875 27051 27051 0.00240993
Haemophilus parainfluenzae T3T1 862965 strain 2086875 14188 14188 0.00360981
Streptococcus dysgalactiae 1334 species 7257305 12007 12007 0.00710079
Neisseria subflava 28449 species 4517530 11978 11978 0.00108579
Streptococcus equi subsp. zooepidemicus 40041 subspecies 19334008 6114 6114 0
Neisseria mucosa 488 species 5008700 5479 5479 0.0
Neisseria 482 genus 2223758 4680 4680 0.0
Veillonella 29465 genus 2132142 4427 4427 0.0
Neisseria flavescens 484 species 2231882 3101 3101 0.0

This adds to the complication of understanding their "abundance" column, as anything that has a lower rank present will have a 0 for abundance (e.g. Steptococcus genus, or Streptococcus equi – S. equi, has abundance lower down assigned to a specific strain that has few counts, but gets 8% of the population):

cat <( head -n 1 UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-report.tsv ) <( grep "Streptococcus equi" UnAmbiguouslyMapped_ds.buccal_trim2-centrifuge-report.tsv ) | sed 's/^/|/' | sed "s/\\t/ | /g" | sed 's/$/ |/'
name taxID taxRank genomeSize numReads numUniqueReads abundance
Streptococcus equinus 1335 species 13680225 310 310 5.26736e-05
Streptococcus equi 1336 species 5577323 36424 36424 0.0
Streptococcus equi subsp. zooepidemicus 40041 subspecies 19334008 6114 6114 0
Streptococcus equi subsp. zooepidemicus MGCS10565 552526 strain 2024171 145 145 0.0784375
Streptococcus equi subsp. equi 4047 553482 strain 2253793 640 640 0
Streptococcus equi subsp. zooepidemicus CY 1403449 strain 2107382 85 85 0.0