# Centrifuge
### Install:
`conda create -y -n centrifuge -c conda-forge -c bioconda -c defaults centrifuge=1.0.4_beta`
Centrifuge is a novel microbial classification engine that enables rapid, accurate, and sensitive labeling of reads and quantification of species on desktop computers. The system uses a novel indexing scheme based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index, optimized specifically for the metagenomic classification problem.
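The index used on this page was copied prebuilt from another instance (see the database download step below). For reference only, building a Centrifuge index from NCBI genomes generally follows the pattern from the Centrifuge documentation; the library selection and the `example-index` name here are placeholders, not the exact recipe for the archaea/bacteria/human/viral/fungi index used in this run:
```
# general pattern: fetch NCBI taxonomy + genomes, then build the BWT/FM index
centrifuge-download -o taxonomy taxonomy
centrifuge-download -o library -m -d "archaea,bacteria,viral" refseq > seqid2taxid.map
cat library/*/*.fna > input-sequences.fna
centrifuge-build -p 42 --conversion-table seqid2taxid.map \
    --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp \
    input-sequences.fna example-index
```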
This page documents how Centrifuge is run on a cloud instance, with the goal of generating sample output files.
### Cloud Instance: m1.xxlarge (CPU: 44, Mem: 120 GB, Disk: 60 GB)
SSH: `jouvens@149.165.168.116`
## Example Run
### Activate:
`conda activate centrifuge`
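A quick check that the environment resolved to the expected build (output not recorded here):
```
# confirm the binary on PATH is the 1.0.4_beta install
which centrifuge
centrifuge --version | head -n 1
```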
### Switch to external volume: vol_b
`cd /vol_b/`
### Database download:
`scp -r astrobio@149.165.170.83:/vol_b/centrifuge-db/ .`
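Before running anything, it is worth confirming the index copied over intact. A Centrifuge index is split across several `.cf` files (the exact count is version dependent); this check only assumes the index basename used in the commands below:
```
# confirm the .cf index files all arrived and see how much of the 60 GB volume they take
ls -lh centrifuge-db/centrifuge-complete-genomes-arc-bac-human-viral-fungi.*.cf
du -sh centrifuge-db/
```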
### Dataset download:
`curl -L -o metagenomic-read-files.tar.gz https://ndownloader.figshare.com/files/24079451`
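A light sanity check on the download before unpacking (same file name as above):
```
# confirm the archive downloaded completely and peek at its contents
ls -lh metagenomic-read-files.tar.gz
tar -tzf metagenomic-read-files.tar.gz | head
```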
### Unpack the dataset:
`tar -xzvf metagenomic-read-files.tar.gz`
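Optionally, count the reads per R1 file as a sanity check. The `*_R1_trimmed.fastq.gz` naming is taken from the final script further down, so adjust the glob if the tarball layout differs:
```
# read count per R1 file (4 FASTQ lines per read)
for f in metagenomic-read-files/*_R1_trimmed.fastq.gz
do
    echo "$f: $(( $(zcat "$f" | wc -l) / 4 )) reads"
done
```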
### Setting up the parameters:
```
centrifuge -x centrifuge-db/centrifuge-complete-genomes-arc-bac-human-viral-fungi \
-1 sample-1-R1.fq.gz -2 sample-1-R2.fq.gz \
-S sample-1-centrifuge-out.tsv --report-file sample-1-centrifuge-report.tsv \
-k 1 -p 42
```
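Here `-x` points at the index basename, `-1`/`-2` are the paired read files, `-S` is the per-read classification table, `--report-file` is the per-taxon summary, `-k 1` keeps only the best assignment per read, and `-p 42` uses 42 of the instance's 44 CPUs. A quick way to eyeball the two outputs after the run, assuming the column layout documented for Centrifuge 1.0.4 (treat the sort column as an assumption if the version differs):
```
# per-read assignments: readID, seqID, taxID, score, 2ndBestScore, hitLength, queryLength, numMatches
head -n 5 sample-1-centrifuge-out.tsv
# taxa ranked by reads assigned (report columns: name, taxID, taxRank, genomeSize, numReads, numUniqueReads, abundance)
tail -n +2 sample-1-centrifuge-report.tsv | sort -t$'\t' -k5,5nr | head -n 10
```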
### Automation:
**First Attempt**
```
find metagenomic-read-files -maxdepth 1 -name "*R1*" -print |sort | cat > SampleSet.txt
find metagenomic-read-files -maxdepth 1 -name "*R2*" -print |sort | cat > R2samples.txt
paste SampleSet.txt R2samples.txt | cat >> FinalSet.txt
while read -r ROne RTwo
do
centrifuge -x centrifuge-db/centrifuge-complete-genomes-arc-bac-human-viral-fungi \
--paired-reads $ROne $RTwo \
-t 42 -o result
done < FinalSet.txt
rm FinalSet.txt
```
**Second Attempt**
```
find metagenomic-read-files -maxdepth 1 -name "*R1*" -print |sort | cat > SampleSet.txt
find metagenomic-read-files -maxdepth 1 -name "*R2*" -print |sort | cat > R2samples.txt
paste SampleSet.txt R2samples.txt | cat >> FinalSet.txt
while read -r ROne RTwo
do
centrifuge -x centrifuge-db/centrifuge-complete-genomes-arc-bac-human-viral-fungi \
-1 $ROne -2 $RTwo \
-t 42 -o result
done < FinalSet.txt
rm FinalSet.txt
```
**Third Attempt**
```
find metagenomic-read-files -maxdepth 1 -name "*R1*" -print |sort | cat > SampleSet.txt
find metagenomic-read-files -maxdepth 1 -name "*R2*" -print |sort | cat > R2samples.txt
paste SampleSet.txt R2samples.txt | cat >> FinalSet.txt
while read -r ROne RTwo
do
centrifuge -x centrifuge-db/centrifuge-complete-genomes-arc-bac-human-viral-fungi -1 $ROne -2 $RTwo -S test_sample-out.tsv --report-file sample-1-centrifuge-report.tsv -k 1 -p 40
done < FinalSet.txt
rm FinalSet.txt
```
In the third attempt every iteration writes to the same `-S`/`--report-file` names, so each sample overwrites the previous one's output; the final version below derives per-sample file names from the input prefix instead.
**Final Attempt**
```
script centrifuge_script.sh
conda activate centrifuge
for f in metagenomic-read-files/*_R1_trimmed.fastq.gz # loop over the R1 file of each sample
do
echo "start time: " $(date +%T)
start=`date +%s`
n=${f%%_R1_trimmed.fastq.gz} # strip the _R1 suffix to get the per-sample prefix
centrifuge -x centrifuge-db/centrifuge-complete-genomes-arc-bac-human-viral-fungi -1 ${n}_R1_trimmed.fastq.gz -2 ${n}_R2_trimmed.fastq.gz -S ${n}_out.tsv --report-file ${n}_report.tsv -k 1 -p 42
done
for file in metagenomic-read-files/*_out.tsv
do
n=${file%%_out.tsv}
centrifuge-kreport -x centrifuge-db/centrifuge-complete-genomes-arc-bac-human-viral-fungi $file > ${n}_centrifuge_reformatted.tsv
done
exit
```
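The loop assumes every R1 file has a matching R2 mate. A small pre-check using the same naming convention (a sketch, not part of the recorded session) avoids discovering a missing mate partway through a run that takes close to an hour:
```
# verify each R1 file has its R2 mate before starting the long runs
for f in metagenomic-read-files/*_R1_trimmed.fastq.gz
do
    mate="${f%_R1_trimmed.fastq.gz}_R2_trimmed.fastq.gz"
    [ -f "$mate" ] || echo "missing mate for $f"
done
```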
**Classifier Memory Usage**
http://149.165.168.116:8000/lab/tree/classifier_mem_usage.txt
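The summary line at the end of the results reports real/user/sys time, peak memory, and CPU utilization for the whole session. One way to collect comparable numbers for a single sample, offered as an assumption about tooling rather than how the linked figures were produced, is GNU time with a custom format; the sample file names below are guesses based on the naming pattern in the script:
```
# GNU time (/usr/bin/time, not the shell builtin); %M = peak RSS in KB, %P = CPU utilization
/usr/bin/time -f "%e real, %U user, %S sys, %M max_mem_kb, %P cpu" \
    centrifuge -x centrifuge-db/centrifuge-complete-genomes-arc-bac-human-viral-fungi \
    -1 metagenomic-read-files/sample1_R1_trimmed.fastq.gz \
    -2 metagenomic-read-files/sample1_R2_trimmed.fastq.gz \
    -S /dev/null --report-file /dev/null -k 1 -p 42
```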
**Results**
```
start time: 23:02:22
report file metagenomic-read-files/sample1_report.txt
Number of iterations in EM algorithm: 3
Probability diff. (P - P_prev) in the last iteration: 3.60962e-14
Calculating abundance: 00:00:00
start time: 23:09:31
report file metagenomic-read-files/sample2_report.txt
Number of iterations in EM algorithm: 3
Probability diff. (P - P_prev) in the last iteration: 2.36382e-14
Calculating abundance: 00:00:00
start time: 23:16:33
report file metagenomic-read-files/sample3_report.txt
Number of iterations in EM algorithm: 3
Probability diff. (P - P_prev) in the last iteration: 3.60962e-14
Calculating abundance: 00:00:00
start time: 23:24:06
report file metagenomic-read-files/sample4_report.txt
Number of iterations in EM algorithm: 4
Probability diff. (P - P_prev) in the last iteration: 1.41542e-13
Calculating abundance: 00:00:00
start time: 23:32:08
report file metagenomic-read-files/sample5_report.txt
Number of iterations in EM algorithm: 38
Probability diff. (P - P_prev) in the last iteration: 6.35734e-12
Calculating abundance: 00:00:00
start time: 23:39:39
report file metagenomic-read-files/sample6_report.txt
Number of iterations in EM algorithm: 34
Probability diff. (P - P_prev) in the last iteration: 2.19702e-11
Calculating abundance: 00:00:00
start time: 23:46:56
report file metagenomic-read-files/sample7_report.txt
Number of iterations in EM algorithm: 31
Probability diff. (P - P_prev) in the last iteration: 8.07652e-11
Calculating abundance: 00:00:00
start time: 23:54:29
report file metagenomic-read-files/sample8_report.txt
Number of iterations in EM algorithm: 29
Probability diff. (P - P_prev) in the last iteration: 5.05312e-11
Calculating abundance: 00:00:00
____________________________________________________________
:50:38 real, 43768.53 user, 13310.67 sys, 37522344 max_mem, 0 av_mem, 859% cpu
```
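Since `centrifuge-kreport` emits Kraken-style reports (percent of reads, clade reads, direct reads, rank code, taxID, name), a quick cross-sample species summary can be pulled from the reformatted files. This is a sketch that assumes the `_centrifuge_reformatted.tsv` naming from the final script:
```
# top 5 species-level rows per sample, ranked by reads assigned to the clade
for r in metagenomic-read-files/*_centrifuge_reformatted.tsv
do
    echo "== $r"
    awk -F'\t' '$4 == "S"' "$r" | sort -t$'\t' -k2,2nr | head -n 5
done
```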