# Kraken 2/Bracken
Kraken 2 is the newest version of Kraken, a taxonomic classification system that uses exact k-mer matches to achieve high accuracy and fast classification speeds. The classifier maps each k-mer in a query sequence to the lowest common ancestor (LCA) of all genomes containing that k-mer, and these k-mer assignments determine how the whole sequence is classified. Bracken then takes the Kraken 2 report and re-estimates abundances at a chosen taxonomic level (species in the runs below).
## Install Kraken 2 and Bracken:
`conda create -y -n kraken2 -c conda-forge -c bioconda -c defaults kraken2=2.0.9beta bracken=2.6.0`
## Download the database:
`scp -r astrobio@149.165.170.83:/vol_b/kraken2-db/ .`
## Download the dataset:
`curl -L -o metagenomic-read-files.tar.gz https://ndownloader.figshare.com/files/24079451`
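## Unpack the dataset and run one sample:
```
tar -xzvf metagenomic-read-files.tar.gz
# Classify the paired-end reads of one sample with Kraken 2
kraken2 --db kraken2-db/ --threads 6 \
--output sample-1-kraken2-out.txt --report sample-1-kraken2-report.txt \
--paired sample-1-R1.fq.gz sample-1-R2.fq.gz
# Estimate species-level abundances from the Kraken 2 report with Bracken
bracken -r 150 -d kraken2-db/ -i sample-1-kraken2-report.txt -o sample-1-bracken-out.tsv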
```
## Output error 1 (kraken2 not on the PATH):
```
Command 'kraken2' not found, did you mean:
command 'kraken' from deb kraken
Try: apt install <deb name>
```
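The error above means the shell could not find `kraken2` on the `PATH`. A likely cause, though this is an assumption and not something the log confirms, is that the `kraken2` conda environment created during installation was not activated. The sketch below uses a hypothetical helper, `check_tool`, to test for both programs before starting a run.

```shell
# Hypothetical helper (not part of Kraken 2 or Bracken):
# report whether a program can be found on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found"
    else
        # Assumption: the tools live in the conda environment named "kraken2"
        echo "$1 not found; try: conda activate kraken2"
        return 1
    fi
}

# "|| true" keeps the script going so both tools are always reported
check_tool kraken2 || true
check_tool bracken || true
```

If either tool is reported missing, running `conda activate kraken2` and repeating the check is a reasonable first step.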
## Output 1 (Results - Kraken 2 and Bracken):
```
Loading database information... done.
2 sequences (0.00 Mbp) processed in 0.001s (80.8 Kseq/m, 25.84 Mbp/m).
2 sequences classified (100.00%)
>> Checking for Valid Options...
>> Running Bracken
>> python src/est_abundance.py -i sample-1-kraken2-report.txt -o sample-1-bracken-out.tsv -k kraken2-db/database150mers.kmer_distrib -l S -t 0
PROGRAM START TIME: 08-05-2020 22:27:28
BRACKEN SUMMARY (Kraken report: sample-1-kraken2-report.txt)
>>> Threshold: 0
>>> Number of species in sample: 1
>> Number of species with reads > threshold: 1
>> Number of species with reads < threshold: 0
>>> Total reads in sample: 2
>> Total reads kept at species level (reads > threshold): 1
>> Total reads discarded (species reads < threshold): 0
>> Reads distributed: 1
>> Reads not distributed (eg. no species above threshold): 0
>> Unclassified reads: 0
BRACKEN OUTPUT PRODUCED: sample-1-bracken-out.tsv
PROGRAM END TIME: 08-05-2020 22:27:28
Bracken complete.
```
## Automation Process
### Dry run: checking the file names built from sample-list.txt
```
for sample in $(cat sample-list.txt)
do
echo $sample
# what we want the kraken --output file to be
echo "This is what the regular kraken output file will be named: ${sample}-kraken2-out.txt"
# what we want the kraken --report file to be
echo "This is what the kraken report file will be named: ${sample}-kraken2-report.txt"
# trying to pass the forward read file
echo "This is where we think the input forward read file is: ${sample}-R1.fq.gz"
# we can check it exists with the ls command
ls ${sample}-R1.fq.gz
# trying to pass the reverse read file
echo "This is where we think the input reverse read file is: ${sample}-R2.fq.gz"
# we can check it exists with the ls command
ls ${sample}-R2.fq.gz
# and just adding some space between each iteration
echo ""
echo ""
done
```
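The loop above reads `sample-list.txt` but does not show how that file is produced. One possible way to build it, sketched here under the assumption that the forward reads are named `<sample>_R1_trimmed.fastq.gz` inside `metagenomic-read-files/` (the naming used in the full automation loop), is to strip that suffix from each forward-read file name. The function name `make_sample_list` is hypothetical.

```shell
# Hypothetical helper: print one sample name per forward-read file in a directory.
# Assumes forward reads end in _R1_trimmed.fastq.gz.
make_sample_list() {
    dir="$1"
    for f in "$dir"/*_R1_trimmed.fastq.gz; do
        # Skip the unexpanded pattern when no file matches
        [ -e "$f" ] || continue
        # basename drops the directory and the given suffix
        basename "$f" _R1_trimmed.fastq.gz
    done
}

make_sample_list metagenomic-read-files > sample-list.txt
```

If the read files follow a different naming scheme, such as the `-R1.fq.gz` suffix used in the dry run, the suffix in the function would need to change to match.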
### Full Automation
```
for sample in $(cat sample-list.txt)
do
echo On: $sample
kraken2 --db kraken2-db/ --threads 6 \
--output ${sample}-kraken2-out.txt --report ${sample}-kraken2-report.txt \
--paired metagenomic-read-files/${sample}_R1_trimmed.fastq.gz metagenomic-read-files/${sample}_R2_trimmed.fastq.gz
bracken -r 150 -d kraken2-db/ -i ${sample}-kraken2-report.txt -o ${sample}-bracken-out.tsv
done
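# Resource snapshot, taken once after all samples have finished.
# These are rough system-wide figures, not the usage of kraken2/bracken themselves.
CpuUse=0
MemUse=0
Count=0
# Size of the input read directory
DiskUse=$(du -sh metagenomic-read-files | awk '{print $1}')
# 4th field of top's "Threads:" header line = number of currently running threads
tempCPU=$(top -H -b -n1 -d1 | grep "Threads" | awk '{print $4}')
# 2nd field of free's "Mem:" line = total memory in GiB (not memory in use)
tempMem=$(free -g | grep "Mem:" | awk '{print $2}')
CpuUse=$((CpuUse + tempCPU))
if [ "$tempMem" -gt "$MemUse" ]
then
    MemUse=$tempMem
fi
Count=$((Count + 1))
# Average running-thread count over the snapshots taken (only one here)
echo $((CpuUse / Count))
echo "$MemUse"
echo "$DiskUse"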
```
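To make clear what the two `awk` field selections in the snapshot code pick out, the sketch below applies the same selections to example header lines. The helper names are hypothetical, and the example lines only illustrate the layout printed by `top -H -b` and `free -g`; apart from the selected fields, their values are made up.

```shell
# Hypothetical helpers: read the text output of top / free on stdin.
running_threads() {
    # 4th field of the "Threads:" line is the running-thread count
    awk '/^Threads:/ { print $4 }'
}
total_mem_gib() {
    # 2nd field of the "Mem:" line is total memory (GiB with free -g)
    awk '/^Mem:/ { print $2 }'
}

# Illustrative lines only, not captured from the run in this document
printf 'Threads: 250 total,   1 running, 249 sleeping,   0 stopped,   0 zombie\n' | running_threads
printf 'Mem:            118          40          10           1          67          76\n' | total_mem_gib
```

Read this way, the first two numbers printed by the script are a running-thread count and the machine's total memory, not the CPU or memory consumed by Kraken 2 and Bracken.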
### Time Command
```
command time -f "\t%E real, \t%U user, \t%S sys, \t%M max_mem, \t%K av_mem, \t%P cpu" -o usage-info.txt bash kraken-bracken.sh
```
## Results
```
On: sample1
Loading database information... done.
4283096 sequences (1199.29 Mbp) processed in 48.627s (5284.8 Kseq/m, 1479.78 Mbp/m).
3949426 sequences classified (92.21%)
333670 sequences unclassified (7.79%)
>> Checking for Valid Options...
>> Running Bracken
>> python src/est_abundance.py -i sample1-kraken2-report.txt -o sample1-bracken-out.tsv -k kraken2-db/database150mers.kmer_distrib -l S -t 0
PROGRAM START TIME: 08-21-2020 20:38:06
BRACKEN SUMMARY (Kraken report: sample1-kraken2-report.txt)
>>> Threshold: 0
>>> Number of species in sample: 3245
>> Number of species with reads > threshold: 3245
>> Number of species with reads < threshold: 0
>>> Total reads in sample: 4283096
>> Total reads kept at species level (reads > threshold): 2816840
>> Total reads discarded (species reads < threshold): 0
>> Reads distributed: 1131837
>> Reads not distributed (eg. no species above threshold): 749
>> Unclassified reads: 333670
BRACKEN OUTPUT PRODUCED: sample1-bracken-out.tsv
PROGRAM END TIME: 08-21-2020 20:38:06
Bracken complete.
On: sample2
Loading database information... done.
4327888 sequences (1202.67 Mbp) processed in 51.709s (5021.8 Kseq/m, 1395.50 Mbp/m).
4049074 sequences classified (93.56%)
278814 sequences unclassified (6.44%)
>> Checking for Valid Options...
>> Running Bracken
>> python src/est_abundance.py -i sample2-kraken2-report.txt -o sample2-bracken-out.tsv -k kraken2-db/database150mers.kmer_distrib -l S -t 0
PROGRAM START TIME: 08-21-2020 20:46:21
BRACKEN SUMMARY (Kraken report: sample2-kraken2-report.txt)
>>> Threshold: 0
>>> Number of species in sample: 3092
>> Number of species with reads > threshold: 3092
>> Number of species with reads < threshold: 0
>>> Total reads in sample: 4327888
>> Total reads kept at species level (reads > threshold): 2811244
>> Total reads discarded (species reads < threshold): 0
>> Reads distributed: 1237198
>> Reads not distributed (eg. no species above threshold): 632
>> Unclassified reads: 278814
BRACKEN OUTPUT PRODUCED: sample2-bracken-out.tsv
PROGRAM END TIME: 08-21-2020 20:46:21
Bracken complete.
On: sample3
Loading database information... done.
4283096 sequences (1199.29 Mbp) processed in 52.500s (4895.0 Kseq/m, 1370.62 Mbp/m).
3949426 sequences classified (92.21%)
333670 sequences unclassified (7.79%)
>> Checking for Valid Options...
>> Running Bracken
>> python src/est_abundance.py -i sample3-kraken2-report.txt -o sample3-bracken-out.tsv -k kraken2-db/database150mers.kmer_distrib -l S -t 0
PROGRAM START TIME: 08-21-2020 20:54:10
BRACKEN SUMMARY (Kraken report: sample3-kraken2-report.txt)
>>> Threshold: 0
>>> Number of species in sample: 3245
>> Number of species with reads > threshold: 3245
>> Number of species with reads < threshold: 0
>>> Total reads in sample: 4283096
>> Total reads kept at species level (reads > threshold): 2816840
>> Total reads discarded (species reads < threshold): 0
>> Reads distributed: 1131837
>> Reads not distributed (eg. no species above threshold): 749
>> Unclassified reads: 333670
BRACKEN OUTPUT PRODUCED: sample3-bracken-out.tsv
PROGRAM END TIME: 08-21-2020 20:54:11
Bracken complete.
On: sample4
Loading database information... done.
5270348 sequences (1444.14 Mbp) processed in 67.224s (4704.0 Kseq/m, 1288.96 Mbp/m).
4102225 sequences classified (77.84%)
1168123 sequences unclassified (22.16%)
>> Checking for Valid Options...
>> Running Bracken
>> python src/est_abundance.py -i sample4-kraken2-report.txt -o sample4-bracken-out.tsv -k kraken2-db/database150mers.kmer_distrib -l S -t 0
PROGRAM START TIME: 08-21-2020 21:02:14
BRACKEN SUMMARY (Kraken report: sample4-kraken2-report.txt)
>>> Threshold: 0
>>> Number of species in sample: 3588
>> Number of species with reads > threshold: 3588
>> Number of species with reads < threshold: 0
>>> Total reads in sample: 5270348
>> Total reads kept at species level (reads > threshold): 2929872
>> Total reads discarded (species reads < threshold): 0
>> Reads distributed: 1171564
>> Reads not distributed (eg. no species above threshold): 789
>> Unclassified reads: 1168123
BRACKEN OUTPUT PRODUCED: sample4-bracken-out.tsv
PROGRAM END TIME: 08-21-2020 21:02:14
Bracken complete.
On: sample5
Loading database information... done.
4998731 sequences (1259.68 Mbp) processed in 45.906s (6533.5 Kseq/m, 1646.43 Mbp/m).
3853084 sequences classified (77.08%)
1145647 sequences unclassified (22.92%)
>> Checking for Valid Options...
>> Running Bracken
>> python src/est_abundance.py -i sample5-kraken2-report.txt -o sample5-bracken-out.tsv -k kraken2-db/database150mers.kmer_distrib -l S -t 0
PROGRAM START TIME: 08-21-2020 21:08:26
BRACKEN SUMMARY (Kraken report: sample5-kraken2-report.txt)
>>> Threshold: 0
>>> Number of species in sample: 4623
>> Number of species with reads > threshold: 4623
>> Number of species with reads < threshold: 0
>>> Total reads in sample: 4998731
>> Total reads kept at species level (reads > threshold): 3289608
>> Total reads discarded (species reads < threshold): 0
>> Reads distributed: 563285
>> Reads not distributed (eg. no species above threshold): 191
>> Unclassified reads: 1145647
BRACKEN OUTPUT PRODUCED: sample5-bracken-out.tsv
PROGRAM END TIME: 08-21-2020 21:08:26
Bracken complete.
On: sample6
Loading database information... done.
4998731 sequences (1249.68 Mbp) processed in 37.057s (8093.5 Kseq/m, 2023.38 Mbp/m).
3842226 sequences classified (76.86%)
1156505 sequences unclassified (23.14%)
>> Checking for Valid Options...
>> Running Bracken
>> python src/est_abundance.py -i sample6-kraken2-report.txt -o sample6-bracken-out.tsv -k kraken2-db/database150mers.kmer_distrib -l S -t 0
PROGRAM START TIME: 08-21-2020 21:13:36
BRACKEN SUMMARY (Kraken report: sample6-kraken2-report.txt)
>>> Threshold: 0
>>> Number of species in sample: 4514
>> Number of species with reads > threshold: 4514
>> Number of species with reads < threshold: 0
>>> Total reads in sample: 4998731
>> Total reads kept at species level (reads > threshold): 3273618
>> Total reads discarded (species reads < threshold): 0
>> Reads distributed: 568384
>> Reads not distributed (eg. no species above threshold): 224
>> Unclassified reads: 1156505
BRACKEN OUTPUT PRODUCED: sample6-bracken-out.tsv
PROGRAM END TIME: 08-21-2020 21:13:37
Bracken complete.
On: sample7
Loading database information... done.
4998734 sequences (1259.68 Mbp) processed in 43.783s (6850.3 Kseq/m, 1726.27 Mbp/m).
3813625 sequences classified (76.29%)
1185109 sequences unclassified (23.71%)
>> Checking for Valid Options...
>> Running Bracken
>> python src/est_abundance.py -i sample7-kraken2-report.txt -o sample7-bracken-out.tsv -k kraken2-db/database150mers.kmer_distrib -l S -t 0
PROGRAM START TIME: 08-21-2020 21:18:22
BRACKEN SUMMARY (Kraken report: sample7-kraken2-report.txt)
>>> Threshold: 0
>>> Number of species in sample: 4433
>> Number of species with reads > threshold: 4433
>> Number of species with reads < threshold: 0
>>> Total reads in sample: 4998734
>> Total reads kept at species level (reads > threshold): 3347470
>> Total reads discarded (species reads < threshold): 0
>> Reads distributed: 465937
>> Reads not distributed (eg. no species above threshold): 218
>> Unclassified reads: 1185109
BRACKEN OUTPUT PRODUCED: sample7-bracken-out.tsv
PROGRAM END TIME: 08-21-2020 21:18:22
Bracken complete.
On: sample8
Loading database information... done.
4998734 sequences (1249.68 Mbp) processed in 37.208s (8060.7 Kseq/m, 2015.18 Mbp/m).
3801365 sequences classified (76.05%)
1197369 sequences unclassified (23.95%)
>> Checking for Valid Options...
>> Running Bracken
>> python src/est_abundance.py -i sample8-kraken2-report.txt -o sample8-bracken-out.tsv -k kraken2-db/database150mers.kmer_distrib -l S -t 0
PROGRAM START TIME: 08-21-2020 21:22:35
BRACKEN SUMMARY (Kraken report: sample8-kraken2-report.txt)
>>> Threshold: 0
>>> Number of species in sample: 4328
>> Number of species with reads > threshold: 4328
>> Number of species with reads < threshold: 0
>>> Total reads in sample: 4998734
>> Total reads kept at species level (reads > threshold): 3331565
>> Total reads discarded (species reads < threshold): 0
>> Reads distributed: 469565
>> Reads not distributed (eg. no species above threshold): 235
>> Unclassified reads: 1197369
BRACKEN OUTPUT PRODUCED: sample8-bracken-out.tsv
PROGRAM END TIME: 08-21-2020 21:22:36
Bracken complete.
1
118
5.0G
```
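The classification rate of each sample can be pulled out of a log like the one above with a short `awk` filter. This is a minimal sketch: the function name `summarize_classified` and the log file name `kraken-bracken.log` are assumptions, and it presumes the script's output was saved with something like `bash kraken-bracken.sh > kraken-bracken.log 2>&1`.

```shell
# Hypothetical helper: print "<sample><TAB><percent classified>" for each sample in a log on stdin.
summarize_classified() {
    awk '
        # Remember the sample name from the "On: <sample>" line
        /^On: /                  { sample = $2 }
        # "N sequences classified (P%)": strip the parentheses from the 4th field
        / sequences classified / { gsub(/[()]/, "", $4); print sample "\t" $4 }
    '
}

# Assumed log file name; only read it when it exists
if [ -f kraken-bracken.log ]; then
    summarize_classified < kraken-bracken.log
fi
```

Lines reporting unclassified sequences are skipped because the pattern requires a space immediately before "classified".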
### Usage Info
```
22:44.40 real, 1447.33 user, 368.05 sys, 49824956 max_mem, 0 av_mem, 133% cpu
```
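The `max_mem` figure comes from GNU time's `%M` format, which reports the maximum resident set size in kilobytes, and the `cpu` figure is (user + sys) divided by elapsed time. The sketch below, with hypothetical helper names, converts the memory value to GiB and recomputes the CPU percentage from the numbers above (22:44.40 elapsed is 1364.40 seconds). The `av_mem` value (`%K`) is typically reported as 0 on Linux, so it is not used here.

```shell
# Hypothetical helper: convert kilobytes to GiB with two decimals.
kb_to_gib() {
    awk -v kb="$1" 'BEGIN { printf "%.2f\n", kb / 1024 / 1024 }'
}

# Hypothetical helper: CPU percentage = (user + sys) / real * 100.
# Arguments: user seconds, sys seconds, elapsed (real) seconds.
cpu_percent() {
    awk -v u="$1" -v s="$2" -v r="$3" 'BEGIN { printf "%.0f\n", (u + s) / r * 100 }'
}

kb_to_gib 49824956                 # prints 47.52
cpu_percent 1447.33 368.05 1364.40 # prints 133
```

So the run peaked at roughly 47.5 GiB of resident memory, and the recomputed 133% agrees with the value GNU time reported.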