# Kraken 2/Bracken

Kraken 2 is the newest version of Kraken, a taxonomic classification system that uses exact k-mer matches to achieve high accuracy and fast classification speeds. The classifier maps each k-mer in a query sequence to the lowest common ancestor (LCA) of all genomes containing that k-mer, and these k-mer assignments inform the classification algorithm.

## Install Kraken 2 and Bracken:
`conda create -y -n kraken2 -c conda-forge -c bioconda -c defaults kraken2=2.0.9beta bracken=2.6.0`

## Download the database:
`scp -r astrobio@149.165.170.83:/vol_b/kraken2-db/ .`

## Download the dataset:
`curl -L -o metagenomic-read-files.tar.gz https://ndownloader.figshare.com/files/24079451`

## Unpack the dataset:
```
tar -xzvf metagenomic-read-files.tar.gz

# Setting up the parameters:
kraken2 --db kraken2-db/ --threads 6 \
        --output sample-1-kraken2-out.txt --report sample-1-kraken2-report.txt \
        --paired sample-1-R1.fq.gz sample-1-R2.fq.gz
```

## Output error 1:
```
Command 'kraken2' not found, did you mean:
  command 'kraken' from deb kraken
Try: apt install <deb name>
```
This error usually means the `kraken2` conda environment created above is not active in the current shell; running `conda activate kraken2` first should make the command available.

## Output 1 (Results - Bracken):
```
Loading database information... done.
2 sequences (0.00 Mbp) processed in 0.001s (80.8 Kseq/m, 25.84 Mbp/m).
  2 sequences classified (100.00%)
>> Checking for Valid Options...
>> Running Bracken
>> python src/est_abundance.py -i sample-1-kraken2-report.txt -o sample-1-bracken-out.tsv -k kraken2-db/database150mers.kmer_distrib -l S -t 0
PROGRAM START TIME: 08-05-2020 22:27:28
BRACKEN SUMMARY (Kraken report: sample-1-kraken2-report.txt)
    >>> Threshold: 0
    >>> Number of species in sample: 1
      >> Number of species with reads > threshold: 1
      >> Number of species with reads < threshold: 0
    >>> Total reads in sample: 2
      >> Total reads kept at species level (reads > threshold): 1
      >> Total reads discarded (species reads < threshold): 0
      >> Reads distributed: 1
      >> Reads not distributed (eg. no species above threshold): 0
      >> Unclassified reads: 0
BRACKEN OUTPUT PRODUCED: sample-1-bracken-out.tsv
PROGRAM END TIME: 08-05-2020 22:27:28
  Bracken complete.
```

## Automation Process

### Making the sample-list.txt file
```
for sample in $(cat sample-list.txt)
do

  echo "$sample"

  # what we want the kraken --output file to be
  echo "This is what the regular kraken output file will be named: ${sample}-kraken2-out.txt"

  # what we want the kraken --report file to be
  echo "This is what the kraken report file will be named: ${sample}-kraken2-report.txt"

  # trying to pass the forward read file
  echo "This is where we think the input forward read file is: ${sample}-R1.fq.gz"
  # we can check it exists with the ls command
  ls "${sample}-R1.fq.gz"

  # trying to pass the reverse read file
  echo "This is where we think the input reverse read file is: ${sample}-R2.fq.gz"
  # we can check it exists with the ls command
  ls "${sample}-R2.fq.gz"

  # and just adding some space between each iteration
  echo ""
  echo ""

done
```

### Full Automation
```
for sample in $(cat sample-list.txt)
do

  echo "On: $sample"

  kraken2 --db kraken2-db/ --threads 6 \
          --output "${sample}-kraken2-out.txt" --report "${sample}-kraken2-report.txt" \
          --paired "metagenomic-read-files/${sample}_R1_trimmed.fastq.gz" "metagenomic-read-files/${sample}_R2_trimmed.fastq.gz"

  bracken -r 150 -d kraken2-db/ -i "${sample}-kraken2-report.txt" -o "${sample}-bracken-out.tsv"

done

# One-off snapshot of machine resources, taken after the loop finishes
CpuUse=0
MemUse=0
DiskUse=$(du -sh metagenomic-read-files | awk '{print $1}')
Count=0

# Field 4 of top's "Threads:" line is the number of running threads, not a CPU percentage
tempCPU=$(top -H -b -n1 -d1 | grep "Threads" | awk '{print $4}')
# Field 2 of free's "Mem:" line is the total memory in GiB, not the memory in use
tempMem=$(free -g | grep "Mem:" | awk '{print $2}')

CpuUse=$((CpuUse + tempCPU))
if [ "$tempMem" -gt "$MemUse" ]
then
  MemUse=$tempMem
fi
Count=$((Count + 1))

echo $((CpuUse / Count))
echo "$MemUse"
echo "$DiskUse"
```

### Time Command
```
command time -f "\t%E real, \t%U user, \t%S sys, \t%M max_mem, \t%K av_mem, \t%P cpu" -o usage-info.txt bash kraken-bracken.sh
```

## Results
```
On: sample1
Loading database information... done.
4283096 sequences (1199.29 Mbp) processed in 48.627s (5284.8 Kseq/m, 1479.78 Mbp/m).
  3949426 sequences classified (92.21%)
  333670 sequences unclassified (7.79%)
>> Checking for Valid Options...
>> Running Bracken
>> python src/est_abundance.py -i sample1-kraken2-report.txt -o sample1-bracken-out.tsv -k kraken2-db/database150mers.kmer_distrib -l S -t 0
PROGRAM START TIME: 08-21-2020 20:38:06
BRACKEN SUMMARY (Kraken report: sample1-kraken2-report.txt)
    >>> Threshold: 0
    >>> Number of species in sample: 3245
      >> Number of species with reads > threshold: 3245
      >> Number of species with reads < threshold: 0
    >>> Total reads in sample: 4283096
      >> Total reads kept at species level (reads > threshold): 2816840
      >> Total reads discarded (species reads < threshold): 0
      >> Reads distributed: 1131837
      >> Reads not distributed (eg. no species above threshold): 749
      >> Unclassified reads: 333670
BRACKEN OUTPUT PRODUCED: sample1-bracken-out.tsv
PROGRAM END TIME: 08-21-2020 20:38:06
  Bracken complete.
On: sample2
Loading database information... done.
4327888 sequences (1202.67 Mbp) processed in 51.709s (5021.8 Kseq/m, 1395.50 Mbp/m).
  4049074 sequences classified (93.56%)
  278814 sequences unclassified (6.44%)
>> Checking for Valid Options...
>> Running Bracken
>> python src/est_abundance.py -i sample2-kraken2-report.txt -o sample2-bracken-out.tsv -k kraken2-db/database150mers.kmer_distrib -l S -t 0
PROGRAM START TIME: 08-21-2020 20:46:21
BRACKEN SUMMARY (Kraken report: sample2-kraken2-report.txt)
    >>> Threshold: 0
    >>> Number of species in sample: 3092
      >> Number of species with reads > threshold: 3092
      >> Number of species with reads < threshold: 0
    >>> Total reads in sample: 4327888
      >> Total reads kept at species level (reads > threshold): 2811244
      >> Total reads discarded (species reads < threshold): 0
      >> Reads distributed: 1237198
      >> Reads not distributed (eg. no species above threshold): 632
      >> Unclassified reads: 278814
BRACKEN OUTPUT PRODUCED: sample2-bracken-out.tsv
PROGRAM END TIME: 08-21-2020 20:46:21
  Bracken complete.
On: sample3
Loading database information... done.
4283096 sequences (1199.29 Mbp) processed in 52.500s (4895.0 Kseq/m, 1370.62 Mbp/m).
  3949426 sequences classified (92.21%)
  333670 sequences unclassified (7.79%)
>> Checking for Valid Options...
>> Running Bracken
>> python src/est_abundance.py -i sample3-kraken2-report.txt -o sample3-bracken-out.tsv -k kraken2-db/database150mers.kmer_distrib -l S -t 0
PROGRAM START TIME: 08-21-2020 20:54:10
BRACKEN SUMMARY (Kraken report: sample3-kraken2-report.txt)
    >>> Threshold: 0
    >>> Number of species in sample: 3245
      >> Number of species with reads > threshold: 3245
      >> Number of species with reads < threshold: 0
    >>> Total reads in sample: 4283096
      >> Total reads kept at species level (reads > threshold): 2816840
      >> Total reads discarded (species reads < threshold): 0
      >> Reads distributed: 1131837
      >> Reads not distributed (eg. no species above threshold): 749
      >> Unclassified reads: 333670
BRACKEN OUTPUT PRODUCED: sample3-bracken-out.tsv
PROGRAM END TIME: 08-21-2020 20:54:11
  Bracken complete.
On: sample4
Loading database information... done.
5270348 sequences (1444.14 Mbp) processed in 67.224s (4704.0 Kseq/m, 1288.96 Mbp/m).
  4102225 sequences classified (77.84%)
  1168123 sequences unclassified (22.16%)
>> Checking for Valid Options...
>> Running Bracken
>> python src/est_abundance.py -i sample4-kraken2-report.txt -o sample4-bracken-out.tsv -k kraken2-db/database150mers.kmer_distrib -l S -t 0
PROGRAM START TIME: 08-21-2020 21:02:14
BRACKEN SUMMARY (Kraken report: sample4-kraken2-report.txt)
    >>> Threshold: 0
    >>> Number of species in sample: 3588
      >> Number of species with reads > threshold: 3588
      >> Number of species with reads < threshold: 0
    >>> Total reads in sample: 5270348
      >> Total reads kept at species level (reads > threshold): 2929872
      >> Total reads discarded (species reads < threshold): 0
      >> Reads distributed: 1171564
      >> Reads not distributed (eg. no species above threshold): 789
      >> Unclassified reads: 1168123
BRACKEN OUTPUT PRODUCED: sample4-bracken-out.tsv
PROGRAM END TIME: 08-21-2020 21:02:14
  Bracken complete.
On: sample5
Loading database information... done.
4998731 sequences (1259.68 Mbp) processed in 45.906s (6533.5 Kseq/m, 1646.43 Mbp/m).
  3853084 sequences classified (77.08%)
  1145647 sequences unclassified (22.92%)
>> Checking for Valid Options...
>> Running Bracken
>> python src/est_abundance.py -i sample5-kraken2-report.txt -o sample5-bracken-out.tsv -k kraken2-db/database150mers.kmer_distrib -l S -t 0
PROGRAM START TIME: 08-21-2020 21:08:26
BRACKEN SUMMARY (Kraken report: sample5-kraken2-report.txt)
    >>> Threshold: 0
    >>> Number of species in sample: 4623
      >> Number of species with reads > threshold: 4623
      >> Number of species with reads < threshold: 0
    >>> Total reads in sample: 4998731
      >> Total reads kept at species level (reads > threshold): 3289608
      >> Total reads discarded (species reads < threshold): 0
      >> Reads distributed: 563285
      >> Reads not distributed (eg. no species above threshold): 191
      >> Unclassified reads: 1145647
BRACKEN OUTPUT PRODUCED: sample5-bracken-out.tsv
PROGRAM END TIME: 08-21-2020 21:08:26
  Bracken complete.
On: sample6
Loading database information... done.
4998731 sequences (1249.68 Mbp) processed in 37.057s (8093.5 Kseq/m, 2023.38 Mbp/m).
  3842226 sequences classified (76.86%)
  1156505 sequences unclassified (23.14%)
>> Checking for Valid Options...
>> Running Bracken
>> python src/est_abundance.py -i sample6-kraken2-report.txt -o sample6-bracken-out.tsv -k kraken2-db/database150mers.kmer_distrib -l S -t 0
PROGRAM START TIME: 08-21-2020 21:13:36
BRACKEN SUMMARY (Kraken report: sample6-kraken2-report.txt)
    >>> Threshold: 0
    >>> Number of species in sample: 4514
      >> Number of species with reads > threshold: 4514
      >> Number of species with reads < threshold: 0
    >>> Total reads in sample: 4998731
      >> Total reads kept at species level (reads > threshold): 3273618
      >> Total reads discarded (species reads < threshold): 0
      >> Reads distributed: 568384
      >> Reads not distributed (eg. no species above threshold): 224
      >> Unclassified reads: 1156505
BRACKEN OUTPUT PRODUCED: sample6-bracken-out.tsv
PROGRAM END TIME: 08-21-2020 21:13:37
  Bracken complete.
On: sample7
Loading database information... done.
4998734 sequences (1259.68 Mbp) processed in 43.783s (6850.3 Kseq/m, 1726.27 Mbp/m).
  3813625 sequences classified (76.29%)
  1185109 sequences unclassified (23.71%)
>> Checking for Valid Options...
>> Running Bracken
>> python src/est_abundance.py -i sample7-kraken2-report.txt -o sample7-bracken-out.tsv -k kraken2-db/database150mers.kmer_distrib -l S -t 0
PROGRAM START TIME: 08-21-2020 21:18:22
BRACKEN SUMMARY (Kraken report: sample7-kraken2-report.txt)
    >>> Threshold: 0
    >>> Number of species in sample: 4433
      >> Number of species with reads > threshold: 4433
      >> Number of species with reads < threshold: 0
    >>> Total reads in sample: 4998734
      >> Total reads kept at species level (reads > threshold): 3347470
      >> Total reads discarded (species reads < threshold): 0
      >> Reads distributed: 465937
      >> Reads not distributed (eg. no species above threshold): 218
      >> Unclassified reads: 1185109
BRACKEN OUTPUT PRODUCED: sample7-bracken-out.tsv
PROGRAM END TIME: 08-21-2020 21:18:22
  Bracken complete.
On: sample8
Loading database information... done.
4998734 sequences (1249.68 Mbp) processed in 37.208s (8060.7 Kseq/m, 2015.18 Mbp/m).
  3801365 sequences classified (76.05%)
  1197369 sequences unclassified (23.95%)
>> Checking for Valid Options...
>> Running Bracken
>> python src/est_abundance.py -i sample8-kraken2-report.txt -o sample8-bracken-out.tsv -k kraken2-db/database150mers.kmer_distrib -l S -t 0
PROGRAM START TIME: 08-21-2020 21:22:35
BRACKEN SUMMARY (Kraken report: sample8-kraken2-report.txt)
    >>> Threshold: 0
    >>> Number of species in sample: 4328
      >> Number of species with reads > threshold: 4328
      >> Number of species with reads < threshold: 0
    >>> Total reads in sample: 4998734
      >> Total reads kept at species level (reads > threshold): 3331565
      >> Total reads discarded (species reads < threshold): 0
      >> Reads distributed: 469565
      >> Reads not distributed (eg. no species above threshold): 235
      >> Unclassified reads: 1197369
BRACKEN OUTPUT PRODUCED: sample8-bracken-out.tsv
PROGRAM END TIME: 08-21-2020 21:22:36
  Bracken complete.
1
118
5.0G
```

### Usage Info
```
22:44.40 real, 1447.33 user, 368.05 sys, 49824956 max_mem, 0 av_mem, 133% cpu
```
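
The usage figures can be cross-checked with a little arithmetic. The sketch below assumes the numbers come from GNU `time`, where `%E` is printed as `[h:]mm:ss` and `%M` is the peak resident set size in kilobytes. The helper function names are hypothetical and exist only for this illustration.

```shell
#!/usr/bin/env bash
# Illustrative helpers (hypothetical names) for reading GNU time output.

# Convert an elapsed time in [h:]mm:ss(.ss) form, as printed by %E, to seconds.
elapsed_to_seconds() {
  echo "$1" | awk -F: '{ s = 0; for (i = 1; i <= NF; i++) s = s * 60 + $i; printf "%.2f\n", s }'
}

# Convert kilobytes (assumed unit of %M) to GiB.
kb_to_gib() {
  awk -v kb="$1" 'BEGIN { printf "%.1f\n", kb / (1024 * 1024) }'
}

# CPU utilisation as a percentage: (user + sys) / real * 100.
cpu_percent() {
  awk -v u="$1" -v s="$2" -v r="$3" 'BEGIN { printf "%.0f\n", (u + s) / r * 100 }'
}

# Values copied from usage-info.txt
real_s=$(elapsed_to_seconds "22:44.40")
echo "real seconds:      $real_s"
echo "peak memory (GiB): $(kb_to_gib 49824956)"
echo "cpu percent:       $(cpu_percent 1447.33 368.05 "$real_s")"
```

Under these assumptions the run took about 1364 seconds of wall-clock time and peaked at roughly 47.5 GiB of memory. The recomputed CPU figure, (1447.33 + 368.05) / 1364.40, comes to about 133%, which agrees with the `%P` value that `time` reported.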