Looking into seemingly high proportion of *B. anthracis* re-assignment by bracken

--- tags: cov-irt --- # Looking into seemingly high proportion of *B. anthracis* re-assignment by bracken **Table of Contents** [TOC] Mostly happening with CRR samples from Shen et al. 2020. One of the highest is [CRR125950](https://ngdc.cncb.ac.cn/gsa/browse/CRA002476/CRR125995). ## Looking into CRR125950 ### Summary kraken2 is classifying 571 reads to *B. anthracis*: ```bash grep "Bacillus anthracis" H01_CRR125950.report | awk -F $'\t' ' $4 == "S" ' | cut -f 2 | sed 's/^/# /' # 571 ``` Bracken gives us back 252,471 reads classified to *B. anthracis* (~20% of the total): ```bash grep "^name\|Bacillus anthracis" CRR125950_filtered_1_reads_trim25_kraken2_mason_db_confidence0_bracken_db-mason_db_r-150_l-S_t-0 | column | sed 's/^/# /' # name taxonomy_id taxonomy_lvl kraken_assigned_reads added_reads new_est_reads fraction_total_reads # Bacillus anthracis 1392 S 571 251900 252471 0.20836 ``` Bracken is supposed to do this, in the sense that kraken2 was built to as confidently as possible assign a single read to the lowest taxonmic rank it can, and bracken is supposed to bump those assignments down what it thinks is the most likely source based on all the kraken2 data at the rank we request. But this is quite a jump that stood out to us. **So we want to know:** 1) How far up taxonomically is bracken going to collapse kraken2's assignments down to this species? 2) Does it seem appropriate? ### Current conclusion **After the looking below:** 1. It is going *very* far up taxonomically, like above the Firmicutes phylum. I'm not totally sure how it's driving down specifically to *B. anthracis*. 2. For now, based on what's shown on this page, I don't think it's appropriate. There are still more moving parts than I've been able to spend the time pinning down (specifically, why bracken is pushing so many reads down to *B. anthracis*), so I'm not at a position where I'd open an issue or report it to the devs, as it's possible this is just a consequence of us asking bracken to do an impossible task – push short-read taxonomic classifications down to lower ranks than were distinguishable. But I think we should jump ship on it for this work. --- ### Work #### Getting some relevant files **Downloading our bracken species-level output** ```bash curl -L -o CRR125950_filtered_1_reads_trim25_kraken2_mason_db_confidence0_bracken_db-mason_db_r-150_l-S_t-0 https://osf.io/qsbex/download ``` **Downloading our bracken genus-level output** ```bash curl -L -o CRR125950_filtered_1_reads_trim25_kraken2_mason_db_confidence0_bracken_db-mason_db_r-150_l-G_t-0 https://osf.io/2xkvt/download ``` **Downloading our kraken2 read-based output file** ```bash curl -L -o CRR125950_filtered_1_reads_trim25_kraken2_mason_db_confidence0.out https://osf.io/cags8/download ``` **Downloading our kraken2 report output file** ```bash curl -L -o H01_CRR125950.report https://osf.io/uns54/download ``` **Downloading our kraken2-classified reads** ```bash curl -L -o CRR125950_filtered_1_reads_trim25_kraken2_classified_mason_db_confidence0.fq.gz https://osf.io/3x4j8/download ``` **Downloading starting forward read file from [here](https://bigd.big.ac.cn/gsa/browse/CRA002476/CRR125950)** ```bash curl -LO ftp://download.big.ac.cn/gsa3/CRA002476/CRR125950/CRR125950_f1.fq.gz ``` --- #### Peeking at things ##### Bracken species-level output **Bracken says there were 4,676 assignments to the *Bacillus* genus by kraken2** (according to the *species*-level bracken output) ```bash grep "^Bacillus" CRR125950_filtered_1_reads_trim25_kraken2_mason_db_confidence0_bracken_db-mason_db_r-150_l-S_t-0 | awk -F $'\t' ' { sum += $4 } END { print sum } ' | sed 's/^/# /' # 4676 ``` ***B. anthracis* was not the *Bacillus* species with the most reads assigned by kraken2, *B. thuringiensis* was** * *B. thuringiensis* had 1,389 reads assigned by kraken2, and 22,143 added by bracken * *B. anthracis* had 571 reads assigned by kraken2, and 251,900 added by bracken ```bash cat <( head -n 1 CRR125950_filtered_1_reads_trim25_kraken2_mason_db_confidence0_bracken_db-mason_db_r-150_l-S_t-0 ) <( grep "^Bacillus" CRR125950_filtered_1_reads_trim25_kraken2_mason_db_confidence0_bracken_db-mason_db_r-150_l-S_t-0 | sort -t $'\t' -nrk 4 ) | head | column | sed 's/^/# /' # name taxonomy_id taxonomy_lvl kraken_assigned_reads added_reads new_est_reads fraction_total_reads # Bacillus thuringiensis 1428 S 1389 22143 23532 0.01942 # Bacillus anthracis 1392 S 571 251900 252471 0.20836 # Bacillus coagulans 1398 S 502 222 724 0.00060 # Bacillus cereus 1396 S 416 19407 19823 0.01636 # Bacillus megaterium 1404 S 322 571 893 0.00074 # Bacillus pseudomycoides 64104 S 299 502 801 0.00066 # Bacillus paralicheniformis 1648923 S 143 111 254 0.00021 # Bacillus sp. BD59S 2499213 S 138 395 533 0.00044 # Bacillus subtilis 1423 S 119 20476 20595 0.01700 ``` ##### Bracken genus-level output **Bracken says there were 5,922 assignments to the *Bacillus* genus by kraken2** (according to the *genus*-level bracken output) ```bash grep "^Bacillus" CRR125950_filtered_1_reads_trim25_kraken2_mason_db_confidence0_bracken_db-mason_db_r-150_l-G_t-0 | awk -F $'\t' ' { sum += $4 } END { print sum } ' | sed 's/^/# /' # 5922 ``` ##### Kraken2 output **Out of 2,333,954 input reads, 1,211,752 were classified at all by kraken2.** ```bash wc -l CRR125950_filtered_1_reads_trim25_kraken2_mason_db_confidence0.out | sed 's/^/# /' # 2333954 CRR125950_filtered_1_reads_trim25_kraken2_mason_db_confidence0.out grep -c "^C" CRR125950_filtered_1_reads_trim25_kraken2_mason_db_confidence0.out | sed 's/^/# /' # 1211752 ``` **5,922 were classified to or under the genus *Bacillus*** ```bash grep "Bacillus" H01_CRR125950.report | awk -F $'\t' ' $4 == "G" ' | cut -f 2 | sed 's/^/# /' # 5922 ``` **7,865 were classified to or under the family Bacillaceae** ```bash grep "Bacillaceae" H01_CRR125950.report | awk -F $'\t' ' $4 == "F" ' | cut -f 2 | sed 's/^/# /' # 7865 ``` **26,659 were classified to or under the order Bacillales** ```bash grep "Bacillales" H01_CRR125950.report | awk -F $'\t' ' $4 == "O" ' | cut -f 2 | sed 's/^/# /' # 26659 ``` **48,606 were classified to or under the class Bacilli** ```bash grep "Bacilli" H01_CRR125950.report | awk -F $'\t' ' $4 == "C" ' | cut -f 2 | sed 's/^/# /' ``` **55,740 were classified to or under the phylum Firmicutes** ```bash grep "Firmicutes" H01_CRR125950.report | awk -F $'\t' ' $4 == "P" ' | cut -f 2 | sed 's/^/# /' # 55740 ``` **131,921 were classified to or under the "D1" Terrabacteria group that holds the phylum Firmicutes** ```bash grep "Terrabacteria group" H01_CRR125950.report | awk -F $'\t' ' $4 == "D1" ' | cut -f 2 | sed 's/^/# /' # 131921 ``` So even there, we are not at the 252,471 reads that were summarized at the species-level by bracken as *B. anthracis*. **1,203,756 reads were classified to or under the domain Bacteria** ```bash grep "Bacteria" H01_CRR125950.report | awk -F $'\t' ' $4 == "D" ' | cut -f 2 | sed 's/^/# /' # 1203756 ``` > **So bracken is trying to bump things down quite a bit it seems when we ask for the species-level or genus-level summary.** ##### Reads classified as *B. anthracis* by kraken2 Getting headers: ```bash grep -w anthracis CRR125950_filtered_1_reads_trim25_kraken2_mason_db_confidence0.out | cut -f 2 > B-anthracis-kraken-classified-headers.txt ``` Pulling out seqs: ```bash # turning into fasta first so can use my bit-parse-fasta-by-headers program (https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit) sed 's/^@/>/' <( zcat CRR125950_filtered_1_reads_trim25_kraken2_classified_mason_db_confidence0.fq.gz )| paste - - - - | cut -f 1,2 | tr "\t" "\n" > CRR125950_filtered_1_reads_trim25_kraken2_classified_mason_db_confidence0.fa bit-parse-fasta-by-headers -i CRR125950_filtered_1_reads_trim25_kraken2_classified_mason_db_confidence0.fa -w B-anthracis-kraken-classified-headers.txt -o B-anthracis-kraken-classified.fa ``` That pulls out the 571, and looking at them, they almost all contain long homopolymers: 220 of the total 571 have at least 25 T's in a row: ```bash grep -c "TTTTTTTTTTTTTTTTTTTTTTTTT" B-anthracis-kraken-classified.fa | sed 's/^/# /' # 220 ``` 317 have at least 25 A's in a row: ```bash grep -c "AAAAAAAAAAAAAAAAAAAAAAAAA" B-anthracis-kraken-classified.fa | sed 's/^/# /' # 317 ``` --- #### Mapping classified reads to *B. anthracis* and *B. thuringiensis* representative genomes The following-used programs can be installed in a new conda environment and activated as such: ```bash conda create -y -n tax-probing -c conda-forge -c bioconda -c defaults -c astrobiomike bit=1.8.09 bowtie2=2.4.1 blast=2.10.1 conda activate tax-probing ``` **Getting *B. anthracis* rep. genome accession and downloading genome** ```bash esearch -query '"Bacillus anthracis"[ORGN] AND "latest refseq"[filter] AND "representative genome"[filter] AND (latest[filter] AND all[filter] NOT anomalous[filter])' -db assembly | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession > b-anthracis-acc.txt bit-dl-ncbi-assemblies -w b-anthracis-acc.txt -f fasta ``` **Getting *B. thuringiensis* rep. genome accession and downloading** ```bash esearch -query '"Bacillus thuringiensis"[ORGN] AND "latest refseq"[filter] AND "representative genome"[filter] AND (latest[filter] AND all[filter] NOT anomalous[filter])' -db assembly | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession > b-thuringiensis-acc.txt bit-dl-ncbi-assemblies -w b-thuringiensis-acc.txt -f fasta ``` **Unzipping and renaming headers for easier parsing later** ```bash gunzip GCF_000008445.1.fa.gz gunzip GCF_000161495.1.fa.gz bit-rename-fasta-headers -i GCF_000008445.1.fa -w B_anthracis -o B_anthracis.fa && rm GCF_000008445.1.fa bit-rename-fasta-headers -i GCF_000161495.1.fa -w B_thuringiensis -o B_thuringiensis.fa && rm GCF_000161495.1.fa ``` **Checking out homopolymers we saw above** *B. anthracis* ref genome has at least one stretch of 36 As in it: ```bash grep -c "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA" B_anthracis.fa | sed 's/^/# /' # 1 ``` And at least 2 of 15 Ts: ```bash grep -c "TTTTTTTTTTTTTTT" B_anthracis.fa | sed 's/^/# /' # 2 ``` *B. thuringiensis* ref genome has at least one stretch of 11 As in it: ```bash grep -c "AAAAAAAAAAA" B_thuringiensis.fa | sed 's/^/# /' # 1 ``` And at least 1 of 11 Ts: ```bash grep -c "TTTTTTTTTTT" B_thuringiensis.fa | sed 's/^/# /' # 1 ``` I suspect the only really problematic one here is the 36 As in the *B. anthracis* genome. **Combining together, making bowtie2 db of both, and mapping** ```bash cat B_anthracis.fa B_thuringiensis.fa > Bacillus-rep-genomes.fa bowtie2-build Bacillus-rep-genomes.fa Bacillus-rep-genomes-index bowtie2 -x Bacillus-rep-genomes-index -U CRR125950_filtered_1_reads_trim25_kraken2_classified_mason_db_confidence0.fq.gz -p 4 --al aligned-reads.fq --no-unal > /dev/null # converting reads to fasta format sed 's/^@/>/' aligned-reads.fq | paste - - - - | cut -f 1,2 | tr "\t" "\n" > aligned-reads.fa ``` 26,425 sequences aligned to either of these based on bowtie2 defaults. ```bash grep -c ">" aligned-reads.fa | sed 's/^/# /' ``` While that's still an order of magnitude lower than the 252,471 that were assigned to *B. anthracis* by bracken, bowtie2 is pretty stringent relatively speaking, so maybe this isn't a big deal yet. Let's see what the percent identities of the recruited reads look like. #### SIDE NOTE There is more to think about here, almost all of the seqs that successfuly mapped to the refs are extremely low-complexity. 20,499 of the total ~26,000 have 25 T's in a row: ```bash grep -c "TTTTTTTTTTTTTTTTTTTTTTTTT" aligned-reads.fa | sed 's/^/# /' # 20499 ``` 5,143 have 25 A's in a row: ```bash grep -c "AAAAAAAAAAAAAAAAAAAAAAAAA" aligned-reads.fa | sed 's/^/# /' # 5143 ``` To be sure, something odd would still need to be going on with bracken (if there is a problem), but this is probably something we should filter out earlier, like after trimmomatic since I'm not seeing a way to do it with that. I'm looking at [fastp](https://github.com/OpenGene/fastp#low-complexity-filter) at the moment as an option. (With this in mind, I'm not sure how informative the following blasting approach matters, as we are mostly blasting uninformative seqs – unless we get very few hits because the uninformative ones don't align properly.) #### BLASTing successfully mapped reads against genomes ```bash makeblastdb -dbtype nucl -in Bacillus-rep-genomes.fa -out Bacillus-rep-genomes-BLAST-db blastn -task dc-megablast -query aligned-reads.fa -db Bacillus-rep-genomes-BLAST-db -outfmt "6 qseqid qlen sseqid slen length qcovus pident evalue bitscore" -max_hsps 1 -max_target_seqs 1 | sort -nrk 8 > blast_out.tmp && cat <(printf "qseqid\tqlen\tsseqid\tslen\tlength\tqcovus\tpident\tevalue\tbitscore\n") blast_out.tmp > blast-out.tsv && rm blast_out.tmp ``` 568 successfully aligned based on dc-megablast (probably so few due to the uninformative mass noted above): ```bash tail -n +2 blast-out.tsv | wc -l | sed 's/^/# /' # 568 ``` 181 to *B. anthracis* ```bash grep "B_anthracis" blast-out.tsv | wc -l | sed 's/^/# /' # 181 ``` 387 to *B. thuringiensis* ```bash grep "B_thuringiensis" blast-out.tsv | wc -l | sed 's/^/# /' # 387 ``` **Looking at percent identities of each** ```bash grep "B_anthracis" blast-out.tsv | awk -F $'\t' ' $6 > 50 ' | cut -f 7 > b-anthracis-blast-pidents.txt grep "B_thuringiensis" blast-out.tsv | awk -F $'\t' ' $6 > 50 ' | cut -f 7 > b-thuringiensis-blast-pidents.txt ``` Getting summary of distributions from R: ```bash cat > summary.R << 'EOF' args = commandArgs(trailingOnly=TRUE) vec = scan(args[1], quiet=TRUE) summary(vec) EOF ``` ```bash Rscript summary.R b-anthracis-blast-pidents.txt | sed 's/^/# /' # Min. 1st Qu. Median Mean 3rd Qu. Max. # 87.84 90.97 91.77 91.64 92.48 94.22 ``` ```bash Rscript summary.R b-thuringiensis-blast-pidents.txt | sed 's/^/# /' # Min. 1st Qu. Median Mean 3rd Qu. Max. # 87.92 89.37 91.67 92.05 94.06 100.00 ``` > Both look pretty similar in their distributions. With *B. thuringiensis* having more based on kraken2 and based on alignment. ### Visualizing recruitment in anvi'o ``` ## anvio 6.2 anvi-gen-contigs-database -f Bacillus-rep-genomes.fa -n Bacillus_poking -o Bacillus-rep-genomes.db --skip-gene-calling # converting, sorting, and indexing mapping file samtools view -b aligned.sam | samtools sort -@ 4 > aligned.bam samtools index -@ 4 aligned.bam # profiling # needed to jump to master to do this after --skip-gene-calling currently (https://github.com/merenlab/anvio/issues/1406) anvi-profile -i aligned.bam -c Bacillus-rep-genomes.db -o aligned-profile -T 4 # back to anvi'o 6.2 # making items order file to be able to visualize without any coverage or TNF clustering anvi-export-table --table splits_basic_info Bacillus-rep-genomes.db tail -n +2 splits_basic_info.txt | cut -f 1 > items-order.txt anvi-import-items-order -i items-order.txt -p aligned-profile/PROFILE.db --name arranged --make-default # making collection to color the genomes paste items-order.txt <( cut -f 1,2 -d "_" items-order.txt ) > collection.tsv anvi-import-collection -c Bacillus-rep-genomes.db -p aligned-profile/PROFILE.db -C genomes collection.tsv ``` As seen [above](https://hackmd.io/KOYV61V7Q5mPm09ScZ6XsA?both#Reads-classified-as-B-anthracis-by-kraken2) by looking at the reads that were classified by kraken2 as *B. anthracis* and seeing they virtually all had giant homopolymer stretches of A/T, virtually all the recruitment is to one spot: <a href="https://i.imgur.com/f3i3bBb.png"><img src="https://i.imgur.com/f3i3bBb.png"></a> With 51X mean coverage, but only 0.3% [detection](http://merenlab.org/2017/05/08/anvio-views/#detection). And if we look at that split more closely, we can see all reads are mapping to the one spot with the homopolymer in the ref genome: <a href="https://i.imgur.com/XlUi805.png"><img src="https://i.imgur.com/XlUi805.png"></a> <br> Zoomed in: <a href="https://i.imgur.com/hdVS6yE.png"><img src="https://i.imgur.com/hdVS6yE.png"></a> Looking at the sequence: <a href="https://i.imgur.com/CFSuraJ.png"><img src="https://i.imgur.com/CFSuraJ.png"></a> --- **Neither *B. anthracis* nor *B. thuringiensis* is in this sample.** <a href="https://i.imgur.com/ZhCbXyc.png"><img src="https://i.imgur.com/ZhCbXyc.png"></a> ---