---
tags: cov-irt
---
# Looking into seemingly high proportion of *B. anthracis* re-assignment by bracken
**Table of Contents**
[TOC]
Mostly happening with CRR samples from Shen et al. 2020. One of the highest is [CRR125950](https://ngdc.cncb.ac.cn/gsa/browse/CRA002476/CRR125995).
## Looking into CRR125950
### Summary
kraken2 is classifying 571 reads to *B. anthracis*:
```bash
grep "Bacillus anthracis" H01_CRR125950.report | awk -F $'\t' ' $4 == "S" ' | cut -f 2 | sed 's/^/# /'
# 571
```
Bracken gives us back 252,471 reads classified to *B. anthracis* (~20% of the total):
```bash
grep "^name\|Bacillus anthracis" CRR125950_filtered_1_reads_trim25_kraken2_mason_db_confidence0_bracken_db-mason_db_r-150_l-S_t-0 | column | sed 's/^/# /'
# name taxonomy_id taxonomy_lvl kraken_assigned_reads added_reads new_est_reads fraction_total_reads
# Bacillus anthracis 1392 S 571 251900 252471 0.20836
```
Bracken is supposed to do this, in the sense that kraken2 was built to as confidently as possible assign a single read to the lowest taxonmic rank it can, and bracken is supposed to bump those assignments down what it thinks is the most likely source based on all the kraken2 data at the rank we request. But this is quite a jump that stood out to us.
**So we want to know:**
1) How far up taxonomically is bracken going to collapse kraken2's assignments down to this species?
2) Does it seem appropriate?
### Current conclusion
**After the looking below:**
1. It is going *very* far up taxonomically, like above the Firmicutes phylum. I'm not totally sure how it's driving down specifically to *B. anthracis*.
2. For now, based on what's shown on this page, I don't think it's appropriate. There are still more moving parts than I've been able to spend the time pinning down (specifically, why bracken is pushing so many reads down to *B. anthracis*), so I'm not at a position where I'd open an issue or report it to the devs, as it's possible this is just a consequence of us asking bracken to do an impossible task – push short-read taxonomic classifications down to lower ranks than were distinguishable. But I think we should jump ship on it for this work.
---
### Work
#### Getting some relevant files
**Downloading our bracken species-level output**
```bash
curl -L -o CRR125950_filtered_1_reads_trim25_kraken2_mason_db_confidence0_bracken_db-mason_db_r-150_l-S_t-0 https://osf.io/qsbex/download
```
**Downloading our bracken genus-level output**
```bash
curl -L -o CRR125950_filtered_1_reads_trim25_kraken2_mason_db_confidence0_bracken_db-mason_db_r-150_l-G_t-0 https://osf.io/2xkvt/download
```
**Downloading our kraken2 read-based output file**
```bash
curl -L -o CRR125950_filtered_1_reads_trim25_kraken2_mason_db_confidence0.out https://osf.io/cags8/download
```
**Downloading our kraken2 report output file**
```bash
curl -L -o H01_CRR125950.report https://osf.io/uns54/download
```
**Downloading our kraken2-classified reads**
```bash
curl -L -o CRR125950_filtered_1_reads_trim25_kraken2_classified_mason_db_confidence0.fq.gz https://osf.io/3x4j8/download
```
**Downloading starting forward read file from [here](https://bigd.big.ac.cn/gsa/browse/CRA002476/CRR125950)**
```bash
curl -LO ftp://download.big.ac.cn/gsa3/CRA002476/CRR125950/CRR125950_f1.fq.gz
```
---
#### Peeking at things
##### Bracken species-level output
**Bracken says there were 4,676 assignments to the *Bacillus* genus by kraken2** (according to the *species*-level bracken output)
```bash
grep "^Bacillus" CRR125950_filtered_1_reads_trim25_kraken2_mason_db_confidence0_bracken_db-mason_db_r-150_l-S_t-0 | awk -F $'\t' ' { sum += $4 } END { print sum } ' | sed 's/^/# /'
# 4676
```
***B. anthracis* was not the *Bacillus* species with the most reads assigned by kraken2, *B. thuringiensis* was**
* *B. thuringiensis* had 1,389 reads assigned by kraken2, and 22,143 added by bracken
* *B. anthracis* had 571 reads assigned by kraken2, and 251,900 added by bracken
```bash
cat <( head -n 1 CRR125950_filtered_1_reads_trim25_kraken2_mason_db_confidence0_bracken_db-mason_db_r-150_l-S_t-0 ) <( grep "^Bacillus" CRR125950_filtered_1_reads_trim25_kraken2_mason_db_confidence0_bracken_db-mason_db_r-150_l-S_t-0 | sort -t $'\t' -nrk 4 ) | head | column | sed 's/^/# /'
# name taxonomy_id taxonomy_lvl kraken_assigned_reads added_reads new_est_reads fraction_total_reads
# Bacillus thuringiensis 1428 S 1389 22143 23532 0.01942
# Bacillus anthracis 1392 S 571 251900 252471 0.20836
# Bacillus coagulans 1398 S 502 222 724 0.00060
# Bacillus cereus 1396 S 416 19407 19823 0.01636
# Bacillus megaterium 1404 S 322 571 893 0.00074
# Bacillus pseudomycoides 64104 S 299 502 801 0.00066
# Bacillus paralicheniformis 1648923 S 143 111 254 0.00021
# Bacillus sp. BD59S 2499213 S 138 395 533 0.00044
# Bacillus subtilis 1423 S 119 20476 20595 0.01700
```
##### Bracken genus-level output
**Bracken says there were 5,922 assignments to the *Bacillus* genus by kraken2** (according to the *genus*-level bracken output)
```bash
grep "^Bacillus" CRR125950_filtered_1_reads_trim25_kraken2_mason_db_confidence0_bracken_db-mason_db_r-150_l-G_t-0 | awk -F $'\t' ' { sum += $4 } END { print sum } ' | sed 's/^/# /'
# 5922
```
##### Kraken2 output
**Out of 2,333,954 input reads, 1,211,752 were classified at all by kraken2.**
```bash
wc -l CRR125950_filtered_1_reads_trim25_kraken2_mason_db_confidence0.out | sed 's/^/# /'
# 2333954 CRR125950_filtered_1_reads_trim25_kraken2_mason_db_confidence0.out
grep -c "^C" CRR125950_filtered_1_reads_trim25_kraken2_mason_db_confidence0.out | sed 's/^/# /'
# 1211752
```
**5,922 were classified to or under the genus *Bacillus***
```bash
grep "Bacillus" H01_CRR125950.report | awk -F $'\t' ' $4 == "G" ' | cut -f 2 | sed 's/^/# /'
# 5922
```
**7,865 were classified to or under the family Bacillaceae**
```bash
grep "Bacillaceae" H01_CRR125950.report | awk -F $'\t' ' $4 == "F" ' | cut -f 2 | sed 's/^/# /'
# 7865
```
**26,659 were classified to or under the order Bacillales**
```bash
grep "Bacillales" H01_CRR125950.report | awk -F $'\t' ' $4 == "O" ' | cut -f 2 | sed 's/^/# /'
# 26659
```
**48,606 were classified to or under the class Bacilli**
```bash
grep "Bacilli" H01_CRR125950.report | awk -F $'\t' ' $4 == "C" ' | cut -f 2 | sed 's/^/# /'
```
**55,740 were classified to or under the phylum Firmicutes**
```bash
grep "Firmicutes" H01_CRR125950.report | awk -F $'\t' ' $4 == "P" ' | cut -f 2 | sed 's/^/# /'
# 55740
```
**131,921 were classified to or under the "D1" Terrabacteria group that holds the phylum Firmicutes**
```bash
grep "Terrabacteria group" H01_CRR125950.report | awk -F $'\t' ' $4 == "D1" ' | cut -f 2 | sed 's/^/# /'
# 131921
```
So even there, we are not at the 252,471 reads that were summarized at the species-level by bracken as *B. anthracis*.
**1,203,756 reads were classified to or under the domain Bacteria**
```bash
grep "Bacteria" H01_CRR125950.report | awk -F $'\t' ' $4 == "D" ' | cut -f 2 | sed 's/^/# /'
# 1203756
```
> **So bracken is trying to bump things down quite a bit it seems when we ask for the species-level or genus-level summary.**
##### Reads classified as *B. anthracis* by kraken2
Getting headers:
```bash
grep -w anthracis CRR125950_filtered_1_reads_trim25_kraken2_mason_db_confidence0.out | cut -f 2 > B-anthracis-kraken-classified-headers.txt
```
Pulling out seqs:
```bash
# turning into fasta first so can use my bit-parse-fasta-by-headers program (https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit)
sed 's/^@/>/' <( zcat CRR125950_filtered_1_reads_trim25_kraken2_classified_mason_db_confidence0.fq.gz )| paste - - - - | cut -f 1,2 | tr "\t" "\n" > CRR125950_filtered_1_reads_trim25_kraken2_classified_mason_db_confidence0.fa
bit-parse-fasta-by-headers -i CRR125950_filtered_1_reads_trim25_kraken2_classified_mason_db_confidence0.fa -w B-anthracis-kraken-classified-headers.txt -o B-anthracis-kraken-classified.fa
```
That pulls out the 571, and looking at them, they almost all contain long homopolymers:
220 of the total 571 have at least 25 T's in a row:
```bash
grep -c "TTTTTTTTTTTTTTTTTTTTTTTTT" B-anthracis-kraken-classified.fa | sed 's/^/# /'
# 220
```
317 have at least 25 A's in a row:
```bash
grep -c "AAAAAAAAAAAAAAAAAAAAAAAAA" B-anthracis-kraken-classified.fa | sed 's/^/# /'
# 317
```
---
#### Mapping classified reads to *B. anthracis* and *B. thuringiensis* representative genomes
The following-used programs can be installed in a new conda environment and activated as such:
```bash
conda create -y -n tax-probing -c conda-forge -c bioconda -c defaults -c astrobiomike bit=1.8.09 bowtie2=2.4.1 blast=2.10.1
conda activate tax-probing
```
**Getting *B. anthracis* rep. genome accession and downloading genome**
```bash
esearch -query '"Bacillus anthracis"[ORGN] AND "latest refseq"[filter] AND "representative genome"[filter] AND (latest[filter] AND all[filter] NOT anomalous[filter])' -db assembly | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession > b-anthracis-acc.txt
bit-dl-ncbi-assemblies -w b-anthracis-acc.txt -f fasta
```
**Getting *B. thuringiensis* rep. genome accession and downloading**
```bash
esearch -query '"Bacillus thuringiensis"[ORGN] AND "latest refseq"[filter] AND "representative genome"[filter] AND (latest[filter] AND all[filter] NOT anomalous[filter])' -db assembly | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession > b-thuringiensis-acc.txt
bit-dl-ncbi-assemblies -w b-thuringiensis-acc.txt -f fasta
```
**Unzipping and renaming headers for easier parsing later**
```bash
gunzip GCF_000008445.1.fa.gz
gunzip GCF_000161495.1.fa.gz
bit-rename-fasta-headers -i GCF_000008445.1.fa -w B_anthracis -o B_anthracis.fa && rm GCF_000008445.1.fa
bit-rename-fasta-headers -i GCF_000161495.1.fa -w B_thuringiensis -o B_thuringiensis.fa && rm GCF_000161495.1.fa
```
**Checking out homopolymers we saw above**
*B. anthracis* ref genome has at least one stretch of 36 As in it:
```bash
grep -c "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA" B_anthracis.fa | sed 's/^/# /'
# 1
```
And at least 2 of 15 Ts:
```bash
grep -c "TTTTTTTTTTTTTTT" B_anthracis.fa | sed 's/^/# /'
# 2
```
*B. thuringiensis* ref genome has at least one stretch of 11 As in it:
```bash
grep -c "AAAAAAAAAAA" B_thuringiensis.fa | sed 's/^/# /'
# 1
```
And at least 1 of 11 Ts:
```bash
grep -c "TTTTTTTTTTT" B_thuringiensis.fa | sed 's/^/# /'
# 1
```
I suspect the only really problematic one here is the 36 As in the *B. anthracis* genome.
**Combining together, making bowtie2 db of both, and mapping**
```bash
cat B_anthracis.fa B_thuringiensis.fa > Bacillus-rep-genomes.fa
bowtie2-build Bacillus-rep-genomes.fa Bacillus-rep-genomes-index
bowtie2 -x Bacillus-rep-genomes-index -U CRR125950_filtered_1_reads_trim25_kraken2_classified_mason_db_confidence0.fq.gz -p 4 --al aligned-reads.fq --no-unal > /dev/null
# converting reads to fasta format
sed 's/^@/>/' aligned-reads.fq | paste - - - - | cut -f 1,2 | tr "\t" "\n" > aligned-reads.fa
```
26,425 sequences aligned to either of these based on bowtie2 defaults.
```bash
grep -c ">" aligned-reads.fa | sed 's/^/# /'
```
While that's still an order of magnitude lower than the 252,471 that were assigned to *B. anthracis* by bracken, bowtie2 is pretty stringent relatively speaking, so maybe this isn't a big deal yet. Let's see what the percent identities of the recruited reads look like.
#### SIDE NOTE
There is more to think about here, almost all of the seqs that successfuly mapped to the refs are extremely low-complexity.
20,499 of the total ~26,000 have 25 T's in a row:
```bash
grep -c "TTTTTTTTTTTTTTTTTTTTTTTTT" aligned-reads.fa | sed 's/^/# /'
# 20499
```
5,143 have 25 A's in a row:
```bash
grep -c "AAAAAAAAAAAAAAAAAAAAAAAAA" aligned-reads.fa | sed 's/^/# /'
# 5143
```
To be sure, something odd would still need to be going on with bracken (if there is a problem), but this is probably something we should filter out earlier, like after trimmomatic since I'm not seeing a way to do it with that. I'm looking at [fastp](https://github.com/OpenGene/fastp#low-complexity-filter) at the moment as an option. (With this in mind, I'm not sure how informative the following blasting approach matters, as we are mostly blasting uninformative seqs – unless we get very few hits because the uninformative ones don't align properly.)
#### BLASTing successfully mapped reads against genomes
```bash
makeblastdb -dbtype nucl -in Bacillus-rep-genomes.fa -out Bacillus-rep-genomes-BLAST-db
blastn -task dc-megablast -query aligned-reads.fa -db Bacillus-rep-genomes-BLAST-db -outfmt "6 qseqid qlen sseqid slen length qcovus pident evalue bitscore" -max_hsps 1 -max_target_seqs 1 | sort -nrk 8 > blast_out.tmp && cat <(printf "qseqid\tqlen\tsseqid\tslen\tlength\tqcovus\tpident\tevalue\tbitscore\n") blast_out.tmp > blast-out.tsv && rm blast_out.tmp
```
568 successfully aligned based on dc-megablast (probably so few due to the uninformative mass noted above):
```bash
tail -n +2 blast-out.tsv | wc -l | sed 's/^/# /'
# 568
```
181 to *B. anthracis*
```bash
grep "B_anthracis" blast-out.tsv | wc -l | sed 's/^/# /'
# 181
```
387 to *B. thuringiensis*
```bash
grep "B_thuringiensis" blast-out.tsv | wc -l | sed 's/^/# /'
# 387
```
**Looking at percent identities of each**
```bash
grep "B_anthracis" blast-out.tsv | awk -F $'\t' ' $6 > 50 ' | cut -f 7 > b-anthracis-blast-pidents.txt
grep "B_thuringiensis" blast-out.tsv | awk -F $'\t' ' $6 > 50 ' | cut -f 7 > b-thuringiensis-blast-pidents.txt
```
Getting summary of distributions from R:
```bash
cat > summary.R << 'EOF'
args = commandArgs(trailingOnly=TRUE)
vec = scan(args[1], quiet=TRUE)
summary(vec)
EOF
```
```bash
Rscript summary.R b-anthracis-blast-pidents.txt | sed 's/^/# /'
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 87.84 90.97 91.77 91.64 92.48 94.22
```
```bash
Rscript summary.R b-thuringiensis-blast-pidents.txt | sed 's/^/# /'
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 87.92 89.37 91.67 92.05 94.06 100.00
```
> Both look pretty similar in their distributions. With *B. thuringiensis* having more based on kraken2 and based on alignment.
### Visualizing recruitment in anvi'o
```
## anvio 6.2
anvi-gen-contigs-database -f Bacillus-rep-genomes.fa -n Bacillus_poking -o Bacillus-rep-genomes.db --skip-gene-calling
# converting, sorting, and indexing mapping file
samtools view -b aligned.sam | samtools sort -@ 4 > aligned.bam
samtools index -@ 4 aligned.bam
# profiling
# needed to jump to master to do this after --skip-gene-calling currently (https://github.com/merenlab/anvio/issues/1406)
anvi-profile -i aligned.bam -c Bacillus-rep-genomes.db -o aligned-profile -T 4
# back to anvi'o 6.2
# making items order file to be able to visualize without any coverage or TNF clustering
anvi-export-table --table splits_basic_info Bacillus-rep-genomes.db
tail -n +2 splits_basic_info.txt | cut -f 1 > items-order.txt
anvi-import-items-order -i items-order.txt -p aligned-profile/PROFILE.db --name arranged --make-default
# making collection to color the genomes
paste items-order.txt <( cut -f 1,2 -d "_" items-order.txt ) > collection.tsv
anvi-import-collection -c Bacillus-rep-genomes.db -p aligned-profile/PROFILE.db -C genomes collection.tsv
```
As seen [above](https://hackmd.io/KOYV61V7Q5mPm09ScZ6XsA?both#Reads-classified-as-B-anthracis-by-kraken2) by looking at the reads that were classified by kraken2 as *B. anthracis* and seeing they virtually all had giant homopolymer stretches of A/T, virtually all the recruitment is to one spot:
<a href="https://i.imgur.com/f3i3bBb.png"><img src="https://i.imgur.com/f3i3bBb.png"></a>
With 51X mean coverage, but only 0.3% [detection](http://merenlab.org/2017/05/08/anvio-views/#detection).
And if we look at that split more closely, we can see all reads are mapping to the one spot with the homopolymer in the ref genome:
<a href="https://i.imgur.com/XlUi805.png"><img src="https://i.imgur.com/XlUi805.png"></a>
<br>
Zoomed in:
<a href="https://i.imgur.com/hdVS6yE.png"><img src="https://i.imgur.com/hdVS6yE.png"></a>
Looking at the sequence:
<a href="https://i.imgur.com/CFSuraJ.png"><img src="https://i.imgur.com/CFSuraJ.png"></a>
---
**Neither *B. anthracis* nor *B. thuringiensis* is in this sample.**
<a href="https://i.imgur.com/ZhCbXyc.png"><img src="https://i.imgur.com/ZhCbXyc.png"></a>
---