Human and phiX read-removal with kraken2

This took about 20 minutes to set up as run below.

Creating conda environment

conda create -n kraken2 -c conda-forge -c bioconda -c defaults kraken2=2.0.9beta

conda activate kraken2
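
To double-check the install worked, we can print the version (just a quick sanity check; the version string should match what we installed above):

kraken2 --version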

Setting up kraken2 database with human and phiX genomes


NOTE
This was updated on 11-Sept-2020, built the same way except for one change: adding the --no-masking flag to the kraken2-build --download-library and kraken2-build --add-to-library commands. The reason for this is discussed at the end of this page here.

Roughly following along with here.

Downloading NCBI taxonomy info needed (takes like 5 minutes):

mkdir kraken2_human_and_phiX_db
kraken2-build --download-taxonomy --db kraken2_human_and_phiX_db/
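
The taxonomy files land in a taxonomy/ subdirectory of the database folder, so a quick way to confirm the download finished is to list it (we'd expect to see files like names.dmp and nodes.dmp; exact contents may vary by version):

ls kraken2_human_and_phiX_db/taxonomy/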

Downloading human reference (takes ~1 minute as run here):

kraken2-build --download-library human --db kraken2_human_and_phiX_db/ \
              --threads 40 --no-masking

Downloading and adding phiX genome to this library:

curl -L https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/819/615/GCF_000819615.1_ViralProj14015/GCF_000819615.1_ViralProj14015_genomic.fna.gz | gunzip > phiX.fa

kraken2-build --add-to-library phiX.fa --db kraken2_human_and_phiX_db/ --no-masking

# removing genome fasta
rm phiX.fa
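
Before building, we can peek at the library/ subdirectory to make sure both references made it in – I'd expect the downloaded human genome under human/ and anything passed to --add-to-library under added/ (directory names are my expectation here, not captured output):

ls kraken2_human_and_phiX_db/library/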

Building database (takes ~7 minutes as run here):

kraken2-build --build --db kraken2_human_and_phiX_db/ --threads 40
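
Once the build finishes, the database directory should contain the three main kraken2 database files (hash.k2d, opts.k2d, and taxo.k2d), which we can eyeball with:

ls -lh kraken2_human_and_phiX_db/*.k2d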

Removing intermediate files:

kraken2-build --clean --db kraken2_human_and_phiX_db/
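
If wanted, we can also peek at what ended up in the database with the kraken2-inspect program that comes with kraken2 (just peeking at the first few lines of its report-style output here; human and phiX should be among the taxa listed):

kraken2-inspect --db kraken2_human_and_phiX_db/ | head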

Download database as built on 11-Sept-2020 (LATEST)

3GB compressed, ~4.3GB uncompressed. Can be downloaded with the following:

curl -L -o kraken2_human_and_phiX_db.tar.gz https://ndownloader.figshare.com/files/24658262

tar -xzvf kraken2_human_and_phiX_db.tar.gz

Download database as built on 19-June-2020

This is the older one that was built with masking, and I don't think we should use it anymore. See below for why the change was made.

2.8GB compressed, ~4GB uncompressed. Can be downloaded with the following:

# NOTE: the date is in the tar.gz file name, but will not be part of the directory name once it is unpacked
curl -L -o kraken2_human_and_phiX_db_19_June_2020.tar.gz https://ndownloader.figshare.com/files/23276654

tar -xzvf kraken2_human_and_phiX_db_19_June_2020.tar.gz

Example command filtering out human and phiX reads

Getting tiny example data:

curl -L -o sample-1.fq.gz https://ndownloader.figshare.com/files/23237460

Performing filtering:

kraken2 --db kraken2_human_and_phiX_db/ --threads 20 \
        --output sample-1-kraken2-out.txt --report sample-1-kraken2-report.txt \
        --unclassified-out sample-1-filtered-reads.fq sample-1.fq.gz
        
# then compressing the output if wanted
gzip sample-1-filtered-reads.fq
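
That example is for single-end data. For paired-end data, kraken2 takes the --paired flag and wants a # in the --unclassified-out name, which it replaces with the read-pair number – a sketch with hypothetical input file names:

kraken2 --db kraken2_human_and_phiX_db/ --threads 20 --paired \
        --output sample-1-kraken2-out.txt --report sample-1-kraken2-report.txt \
        --unclassified-out sample-1-filtered-reads#.fq sample-1-R1.fq.gz sample-1-R2.fq.gz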

Tracking info on reads removed per sample

Here's one way we can make a table with info like the percent of reads removed from each sample. As written, it assumes output files named like above ({sample-ID}-kraken2-out.txt and {sample-ID}-kraken2-report.txt), and it needs a single-column text file of the unique sample IDs as input. To demonstrate with more than one sample, here we just copy the outputs from above to act as a second sample:

cp sample-1-kraken2-out.txt sample-2-kraken2-out.txt
cp sample-1-kraken2-report.txt sample-2-kraken2-report.txt

Making a sample list input file:

printf "sample-1\nsample-2\n" > sample-list.txt

Pasting and running this next codeblock will generate the bash script to do the summary:

cat << 'EOF' > summarizing-kraken2-read-removal.sh
#!/usr/bin/env bash

# making sure we aren't adding to one already started
rm -rf building.tmp

for SAMPLE_ID in $(cat "$1")
do

    # each line of the kraken2 output file corresponds to one read/fragment
    total_fragments=$(wc -l < ${SAMPLE_ID}-kraken2-out.txt | tr -d " ")

    # column 2 of the report's "unclassified" line holds the number of fragments left unclassified (i.e., retained)
    fragments_retained=$(grep -w -m 1 "unclassified" ${SAMPLE_ID}-kraken2-report.txt | cut -f 2)

    perc_removed=$(printf "%.2f\n" $(echo "scale=4; 100 - ${fragments_retained} / ${total_fragments} * 100" | bc -l) )

    printf "${SAMPLE_ID}\t${total_fragments}\t${fragments_retained}\t${perc_removed}\n" >> building.tmp

done

# combining
cat <( printf "Sample_ID\tTotal_reads_before\tTotal_reads_after\tPercent_reads_removed\n" ) building.tmp > kraken2-read-removal-summary.tsv && rm building.tmp

EOF

Making script executable:

chmod +x summarizing-kraken2-read-removal.sh

Now that script can be put wherever we want, but note that the sample IDs in the list are used to build file names relative to wherever the script is run. So it's probably easiest to run it in the directory that holds the kraken2 output files, in which case the list just needs the sample IDs, as in this example:

./summarizing-kraken2-read-removal.sh sample-list.txt

Which gives us this summary table:

Sample_ID  Total_reads_before  Total_reads_after  Percent_reads_removed
sample-1   2                   1                  50.00
sample-2   2                   1                  50.00
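
For a quick aligned view of that tab-delimited file in the terminal, standard column usage works:

column -t kraken2-read-removal-summary.tsv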

Why build with no masking?

First, "masking" is the process of masking low-complexity regions so they aren't seen/considered by whatever process we are going to put them through. There is a commonly used program called dustmasker that does this that's used by NCBI, kraken2, centrifuge, and others.

While working on some read-based classifier databases, I ran one of our samples that had already been run against the original 19-June-2020 human and phiX kraken2 db to remove human reads, and I was getting a ton of human reads classified. I was working specifically with this file if curious, and about 1 million out of 2 million reads were getting classified as human – again, this was after already going through the human-read removal process. It was happening with both centrifuge and a different kraken2 database I had access to (both held human, bacteria, archaea, and fungi), and when I pulled out the reads getting classified as human and ran them through BLAST, they sure enough came back clearly human.

So, after driving myself crazy for quite a long, wonderful time trying to figure out WTF was going on, it all came down to this masking process. kraken2 does masking by default, and generally it's a good idea: if we want to classify things, I'd use masking to build that database, because in that case we don't want ambiguous regions giving us false positives. But if we want to identify and remove something, as is the case here, I don't think masking is helpful. Aside from these being inherently low-information sequences by definition, there are only two possibilities for reads that would match an un-masked, low-complexity fragment of the human genome:

  1. They only match human, in which case we would want them removed anyway
  2. They are ambiguous enough that they came from something else but they still match at least the human genome too, and potentially other things, in which case we wouldn't want to trust their classification and would want them removed anyway

In both of those cases, we don't need or want them. So for building a database like this, whose sole purpose is to remove things that match the human or phiX genome, I think it's better to build the reference DB without masking.