
Human read-removal with kraken2

This took about 20 minutes to set up as run below.

Creating conda environment

conda create -n kraken2 -c conda-forge -c bioconda -c defaults kraken2=2.1.1

conda activate kraken2

Setting up kraken2 database with human genome


NOTE
The following was executed as below on 29-Nov-2020. The --no-masking flag is included in the kraken2-build --download-library command for the reasons discussed at the end of this page.


Roughly following along with the steps described here.

Downloading human reference (takes ~2 minutes as run here):

kraken2-build --download-library human --db kraken2-human-db --threads 30 --no-masking

Downloading NCBI taxonomy info needed (takes ~10 minutes):

kraken2-build --download-taxonomy --db kraken2-human-db/

Building database (takes ~20 minutes as run here):

kraken2-build --build --db kraken2-human-db/ --threads 30

Removing intermediate files:

kraken2-build --clean --db kraken2-human-db/
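With the database built, removing human reads is a single kraken2 call. Here's a minimal sketch for a single-end sample (the input/output filenames are hypothetical placeholders):

```shell
# reads matching the human-only database go to --classified-out;
# everything else (what we want to keep) goes to --unclassified-out
kraken2 --db kraken2-human-db --threads 30 \
        --unclassified-out sample-nonhuman.fq \
        --classified-out sample-human.fq \
        --report sample-kraken2-report.tsv \
        sample.fq.gz > sample-kraken2-output.txt
```

Since this database holds only the human genome, anything that comes back classified can be treated as human and dropped.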

Download database as built on 29-Nov-2020

3GB compressed, ~4.3GB uncompressed. Can be downloaded with the following:

curl -L -o kraken2-human-db.tar.gz https://ndownloader.figshare.com/files/25627058

tar -xzvf kraken2-human-db.tar.gz
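If wanted, a quick sanity check that the extraction worked – a finished kraken2 database directory holds three .k2d files:

```shell
# hash.k2d, opts.k2d, and taxo.k2d are the files kraken2 needs at runtime
ls -lh kraken2-human-db/
```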

Tiny test fastqs

The following test fastq file can be downloaded to make a single-end or paired-end mock dataset if wanted. It holds 1 human read, 1 phiX read, and 1 E. coli read:

curl -L -o sample-A-R1.fq.gz https://ndownloader.figshare.com/files/25627406
curl -L -o sample-A-R2.fq.gz https://ndownloader.figshare.com/files/25627406
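As a sketch, here's how the mock paired-end files could be run through the human database built above (the "#" in the output filenames is replaced by kraken2 with the read-pair number):

```shell
kraken2 --db kraken2-human-db --threads 2 --paired \
        --unclassified-out sample-A-nonhuman_R#.fq \
        --classified-out sample-A-human_R#.fq \
        --report sample-A-kraken2-report.tsv \
        sample-A-R1.fq.gz sample-A-R2.fq.gz > sample-A-kraken2-output.txt
```

Of the 3 read pairs, only the human one should end up in the classified output; the phiX and E. coli pairs should come through as unclassified.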

Why build with no masking?

First, "masking" is the process of hiding low-complexity regions so they aren't seen/considered by whatever process we're going to put the sequences through. A commonly used program called dustmasker does this, and it's used by NCBI, kraken2, centrifuge, and others.
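To make that concrete, here's a tiny, hypothetical example of what dustmasker does (it ships with NCBI's BLAST+, so the call is guarded in case it isn't installed). With -outfmt fasta it soft-masks low-complexity stretches by lowercasing them:

```shell
# toy fasta with an obvious low-complexity (AT-repeat) stretch up front
printf ">seq1\nATATATATATATATATATATATATATATATATGCGTTACGGTACTGATCGATCGAAT\n" > toy.fa

# soft-mask: low-complexity regions come back in lowercase in toy-masked.fa
if command -v dustmasker >/dev/null; then
    dustmasker -in toy.fa -out toy-masked.fa -outfmt fasta
fi
```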

While working on some read-based classifier databases, I ran a sample that had already been run against a default-built kraken2 human database to remove human reads, and I was still getting a ton of reads classified as human. I was working specifically with this file if curious, and about 1 million out of 2 million reads were getting classified as human – again, this is after going through the human-read removal process. This was happening with both centrifuge and a different kraken2 database I had access to (both held human, bacteria, archaea, and fungi), and when I pulled out the reads getting classified as human and ran them through BLAST, sure enough they were coming back clearly human.

So, after driving myself crazy for quite a long, wonderful time trying to figure out WTF was going on (including tracking down possible changes in the human reference fasta that was used to build the dbs), it all came down to this masking process. kraken2 does masking by default, and generally I agree that's a good idea: if we want to classify things, I'd use masking to build that database, because we don't want ambiguous regions giving us false positives. But if we want to identify and remove something, as is the case here, I don't think masking is helpful. Aside from these being inherently low-information sequences by definition, there are only two possibilities for reads that would match an un-masked, low-complexity fragment of the human genome:

  1. They only match human, in which case we would want them removed anyway
  2. They are ambiguous enough that they came from something else but they still match at least the human genome too

In both of those cases, given that we want to be conservative with regard to removing any possible human DNA before uploading data to GeneLab, I think we want to remove them. So for building a database like this one, whose sole purpose is to remove things that match the human reference genome, I think it's better to build the reference DB without masking.