This took about 20 minutes to set up as run below.
NOTE
The following was executed as below on 29-Nov-2020. The `--no-masking` flag is included in the `kraken2-build --download-library` step for the reasons discussed at the end of this page.
Roughly following along with the steps described here.
Downloading human reference (takes ~2 minutes as run here):
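The download step might look like the following sketch (the database directory name `kraken2-human-db` is an assumption, used throughout these examples):

```shell
# download the human reference library into our database directory,
# skipping dustmasker low-complexity masking with --no-masking
kraken2-build --download-library human --db kraken2-human-db --no-masking
```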
Downloading NCBI taxonomy info needed (takes ~10 minutes):
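Something like this, pointed at the same assumed database directory:

```shell
# download the NCBI taxonomy files kraken2 needs into the database directory
kraken2-build --download-taxonomy --db kraken2-human-db
```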
Building database (takes ~20 minutes as run here):
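The build step might look like this sketch (the thread count is an arbitrary assumption; adjust to your machine):

```shell
# build the database from the downloaded library and taxonomy
kraken2-build --build --db kraken2-human-db --threads 4
```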
Removing intermediate files:
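Roughly like so:

```shell
# remove intermediate library/taxonomy files, keeping only the
# *.k2d files kraken2 needs at classification time
kraken2-build --clean --db kraken2-human-db
```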
The final database is ~3GB compressed and ~4.3GB uncompressed. It can be downloaded with the following:
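Something along these lines (the URL below is a placeholder, not the real archive link – substitute the actual link from this page):

```shell
# placeholder URL – swap in the actual archive location
curl -L -o kraken2-human-db.tar.gz "https://example.com/kraken2-human-db.tar.gz"
tar -xzvf kraken2-human-db.tar.gz
```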
The following test fastq file can be downloaded to make a single-end or paired-end mock dataset if wanted. It holds 1 human read, 1 phiX read, and 1 E. coli read:
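With the database in place, running the mock paired-end dataset through it might look like this sketch (the read and output filenames are assumptions):

```shell
# classify reads against the human db; unclassified reads (the non-human
# ones we want to keep) are written out with the '#' in the filename
# expanding to 1 and 2 for the two read files
kraken2 --db kraken2-human-db --threads 4 \
        --paired sample_R1.fq sample_R2.fq \
        --output sample-kraken2-output.txt \
        --report sample-kraken2-report.tsv \
        --unclassified-out "sample-non-human_R#.fq"
```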
First, "masking" is the process of making it so that low-complexity regions aren't seen/considered by whatever process we are going to put them through. A commonly used program that does this is dustmasker, which is used by NCBI, kraken2, centrifuge, and others.
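For reference, running dustmasker on its own might look like this sketch (the input filename is an assumption):

```shell
# lowercase-mask low-complexity regions of the input fasta
dustmasker -in genome.fa -out genome-masked.fa -outfmt fasta
```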
While working on some read-based classifier databases, I ran a sample that had already been screened against a default-built kraken2 human database to remove human reads, and a ton of reads were still being classified as human. I was working specifically with this file if curious, and about 1 million out of 2 million reads were getting classified as human – again, this was after going through the human-read removal process. This happened with both centrifuge and a different kraken2 database I had access to (both held human, bacteria, archaea, and fungi), and when I pulled out the reads getting classified as human and ran them through BLAST, sure enough they came back clearly human.
So, after driving myself crazy for quite a long, wonderful time trying to figure out WTF was going on (including tracking down possible changes in the human reference fasta that was used to build the dbs), it all came down to this masking process. kraken2 does masking by default, and generally I agree that's a good idea. If we want to classify things, I'd use masking to build that database, since we don't want ambiguous regions giving us false positives. But if we want to identify and remove something, as is the case here, I don't think masking is helpful. Aside from these being inherently low-information sequences by definition, there are only two possibilities for reads that would match an un-masked, low-complexity fragment of the human genome: either they really are human reads, or they are low-complexity reads from something else that we couldn't confidently attribute anyway.
In both of those cases, given that we want to be conservative with regard to removing any possible human DNA before uploading data to GeneLab, I think we want to remove them. So for building a database like this, whose sole purpose is to remove things that match the human reference genome, I think it's better to build the reference database without masking.