This took about 20 minutes to set up as run below.
NOTE
This was updated on 11-Sept-2020, built the same way except for one change: adding the `--no-masking` flag to the `kraken2-build --download-library` and `kraken2-build --add-to-library` commands. The reason for this is discussed at the end of this page here.
Roughly following along with the steps described here.
Downloading NCBI taxonomy info needed (takes like 5 minutes):
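A sketch of that step (the database directory name `kraken2-human-db` is a placeholder I'm using throughout these example blocks):

```bash
# download the NCBI taxonomy files kraken2 needs for building
kraken2-build --download-taxonomy --db kraken2-human-db
```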
Downloading human reference (takes ~1 minute as run here):
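Something like the following, with the `--no-masking` flag per the note above:

```bash
# download the human reference genome into the database's library,
# skipping the default low-complexity masking
kraken2-build --download-library human --db kraken2-human-db --no-masking
```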
Downloading and adding phiX genome to this library:
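The exact fetch command isn't shown here, but one way to do it, assuming the standard NCBI efetch endpoint and the phiX174 accession NC_001422.1:

```bash
# pull the phiX genome from NCBI as fasta
curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NC_001422.1&rettype=fasta" > phiX.fa

# add it to the library, again without masking
kraken2-build --add-to-library phiX.fa --db kraken2-human-db --no-masking
```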
Building database (takes ~7 minutes as run here):
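A sketch of the build step (the thread count is an assumption; set it to whatever your machine has available):

```bash
# build the kraken2 database from everything added to the library
kraken2-build --build --db kraken2-human-db --threads 4
```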
Removing intermediate files:
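That step is kraken2's built-in clean command:

```bash
# remove the large intermediate build files, keeping only
# the *.k2d files needed to run classifications
kraken2-build --clean --db kraken2-human-db
```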
3GB compressed, ~4.3GB uncompressed. Can be downloaded with the following:
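The actual hosting link isn't reproduced here; the general pattern, with a placeholder URL, would be:

```bash
# placeholder URL - substitute the actual download link
curl -L -o kraken2-human-db.tar.gz <link-to-db>
tar -xzvf kraken2-human-db.tar.gz
```

The same pattern applies to the older database tarball mentioned below.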
This is the older one that was built with masking, and I don't think we should use it anymore. See below for why the change was made.
2.8GB compressed, ~4GB uncompressed. Can be downloaded with the following:
Getting tiny example data:
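The original download links aren't shown here; assuming paired-end files named to match the filtering and summary steps below, it would look something like:

```bash
# placeholder URLs - substitute the actual example-data links
curl -L -o Sample-1_R1.fastq.gz <link-to-R1>
curl -L -o Sample-1_R2.fastq.gz <link-to-R2>
```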
Performing filtering:
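A sketch of the filtering run, with input and output names chosen to line up with the summary step below (the thread count is an assumption):

```bash
# classify reads against the human+phiX db; anything *unclassified*
# is what we keep, since classified reads hit human or phiX
kraken2 --db kraken2-human-db --threads 4 --gzip-compressed --paired \
        --unclassified-out Sample-1-filtered#.fastq \
        --output Sample-1-kraken2-out.txt \
        --report Sample-1-kraken2-report.tsv \
        Sample-1_R1.fastq.gz Sample-1_R2.fastq.gz
```

The `#` in the `--unclassified-out` filename is replaced by kraken2 with `_1` and `_2` for the two read files of the pair.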
Here's one way we can make a table with some summary info, like the percent of reads removed from each sample. As written, it assumes output files named like above ({sample-ID}-kraken2-out.txt), and it needs a single-column text file of unique sample IDs as input. Since we only have one sample here, here's copying the output from above to make a second output file just for example purposes:
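With the file naming assumed above, that's just:

```bash
cp Sample-1-kraken2-out.txt Sample-2-kraken2-out.txt
```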
Making a sample list input file:
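Again using the sample IDs assumed above:

```bash
printf "Sample-1\nSample-2\n" > sample-IDs.txt
```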
Pasting and running this next codeblock will generate the bash script to do the summary:
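The original codeblock isn't reproduced here; this is a sketch of one way to write such a script (the script name, column headers, and percent calculation are my own choices). It relies on the first column of each {sample-ID}-kraken2-out.txt file being "C" for classified (i.e., hit human or phiX and was removed) or "U" for unclassified:

```bash
cat << 'EOF' > summarize-kraken2-outputs.sh
#!/usr/bin/env bash
# Summarizes kraken2 output files into a table of reads removed per sample.
# usage: summarize-kraken2-outputs.sh <single-column file of unique sample IDs>

printf "sample\ttotal_fragments\tfragments_removed\tpercent_removed\n"

for sample in $(cat "$1"); do
    # one line per read (or read pair, if kraken2 was run with --paired)
    total=$(wc -l < "${sample}-kraken2-out.txt")
    # lines starting with "C" were classified as human or phiX
    removed=$(grep -c "^C" "${sample}-kraken2-out.txt")
    percent=$(echo "scale=2; ${removed} * 100 / ${total}" | bc)
    printf "%s\t%s\t%s\t%s\n" "${sample}" "${total}" "${removed}" "${percent}"
done
EOF
```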
Making script executable:
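With the sketch script above, that's:

```bash
chmod +x summarize-kraken2-outputs.sh
```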
Now that script can be put wherever we want, but the sample input list needs to point to where the kraken2 output files are stored. So it's probably easiest to run it in the directory that has the kraken2 output files, so the list can just hold the base sample IDs without any paths, as in this example:
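Continuing with the names assumed above:

```bash
./summarize-kraken2-outputs.sh sample-IDs.txt
```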
Which gives us this summary table:
First, "masking" is the process of hiding low-complexity regions of a sequence so they aren't seen/considered by whatever process we're going to put them through. A commonly used program that does this is dustmasker, which is used by NCBI, kraken2, centrifuge, and others.
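For reference, a typical dustmasker invocation looks something like this (the file names are placeholders):

```bash
# lowercase-mask low-complexity regions in a fasta file
dustmasker -in genome.fa -out genome-masked.fa -outfmt fasta
```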
While working on some read-based classifier databases, I ran one of our samples that had already been run against the original 19-June-2020 human and phiX kraken2 db to remove human reads, and I was getting a ton of reads classified as human. I was working specifically with this file if curious, and about 1 million out of 2 million reads were getting classified as human – again, this was after going through the human-read removal process. This was happening with both centrifuge and a different kraken2 database I had access to (both of which held human, bacteria, archaea, and fungi), and when I pulled out the reads getting classified as human and ran them through BLAST, they sure enough came back clearly human.
So, after driving myself crazy for quite a long, wonderful time trying to figure out WTF was going on, it all came down to this masking process. kraken2 does masking by default, and generally it's a good idea. If we want to classify things, I'd use masking to build that database – in that case we don't want ambiguous regions giving us false positives. But if we want to identify and remove something, as is the case here, I don't think masking is helpful. Aside from these being inherently low-information sequences by definition, there are only two possibilities for reads that would match to an un-masked, low-complexity fragment of the human genome:

1. they really did come from a human source, or
2. they are low-complexity reads from some other source that we wouldn't be able to confidently classify anyway
In both of those cases, we don't need or want them. So for building a database like this one, whose sole purpose is to remove things that match the human or phiX genome, I think it's better to build the reference DB without masking.