---
tags: GeneLab
---
# Human read-removal with kraken2
This took about 20 minutes to set up as run below.
[toc]
## Creating conda environment
```bash
conda create -n kraken2 -c conda-forge -c bioconda -c defaults kraken2=2.1.1
conda activate kraken2
```
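Quick sanity check that the environment is good to go (this should report version 2.1.1):
```bash
kraken2 --version
```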
## Setting up kraken2 database with human genome
---
> **NOTE**
> The following was executed as below on 29-Nov-2020. The `--no-masking` flag is included in the `kraken2-build --download-library` step for the reasons discussed at the end of this page [here](https://hackmd.io/X4NIcY-hQTyuQiOY0-UFKA#Why-build-with-no-masking).
---
This roughly follows the custom-database instructions in the kraken2 manual [here](https://github.com/DerrickWood/kraken2/wiki/Manual#custom-databases).
**Downloading human reference (takes ~2 minutes as run here):**
```bash
kraken2-build --download-library human --db kraken2-human-db --threads 30 --no-masking
```
**Downloading the needed NCBI taxonomy info (takes ~10 minutes as run here):**
```bash
kraken2-build --download-taxonomy --db kraken2-human-db/
```
**Building database (takes ~20 minutes as run here):**
```bash
kraken2-build --build --db kraken2-human-db/ --threads 30
```
**Removing intermediate files:**
```bash
kraken2-build --clean --db kraken2-human-db/
```
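After cleaning, the database directory should be left holding just the three `*.k2d` files kraken2 needs at runtime:
```bash
ls kraken2-human-db/
# hash.k2d  opts.k2d  taxo.k2d
```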
### Download database as built on 29-Nov-2020
3GB compressed, ~4.3GB uncompressed. Can be downloaded with the following:
```bash
curl -L -o kraken2-human-db.tar.gz https://ndownloader.figshare.com/files/25627058
tar -xzvf kraken2-human-db.tar.gz
```
---
## Tiny test fastqs
The following tiny test fastq file can be downloaded to make a mock single-end or paired-end dataset if wanted. It holds 1 human read, 1 phiX read, and 1 *E. coli* read (for the paired-end version, the same file is just saved under both the R1 and R2 names):
```bash
curl -L -o sample-A-R1.fq.gz https://ndownloader.figshare.com/files/25627406
# same file downloaded again under the R2 name to mock paired-end data
curl -L -o sample-A-R2.fq.gz https://ndownloader.figshare.com/files/25627406
```
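And here's a minimal sketch of how the database can then be used to screen out human reads from the mock paired-end files (the thread count and output filenames are just examples; in the `--unclassified-out` name, kraken2 swaps the `#` for `_1` and `_2` to make the two read files):
```bash
kraken2 --db kraken2-human-db --gzip-compressed --paired --threads 4 \
        --output sample-A-kraken2-output.txt --report sample-A-kraken2-report.tsv \
        --unclassified-out "sample-A-non-human_R#.fastq" \
        sample-A-R1.fq.gz sample-A-R2.fq.gz
```
With this tiny example, the human read should come back classified, while the phiX and *E. coli* reads should land in the "unclassified" (i.e. retained) output files. For a single-end run, we'd drop `--paired`, provide just the one input file, and give `--unclassified-out` a plain filename with no `#`.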
---
# Why build with no masking?
First, "masking" is the process of making it such that low-complexity regions aren't seen/considered by whatever process we are going to put them through. There is a commonly used program called [dustmasker](https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/app/dustmasker/) that does this that's used by NCBI, kraken2, centrifuge, and others.
While working on some read-based classifier databases, I ran a sample that had *already* been run against a default-built kraken2 human database to remove human reads, and I was still getting a ton of reads classified as human. I was working specifically with [this file](https://osf.io/hg7p4/) if curious, and about 1 million out of 2 million reads were getting classified as human – again, this was *after* going through the human-read removal process. This was happening with both centrifuge and a different kraken2 database I had access to (both of which held human, bacteria, archaea, and fungi), and when I pulled out the reads getting classified as human and ran them through BLAST, sure enough they were coming back clearly human.
So, after driving myself crazy for quite a long, wonderful time trying to figure out WTF was going on (including tracking down possible changes in the human reference fasta that was used to build the dbs), it all came down to this masking process. kraken2 *does* masking by default, and generally I agree that's a good idea. If we want to *classify* things, I'd use masking to build that database, since in that case we don't want ambiguous regions giving us false positives. But if we want to *identify and remove something*, as is the case here, I don't think masking is helpful. Aside from these being inherently low-information sequences by definition, there are only two possibilities for reads that would match an un-masked, low-complexity fragment of the human genome:
1. They only match human, in which case we would want them removed anyway
2. They are ambiguous enough that they came from something else but they still match at least the human genome too
In both of those cases, given that we want to be conservative with regard to removing any possible human DNA before uploading data to GeneLab, I think we want to remove them. So for building a database like this, whose sole purpose is to remove things that match the human reference genome, I think it's better to build the reference DB without masking.