---
tags: cov-irt
---

# Human and phiX read-removal with kraken2

This took about 20 minutes to set up as run below.

[toc]

## Creating conda environment

```bash
conda create -n kraken2 -c conda-forge -c bioconda -c defaults kraken2=2.0.9beta

conda activate kraken2
```

## Setting up kraken2 database with human and phiX genomes

---

> **NOTE**
> This was updated on 11-Sept-2020. It was built the same way except for one change: adding the `--no-masking` flag to the `kraken2-build --download-library` and `kraken2-build --add-to-library` commands. The reason for this is discussed at the end of this page [here](https://hackmd.io/Qd0XKEsUR1ebOnOPubqbrg#Why-building-with-no-masking).

Roughly following along with the custom-database instructions [here](https://github.com/DerrickWood/kraken2/wiki/Manual#custom-databases).

Downloading the NCBI taxonomy info needed (takes ~5 minutes):

```bash
mkdir kraken2_human_and_phiX_db

kraken2-build --download-taxonomy --db kraken2_human_and_phiX_db/
```

Downloading the human reference (takes ~1 minute as run here):

```bash
kraken2-build --download-library human --db kraken2_human_and_phiX_db/ \
              --threads 40 --no-masking
```

Downloading and adding the phiX genome to this library:

```bash
curl -L https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/819/615/GCF_000819615.1_ViralProj14015/GCF_000819615.1_ViralProj14015_genomic.fna.gz | gunzip > phiX.fa

kraken2-build --add-to-library phiX.fa --db kraken2_human_and_phiX_db/ --no-masking

# removing the genome fasta
rm phiX.fa
```

Building the database (takes ~7 minutes as run here):

```bash
kraken2-build --build --db kraken2_human_and_phiX_db/ --threads 40
```

Removing intermediate files:

```bash
kraken2-build --clean --db kraken2_human_and_phiX_db/
```

### Download database as built on 11-Sept-2020 (LATEST)

3GB compressed, ~4.3GB uncompressed. Can be downloaded with the following:

```bash
curl -L -o kraken2_human_and_phiX_db.tar.gz https://ndownloader.figshare.com/files/24658262

tar -xzvf kraken2_human_and_phiX_db.tar.gz
```

### Download database as built on 19-June-2020

> This is the older one that was built with masking, and that I don't think we should use anymore. See [below](https://hackmd.io/Qd0XKEsUR1ebOnOPubqbrg#Why-building-with-no-masking) for why the change.

2.8GB compressed, ~4GB uncompressed. Can be downloaded with the following:

```bash
#### NOTE: the date is on the tar.gz file, but will not be part of the directory name once it is unpacked
curl -L -o kraken2_human_and_phiX_db_19_June_2020.tar.gz https://ndownloader.figshare.com/files/23276654

tar -xzvf kraken2_human_and_phiX_db_19_June_2020.tar.gz
```

## Example command filtering out human and phiX reads

Getting tiny example data:

```bash
curl -L -o sample-1.fq.gz https://ndownloader.figshare.com/files/23237460
```

Performing the filtering:

```bash
kraken2 --db kraken2_human_and_phiX_db/ --threads 20 \
        --output sample-1-kraken2-out.txt --report sample-1-kraken2-report.txt \
        --unclassified-out sample-1-filtered-reads.fq sample-1.fq.gz

# then compressing the output if wanted
gzip sample-1-filtered-reads.fq
```
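Side note: the example above is single-end data. For paired-end data, my understanding of the kraken2 usage is that both read files are listed after adding the `--paired` flag, and the filename given to `--unclassified-out` should hold a `#` character that kraken2 replaces with "_1" and "_2" when writing out the mates. Here's a minimal sketch of that, where the input file names (`sample-2_R1.fq.gz`/`sample-2_R2.fq.gz`) are just hypothetical placeholders:

```bash
# hypothetical paired-end input files; the "#" in the --unclassified-out name
# should get expanded by kraken2 into "_1" and "_2" for the two mate files
kraken2 --db kraken2_human_and_phiX_db/ --threads 20 --paired \
        --output sample-2-kraken2-out.txt --report sample-2-kraken2-report.txt \
        --unclassified-out sample-2-filtered-reads#.fq sample-2_R1.fq.gz sample-2_R2.fq.gz
```

With `--paired`, each line of the main output file represents a read *pair* rather than a single read, which would carry through to the fragment counts tallied in the next section.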
## Tracking info on reads removed per sample

Here's one way we can make a table that holds some info like the percent of reads removed from each sample. As written, it assumes output files named like above ({sample-ID}-kraken2-out.txt), and it needs a single-column text file with the unique sample IDs as input.

For an example, here's copying the output from the above to just make a second output for example purposes:

```bash
cp sample-1-kraken2-out.txt sample-2-kraken2-out.txt
cp sample-1-kraken2-report.txt sample-2-kraken2-report.txt
```

Making a sample list input file:

```bash
printf "sample-1\nsample-2\n" > sample-list.txt
```

Pasting and running this next code block will generate the bash script that does the summary:

```bash
cat << 'EOF' > summarizing-kraken2-read-removal.sh
#!/usr/bin/env bash

# making sure we aren't adding to one already started
rm -rf building.tmp

for SAMPLE_ID in $(cat $1); do

    # each line of the kraken2 output file is one read/fragment
    total_fragments=$(wc -l ${SAMPLE_ID}-kraken2-out.txt | cut -f 1 -d " ")

    # column 2 of the report holds the fragment count for each clade,
    # and the "unclassified" row is what was retained by our filtering
    fragments_retained=$(grep -w -m 1 "unclassified" ${SAMPLE_ID}-kraken2-report.txt | cut -f 2)

    # getting the percent removed, to two decimal places
    perc_removed=$(printf "%.2f\n" $(echo "scale=4; 100 - ${fragments_retained} / ${total_fragments} * 100" | bc -l) )

    printf "${SAMPLE_ID}\t${total_fragments}\t${fragments_retained}\t${perc_removed}\n" >> building.tmp

done

# combining with a header
cat <( printf "Sample_ID\tTotal_reads_before\tTotal_reads_after\tPercent_reads_removed\n" ) building.tmp > kraken2-read-removal-summary.tsv && rm building.tmp
EOF
```

Making the script executable:

```bash
chmod +x summarizing-kraken2-read-removal.sh
```

Now that script can be put wherever we want, but it builds the file names from the sample IDs with no path attached. So it's probably easiest to run it in the directory that holds the kraken2 output files, where the input list just needs the unique sample IDs, as in this example:

```bash
./summarizing-kraken2-read-removal.sh sample-list.txt
```

Which gives us this summary table:

```
Sample_ID   Total_reads_before   Total_reads_after   Percent_reads_removed
sample-1    2                    1                   50.00
sample-2    2                    1                   50.00
```
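As a quick sanity check on a table like that, the same numbers can be pulled straight from the kraken2 outputs: the first column of the `--output` file is a `C`/`U` flag for classified/unclassified, one line per fragment, and the "unclassified" row of the `--report` file holds the retained count in column 2. A minimal sketch with the example files from above:

```bash
# total fragments = number of lines in the kraken2 output file
wc -l sample-1-kraken2-out.txt

# fragments retained = those flagged "U" (unclassified) in column 1
grep -c "^U" sample-1-kraken2-out.txt

# same number, from column 2 of the "unclassified" row of the report
grep -w -m 1 "unclassified" sample-1-kraken2-report.txt | cut -f 2
```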
---

# Why building with no masking?

First, "masking" is the process of hiding low-complexity regions so they aren't seen/considered by whatever process we are going to put the sequences through. There is a commonly used program called [dustmasker](https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/app/dustmasker/) that does this; it's used by NCBI, kraken2, centrifuge, and others (there's a tiny dustmasker example at the end of this page).

While working on some read-based classifier databases, I ran one of our samples that had already been run against the original 19-June-2020 human and phiX kraken2 db to remove human reads, and I was getting a ton of reads classified as human. I was working specifically with [this file](https://osf.io/hg7p4/) if curious, and about 1 million out of 2 million reads were getting classified as human – again, this is *after* going through the human-read removal process. This was happening with both centrifuge and a different kraken2 database I had access to (both of which held human, bacteria, archaea, and fungi), and when I pulled out the reads getting classified as human and ran them through BLAST, sure enough they were coming back clearly human.

So, after driving myself crazy for quite a long, wonderful time trying to figure out WTF was going on, it all came down to this masking process. kraken2 does masking by default, and generally it's a good idea. If we want to *classify* things, I'd use masking to build that database. In that case we don't want ambiguous regions giving us false positives. But if we want to *identify and remove something*, as is the case here, I don't think masking is helpful. Aside from these being inherently low-information sequences by definition, there are only two possibilities for reads that would match to an un-masked, low-complexity fragment of the human genome:

1. They only match human, in which case we would want them removed anyway
2. They are ambiguous enough that they came from something else but still match at least the human genome too, and potentially other things, in which case we wouldn't want to trust their classification and would want them removed anyway

In both of those cases, we don't need or want them. So for building a database like this, whose sole purpose is to remove things that match the human or phiX genome, I think it's better to build the reference DB without masking.
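For anyone curious what this masking actually looks like, here's a minimal sketch running dustmasker on the phiX genome we downloaded above. This assumes dustmasker is available (it comes with the NCBI BLAST+ package, e.g. from bioconda; it is *not* installed by the kraken2 environment at the top of this page), and phiX is tiny, so this is just to show the mechanics. With `-outfmt fasta`, low-complexity regions come back in lowercase:

```bash
# grabbing the phiX genome again (same link as used in the database build above)
curl -L https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/819/615/GCF_000819615.1_ViralProj14015/GCF_000819615.1_ViralProj14015_genomic.fna.gz | gunzip > phiX.fa

# with fasta output, dustmasker writes low-complexity regions in lowercase
dustmasker -in phiX.fa -out phiX-masked.fa -outfmt fasta

# counting sequence lines that hold any lowercase (i.e., masked) bases;
# those stretches are what a masked database build would never see
grep -v ">" phiX-masked.fa | grep -c "[acgt]"
```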