# Human read-removal with kraken2
> This is how we do it in GeneLab too.
Setting this up as run below took about 20 minutes, but there is a command to download the database as built on 30-June-2020 [below](https://hackmd.io/@astrobiomike/kraken2-human-read-removal#Download-database-as-built-on-30-June-2020) if we'd rather skip building it.
[toc]
## Installation with conda if needed
```bash
conda create -n kraken2 -c conda-forge -c bioconda -c defaults kraken2
conda activate kraken2
```
## Setting up kraken2 database with human genome only
> Generally following along with kraken2 manual [here](https://github.com/DerrickWood/kraken2/wiki/Manual#custom-databases)
Downloading human library (masks low complexity regions by default; takes like 5 minutes as run here):
```bash
mkdir kraken2_human_db
kraken2-build --download-library human --db kraken2_human_db/ --threads 4
```
Downloading NCBI tax info (takes like 5 minutes):
```bash
kraken2-build --download-taxonomy --db kraken2_human_db/
```
Building database (takes ~10 minutes as run here):
```bash
kraken2-build --build --db kraken2_human_db/ --threads 4
```
Removing intermediate files (saves a lot of space):
```bash
kraken2-build --clean --db kraken2_human_db/
```
## Download database as built on 30-June-2020
The database built on 30-June-2020 can be downloaded as follows if we want to skip building a new one as done above (at the time it was built, this used the RefSeq version of [GRCh38.p13](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39)). It's 2.8 GB compressed, 3.8 GB uncompressed.
```bash
curl -L -o kraken2_human_db.tar.gz https://ndownloader.figshare.com/files/23567780
tar -xzvf kraken2_human_db.tar.gz
```
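Whether built or downloaded, a cleaned kraken2 database directory should hold just the three `.k2d` files (`hash.k2d`, `opts.k2d`, `taxo.k2d`). Here's a small hypothetical helper (plain shell, not part of kraken2) we could use to confirm they're present and non-empty:

```shell
#!/usr/bin/env bash
# hypothetical helper (not part of kraken2): checks that a database
# directory holds the three non-empty .k2d files kraken2 needs
check_kraken2_db() {
    local db_dir="$1"
    local status=0
    for f in hash.k2d opts.k2d taxo.k2d; do
        if [ -s "${db_dir}/${f}" ]; then
            echo "${f} present"
        else
            echo "${f} missing or empty"
            status=1
        fi
    done
    return "$status"
}
```

E.g., `check_kraken2_db kraken2_human_db/` should report all three files present.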
## Example commands filtering out human reads
Getting tiny example data:
```bash
curl -L -o sample-1.fq.gz https://ndownloader.figshare.com/files/23567891
```
### Single-end example
Performing classification/filtering:
```bash
kraken2 --db kraken2_human_db/ --threads 4 \
--output sample-1-kraken2-out.txt \
--report sample-1-kraken2-report.txt \
--unclassified-out sample-1-filtered-reads.fq \
sample-1.fq.gz
# then compressing the output if wanted
gzip sample-1-filtered-reads.fq
```
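One quick way to see how much was removed is to count reads before and after filtering. Here's a hypothetical helper (standard shell, nothing kraken2-specific) that counts reads in a fastq file, gzipped or not, assuming the standard 4 lines per read:

```shell
#!/usr/bin/env bash
# hypothetical helper: counts reads in a fastq file (4 lines per read),
# handling gzipped or plain files
count_fastq_reads() {
    local fq="$1"
    local lines
    if [[ "$fq" == *.gz ]]; then
        lines=$(zcat "$fq" | wc -l)
    else
        lines=$(wc -l < "$fq")
    fi
    echo $(( lines / 4 ))
}
```

E.g., comparing `count_fastq_reads sample-1.fq.gz` against `count_fastq_reads sample-1-filtered-reads.fq.gz` shows how many reads were classified as human and dropped.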
### Paired-end example
Making mock paired-end data:
```bash
cp sample-1.fq.gz sample-1-R1.fq.gz
cp sample-1.fq.gz sample-1-R2.fq.gz
```
Performing classification/filtering:
```bash
kraken2 --paired --db kraken2_human_db/ --threads 4 \
--output sample-1-kraken2-out.txt \
--report sample-1-kraken2-report.txt \
--unclassified-out sample-1-filtered-reads-R#.fq \
sample-1-R1.fq.gz sample-1-R2.fq.gz
# compressing output reads if wanted
gzip sample-1-filtered-reads-R_1.fq
gzip sample-1-filtered-reads-R_2.fq
```
## Example tracking info on reads removed per sample if wanted
Here's one way we can make a table with info like the percent of reads removed from each sample. As written, it assumes output files named like above ({sample-ID}-kraken2-out.txt) and needs a single-column text file of the unique sample IDs as input. To demonstrate with more than one sample, here we copy the outputs from above to mock up a second sample:
```bash
cp sample-1-kraken2-out.txt sample-2-kraken2-out.txt
cp sample-1-kraken2-report.txt sample-2-kraken2-report.txt
```
Making a sample list input file:
```bash
printf "sample-1\nsample-2\n" > sample-list.txt
```
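Before running the summary script, it can be worth confirming every sample ID in the list actually has its kraken2 outputs alongside it. A small hypothetical check (the `-kraken2-out.txt` / `-kraken2-report.txt` suffixes match the output naming used above):

```shell
#!/usr/bin/env bash
# hypothetical helper: reports any sample IDs in a list that are
# missing their expected kraken2 output or report files
check_kraken2_outputs() {
    local list="$1"
    local missing=0
    while read -r sample_id; do
        for suffix in -kraken2-out.txt -kraken2-report.txt; do
            if [ ! -f "${sample_id}${suffix}" ]; then
                echo "missing: ${sample_id}${suffix}"
                missing=1
            fi
        done
    done < "$list"
    return "$missing"
}
```

E.g., `check_kraken2_outputs sample-list.txt` prints nothing and exits 0 when all expected files are in the current directory.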
Pasting and running this next code block will generate the bash script that does the summarizing:
```bash
cat << 'EOF' > summarizing-kraken2-read-removal.sh
#!/usr/bin/env bash
# making sure we aren't adding to one already started
rm -f building.tmp
for SAMPLE_ID in $(cat "$1");
do
# total fragments = one line per fragment in the kraken2 output file
total_fragments=$(wc -l < "${SAMPLE_ID}-kraken2-out.txt" | tr -d " ")
# fragments retained = clade count on the "unclassified" line of the report
fragments_retained=$(grep -w -m 1 "unclassified" "${SAMPLE_ID}-kraken2-report.txt" | cut -f 2)
perc_removed=$(printf "%.2f\n" $(echo "scale=4; 100 - ${fragments_retained} / ${total_fragments} * 100" | bc -l) )
printf "%s\t%s\t%s\t%s\n" "${SAMPLE_ID}" "${total_fragments}" "${fragments_retained}" "${perc_removed}" >> building.tmp
done
# combining with a header
cat <( printf "Sample_ID\tTotal_reads_before\tTotal_reads_after\tPercent_reads_removed\n" ) building.tmp > kraken2-read-removal-summary.tsv && rm building.tmp
EOF
```
Making script executable:
```bash
chmod +x summarizing-kraken2-read-removal.sh
```
Now that script can be put wherever we want, but it looks for the kraken2 output files relative to where it's run, so it's probably easiest to run it in the directory that holds those output files. Then the sample list just needs the sample IDs, as in this example:
```bash
./summarizing-kraken2-read-removal.sh sample-list.txt
```
Which gives us this summary table written to `kraken2-read-removal-summary.tsv`:
```
Sample_ID Total_reads_before Total_reads_after Percent_reads_removed
sample-1 2 1 50.00
sample-2 2 1 50.00
```