# Human read-removal with kraken2
> This is how we do it in GeneLab too.
Setting this up as run below took about 20 minutes, but there is a command to download the database as built on 30-June-2020 [below](https://hackmd.io/@astrobiomike/kraken2-human-read-removal#Download-database-as-built-on-30-June-2020) if we'd rather skip building it.
[toc]
## Installation with conda if needed
```bash
conda create -n kraken2 -c conda-forge -c bioconda -c defaults kraken2
conda activate kraken2
```
## Setting up kraken2 database with human genome only
> Generally following along with kraken2 manual [here](https://github.com/DerrickWood/kraken2/wiki/Manual#custom-databases)
Downloading human library (masks low complexity regions by default; takes like 5 minutes as run here):
```bash
mkdir kraken2_human_db
kraken2-build --download-library human --db kraken2_human_db/ --threads 4
```
Downloading NCBI tax info (takes like 5 minutes):
```bash
kraken2-build --download-taxonomy --db kraken2_human_db/
```
Building database (takes ~10 minutes as run here):
```bash
kraken2-build --build --db kraken2_human_db/ --threads 4
```
Removing intermediate files (saves a lot of space):
```bash
kraken2-build --clean --db kraken2_human_db/
```
## Download database as built on 30-June-2020
The database built on 30-June-2020 can be downloaded as follows if we want to skip building a new one as done above (at the time it was built, this used the RefSeq version of [GRCh38.p13](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39)). It's 2.8 GB compressed, 3.8 GB uncompressed.
```bash
curl -L -o kraken2_human_db.tar.gz https://ndownloader.figshare.com/files/23567780
tar -xzvf kraken2_human_db.tar.gz
```
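Whether built or downloaded, a cleaned kraken2 database directory should hold just the three `.k2d` files (`hash.k2d`, `opts.k2d`, `taxo.k2d`). Here's a small hypothetical helper (plain shell, not part of kraken2) we could use to confirm they're present and non-empty:

```shell
#!/usr/bin/env bash
# hypothetical helper (not part of kraken2): checks that a database
# directory holds the three non-empty .k2d files kraken2 needs
check_kraken2_db() {
    local db_dir="$1"
    local status=0
    for f in hash.k2d opts.k2d taxo.k2d; do
        if [ -s "${db_dir}/${f}" ]; then
            echo "${f} present"
        else
            echo "${f} missing or empty"
            status=1
        fi
    done
    return "$status"
}
```

E.g., `check_kraken2_db kraken2_human_db/` should report all three files present.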
## Example commands filtering out human reads
Getting tiny example data:
```bash
curl -L -o sample-1.fq.gz https://ndownloader.figshare.com/files/23567891
```
### Single-end example
Performing classification/filtering:
```bash
kraken2 --db kraken2_human_db/ --threads 4 \
--output sample-1-kraken2-out.txt \
--report sample-1-kraken2-report.txt \
--unclassified-out sample-1-filtered-reads.fq \
sample-1.fq.gz
# then compressing the output if wanted
gzip sample-1-filtered-reads.fq
```
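One quick way to see how much was removed is to count reads before and after filtering. Here's a hypothetical helper (standard shell, nothing kraken2-specific) that counts reads in a fastq file, gzipped or not, assuming the standard 4 lines per read:

```shell
#!/usr/bin/env bash
# hypothetical helper: counts reads in a fastq file (4 lines per read),
# handling gzipped or plain files
count_fastq_reads() {
    local fq="$1"
    local lines
    if [[ "$fq" == *.gz ]]; then
        lines=$(zcat "$fq" | wc -l)
    else
        lines=$(wc -l < "$fq")
    fi
    echo $(( lines / 4 ))
}
```

E.g., comparing `count_fastq_reads sample-1.fq.gz` against `count_fastq_reads sample-1-filtered-reads.fq.gz` shows how many reads were classified as human and dropped.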
### Paired-end example
Making mock paired-end data:
```bash
cp sample-1.fq.gz sample-1-R1.fq.gz
cp sample-1.fq.gz sample-1-R2.fq.gz
```
Performing classification/filtering:
```bash
kraken2 --paired --db kraken2_human_db/ --threads 4 \
--output sample-1-kraken2-out.txt \
--report sample-1-kraken2-report.txt \
--unclassified-out sample-1-filtered-reads-R#.fq \
sample-1-R1.fq.gz sample-1-R2.fq.gz
# compressing output reads if wanted
gzip sample-1-filtered-reads-R_1.fq
gzip sample-1-filtered-reads-R_2.fq
```
## Example tracking info on reads removed per sample if wanted
Here's one way we can make a table with info like the percent of reads removed from each sample. As written, it assumes output files named like above ({sample-ID}-kraken2-out.txt) and needs a single-column text file of the unique sample IDs as input. To demonstrate with more than one sample, here we copy the outputs from above to mock up a second sample:
```bash
cp sample-1-kraken2-out.txt sample-2-kraken2-out.txt
cp sample-1-kraken2-report.txt sample-2-kraken2-report.txt
```
Making a sample list input file:
```bash
printf "sample-1\nsample-2\n" > sample-list.txt
```
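Before running the summary script, it can be worth confirming every sample ID in the list actually has its kraken2 outputs alongside it. A small hypothetical check (the `-kraken2-out.txt` / `-kraken2-report.txt` suffixes match the output naming used above):

```shell
#!/usr/bin/env bash
# hypothetical helper: reports any sample IDs in a list that are
# missing their expected kraken2 output or report files
check_kraken2_outputs() {
    local list="$1"
    local missing=0
    while read -r sample_id; do
        for suffix in -kraken2-out.txt -kraken2-report.txt; do
            if [ ! -f "${sample_id}${suffix}" ]; then
                echo "missing: ${sample_id}${suffix}"
                missing=1
            fi
        done
    done < "$list"
    return "$missing"
}
```

E.g., `check_kraken2_outputs sample-list.txt` prints nothing and exits 0 when all expected files are in the current directory.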
Pasting and running this next code block will generate the bash script that does the summarizing:
```bash
cat << 'EOF' > summarizing-kraken2-read-removal.sh
#!/usr/bin/env bash
# making sure we aren't adding to one already started
rm -f building.tmp
for SAMPLE_ID in $(cat "$1");
do
# total fragments = one line per fragment in the kraken2 output file
total_fragments=$(wc -l < "${SAMPLE_ID}-kraken2-out.txt" | tr -d " ")
# fragments retained = clade count on the "unclassified" line of the report
fragments_retained=$(grep -w -m 1 "unclassified" "${SAMPLE_ID}-kraken2-report.txt" | cut -f 2)
perc_removed=$(printf "%.2f\n" $(echo "scale=4; 100 - ${fragments_retained} / ${total_fragments} * 100" | bc -l) )
printf "%s\t%s\t%s\t%s\n" "${SAMPLE_ID}" "${total_fragments}" "${fragments_retained}" "${perc_removed}" >> building.tmp
done
# combining with a header
cat <( printf "Sample_ID\tTotal_reads_before\tTotal_reads_after\tPercent_reads_removed\n" ) building.tmp > kraken2-read-removal-summary.tsv && rm building.tmp
EOF
```
Making script executable:
```bash
chmod +x summarizing-kraken2-read-removal.sh
```
Now that script can be put wherever we want, but it looks for the kraken2 output files relative to where it's run, so it's probably easiest to run it in the directory that holds those output files. Then the sample list just needs the sample IDs, as in this example:
```bash
./summarizing-kraken2-read-removal.sh sample-list.txt
```
Which gives us this summary table written to `kraken2-read-removal-summary.tsv`:
```
Sample_ID Total_reads_before Total_reads_after Percent_reads_removed
sample-1 2 1 50.00
sample-2 2 1 50.00
```