Trimming fastq reads

## Trimming fastq reads

### Trimmomatic

You can use the following [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic) command to trim your reads. It removes any potential adapter sequences (`ILLUMINACLIP`), trims low-quality bases from the start and end of each read (`LEADING:3`, `TRAILING:3`), and discards reads shorter than 36 bases (`MINLEN:36`). Remember to change the file names to those you are using!

This tutorial runs on a dataset downloaded from NCBI SRA, SRR11277344 (paired-end reads from amplicon data).

```
java -jar /trimmomatic-0.36/trimmomatic-0.36.jar PE \
    SRR11277344_pass_1.fastq.gz SRR11277344_pass_2.fastq.gz \
    SRR11277344_pass_paired_1.fastq.gz SRR11277344_pass_unpaired_1.fastq.gz \
    SRR11277344_pass_paired_2.fastq.gz SRR11277344_pass_unpaired_2.fastq.gz \
    ILLUMINACLIP:/trimmomatic-0.36/adapters/TruSeq3-PE.fa:2:30:10:2:keepBothReads \
    LEADING:3 TRAILING:3 MINLEN:36
```

It generates FOUR output files: paired reads for mates 1 and 2, and unpaired reads for mates 1 and 2.

*The order of the output files is very important!*

1. `SRR*_1_paired.fastq.gz` (read 1, paired)
2. `SRR*_1_unpaired.fastq.gz` (read 1, unpaired)
3. `SRR*_2_paired.fastq.gz` (read 2, paired)
4. `SRR*_2_unpaired.fastq.gz` (read 2, unpaired)

In this case, the trimmed files are identical to those we were using (we downloaded reads that had already been filtered for QC).

To avoid taking up too much disk space, let's remove the initial set of reads we downloaded, along with the unpaired output files:

```
rm *_pass_1.fastq.gz
rm *_pass_2.fastq.gz
rm *_unpaired_*
```

Remember that `rm` removes files FOREVER (if you are working on a cluster). To check that you have the right list of files, replace `rm` with `echo`. It will show you the list of files without removing them.

```
echo *_pass_1.fastq.gz
echo *_pass_2.fastq.gz
echo *_unpaired_*
```

Of course, if you were analyzing a long list of samples, you would not do this one by one! We can write loops for this process too.

### Alternative tool for trimming reads: fastp

Another fastq trimming tool, fastp, is installed in /groups/fr2020/bin/fastp. We are going to use it for BioProject PRJNA680161, since it contains Nextera adapter sequences too (along with Illumina).
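Go to the working directory and run:

`/bin/fastp/fastp -i SRR*_pass_1.fastq.gz -I SRR*_pass_2.fastq.gz -o SRR*_pass_1_clean.fastq.gz -O SRR*_pass_2_clean.fastq.gz -w 2 -j SRR*.json -h SRR*.html`

- `-w`: number of CPUs (worker threads)
- `-j` and `-h`: names of the JSON and HTML reports

Here `SRR*` stands for the run accession of each sample. Replace it with the actual accession: a shell wildcard cannot expand to an output file that does not exist yet, so the output and report names must be written out in full.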
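
As mentioned for Trimmomatic, a long list of samples is better handled with a loop than one command at a time. The following is a minimal sketch for the fastp command above, assuming every sample follows the `<accession>_pass_1.fastq.gz` / `<accession>_pass_2.fastq.gz` naming used in this tutorial. The `FASTP` and `DRY_RUN` variables and the `trim_all_pairs` function are hypothetical helpers introduced for illustration, not part of fastp. By default the loop only prints the commands, in the same spirit as replacing `rm` with `echo`.

```shell
#!/usr/bin/env bash
# Sketch: run fastp on every read pair found in the current directory.
# FASTP and DRY_RUN are illustrative helper variables (assumptions), not fastp options.
FASTP="${FASTP:-/bin/fastp/fastp}"   # path used in the command above; adjust to your install
DRY_RUN="${DRY_RUN:-1}"              # 1 = only print the commands, 0 = actually run them

trim_all_pairs() {
    # In dry-run mode every command is prefixed with `echo` instead of being executed.
    run=""
    [ "$DRY_RUN" = 1 ] && run="echo"

    for r1 in *_pass_1.fastq.gz; do
        [ -e "$r1" ] || continue            # no match: the pattern stays literal, skip it
        sample="${r1%_pass_1.fastq.gz}"     # strip the suffix, e.g. SRR11277344
        r2="${sample}_pass_2.fastq.gz"
        if [ ! -e "$r2" ]; then
            echo "skipping ${sample}: ${r2} not found" >&2
            continue
        fi
        $run "$FASTP" -i "$r1" -I "$r2" \
            -o "${sample}_pass_1_clean.fastq.gz" -O "${sample}_pass_2_clean.fastq.gz" \
            -w 2 -j "${sample}.json" -h "${sample}.html"
    done
}

trim_all_pairs
```

Run it once as is to check the printed commands, then set `DRY_RUN=0` to execute them. Samples whose second read file is missing are reported on standard error and skipped.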