Trimming and SNPs

# Trimming and SNPs 9/8/20 --- https://datacarpentry.org/wrangling-genomics :::info This workshop is designed to run on AWS, but we're using it as applied to the CRC. You may see some instructions throughout that won't work on the CRC, particularly software. We have the same software installed on the CRC, but you need to run module load bio/2.0 to have access to them. ::: on CRC /tmp/ND_ICG_SEP_08/ copy files to new working directory data/ ``` mkdir data cd data cp /tmp/ND_ICG_SEP_08/* . mkdir untrimmed_fastq mv SRR* untrimmed_fastq/ cd untrimmed_fastq/ cd ../../ mkdir bin mv data/untrimmed_fastq/validateFiles bin/ ./bin/validateFiles -type=fastq data/untrimmed_fastq/* # Error count 0 <- output if successful cd data/untrimmed_fastq ``` Decompress files ``` gunzip *.gz #multiple ways to unzip, also gzip -d <file> or zcat <file.gz> >> <file> ``` Validate files again ``` ../../bin/validateFiles -type=fastq ./*.fastq ##class recording starts here module load bio/2.0 fastqc *.fastq cp /tmp/ND_ICG_SEP_08/NexteraPE-PE.txt . fastqc -a NexteraPE-PE.txt *.fastq # can also run with single fastq file by specifying name of file ``` To look at fastqc results, exit CRC/return to local machine, transfer fastqc output to local machine ``` scp <netid>@crcfe01.crc.nd.edu:ND_ICG_2020/data/*.html . # path after colon should direct to wherever your fastqc output is # period in last part of command indicates you want the files transferred # to your current working directory - this can be any directory path ``` Open html file in your favorite browser and examine results Connect back to CRC Move back to working directory for this lesson and do some cleanup ``` mkdir fastqc_untrimmed_reads mv untrimmed_fastq/*.html fastqc_untrimmed_reads mv untrimmed_fastq/*.zip fastqc_untrimmed_reads ``` ## Trimmomatic ``` module load bio/2.0 # working directory should be where data is mkdir trimmed_fastq trimmomatic PE untrimmed_fastq/SRR2589044_1.fastq.gz untrimmed_fastq/SRR2589044_2.fastq.gz \ trimmed_fastq/SRR2589044_1.trim.fastq.gz trimmed_fastq/SRR2589044_1.un.fastq.gz \ trimmed_fastq/SRR2589044_2.trim.fastq.gz trimmed_fastq/SRR2589044_2.un.fastq.gz \ SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15 ``` Create for loop to do all files at once (updated 9/10/20 at 3:30pm) ``` for infile in untrimmed_fastq/*_1.fastq.gz do base=$(basename ${infile} _1.fastq.gz \ trimmomatic PE ${infile} untrimmed_fastq/${base}_2.fastq.gz \ trimmed_fastq/${base}_1.trim.fastq.gz trimmed_fastq/${base}_1.un.trim.fastq.gz \ trimmed_fastq/${base}_2.trim.fastq.gz trimmed_fastq/${base}_2.un.trim.fastq.gz \ SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15 done mkdir ref_genome cd ref_genome cp /tmp/ND_ICG_SEP_08/ecoli_rel606.fasta . bwa index ecoli_rel606.fasta #contents of ref_genome when completed successfully: ls #ecoli_rel606.fasta ecoli_rel606.fasta.ann ecoli_rel606.fasta.pac #ecoli_rel606.fasta.amb ecoli_rel606.fasta.bwt ecoli_rel606.fasta.sa ``` Notation in above script: https://wiki.bash-hackers.org/syntax/expansion/cmdsubst Basename command: https://www.computerhope.com/unix/ubasenam.htm ### Class break - resuming on 9/15 ``` module load bio/2.0 mkdir -p results/sam_results/bam_results/vcf_results curl -L -o sub.tar.gz https://ndownloader.figshare.com/files/14418248 tar xvf sub.tar.gz mv sub trimmed_fastq_small cd ../ # how to run bwa mem (actual alignment) bwa mem ref_genome.fasta input_file_R1.fastq input_file_R2.fastq > output.sam # how I ran bwa mem bwa mem ref_genome/ecoli_rel606.fasta head SRR2584866.aligned.sam #.sam is readable on the command line, while .bam is not (written in binary) samtools view -S -b results/sam/SRR2584866.aligned.sam >results/bam/SRR2584866.aligned.bam samtools sort -o results/bam/SRR2584866.aligned.sorted.bam results/bam/SRR2584866.aligned.bam samtools flagstat results/bam/SRR2584866.aligned.sorted.bam ``` Now we can start calculating SNPS and finding their location ``` bcftools mpileup -0 b -o results/bcf/SRR2584866_raw_bcf -f data/ref_genome/ecoli_rel606.fasta results/bam/SRR2584866.aligned.sorted.bam bcftools call -ploidy 1 -m -v -o results/bcf/SRR2584866_variants.vcf results/bcf/SRR2584866_raw.bcf vcfutils.pl varFilter results/bcf/SRR2584866_variants.vcf > results/vcf/SRR2584866_final_variants.vcf less results/vcf/SRR2584866_final_variants.vcf #look at results, use 'q' to exit less and return to command line ``` #### Use IGV to view SNPs Open browser on local machine and log in to CRC at address crcfe01.crc.nd.edu (enter netid and password) Start fastx session Open terminal (in activites) ``` module load bio/2.0 igv ``` Use the IGV gui to: -Load our reference genome file (ecoli_rel606.fasta) into IGV using the “Load Genomes from File…“ option under the “Genomes” pull-down menu. -Load our BAM file (SRR2584866.aligned.sorted.bam) using the “Load from File…“ option under the “File” pull-down menu. -Do the same with our VCF file (SRR2584866_final_variants.vcf). Refer to the Data Carpentries tutorial for screenshots of what this should look like. You should be able to zoom in and out of the genome at interesting points (locations where there are large concentrations of SNPs) Continuing to new lesson - [Variant Calling in Julia](https://hackmd.io/@NFpEogXySTuWExLvDQQHig/HkT3OdCND) ## External Links Difference between FASTQ and FASTQ: https://compgenomr.github.io/book/fasta-and-fastq-formats.html (also description of this and ASCII symbols in the data carpentry link) Rsync: https://linux.die.net/man/1/rsync What is indexing a genome?: http://blog.avadis-ngs.com/2012/04/elegant-exact-string-match-using-bwt-2/ -underlying algorithm is Burrows-Wheeler Transformation