# Trimming and SNPs
9/8/20
---
https://datacarpentry.org/wrangling-genomics
:::info
This workshop is designed to run on AWS, but we're using it as applied to the CRC. You may see some instructions throughout that won't work on the CRC, particularly software. We have the same software installed on the CRC, but you need to run module load bio/2.0 to have access to them.
:::
on CRC
/tmp/ND_ICG_SEP_08/
copy files to new working directory data/
```
mkdir data
cd data
cp /tmp/ND_ICG_SEP_08/* .
mkdir untrimmed_fastq
mv SRR* untrimmed_fastq/
cd untrimmed_fastq/
cd ../../
mkdir bin
mv data/untrimmed_fastq/validateFiles bin/
./bin/validateFiles -type=fastq data/untrimmed_fastq/*
# Error count 0 <- output if successful
cd data/untrimmed_fastq
```
Decompress files
```
gunzip *.gz
#multiple ways to unzip, also gzip -d <file> or zcat <file.gz> >> <file>
```
Validate files again
```
../../bin/validateFiles -type=fastq ./*.fastq
##class recording starts here
module load bio/2.0
fastqc *.fastq
cp /tmp/ND_ICG_SEP_08/NexteraPE-PE.txt .
fastqc -a NexteraPE-PE.txt *.fastq # can also run with single fastq file by specifying name of file
```
To look at fastqc results, exit CRC/return to local machine, transfer fastqc output to local machine
```
scp <netid>@crcfe01.crc.nd.edu:ND_ICG_2020/data/*.html .
# path after colon should direct to wherever your fastqc output is
# period in last part of command indicates you want the files transferred
# to your current working directory - this can be any directory path
```
Open html file in your favorite browser and examine results
Connect back to CRC
Move back to working directory for this lesson and do some cleanup
```
mkdir fastqc_untrimmed_reads
mv untrimmed_fastq/*.html fastqc_untrimmed_reads
mv untrimmed_fastq/*.zip fastqc_untrimmed_reads
```
## Trimmomatic
```
module load bio/2.0
# working directory should be where data is
mkdir trimmed_fastq
trimmomatic PE untrimmed_fastq/SRR2589044_1.fastq.gz untrimmed_fastq/SRR2589044_2.fastq.gz \
trimmed_fastq/SRR2589044_1.trim.fastq.gz trimmed_fastq/SRR2589044_1.un.fastq.gz \
trimmed_fastq/SRR2589044_2.trim.fastq.gz trimmed_fastq/SRR2589044_2.un.fastq.gz \
SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15
```
Create for loop to do all files at once
(updated 9/10/20 at 3:30pm)
```
for infile in untrimmed_fastq/*_1.fastq.gz
do
base=$(basename ${infile} _1.fastq.gz \
trimmomatic PE ${infile} untrimmed_fastq/${base}_2.fastq.gz \
trimmed_fastq/${base}_1.trim.fastq.gz trimmed_fastq/${base}_1.un.trim.fastq.gz \
trimmed_fastq/${base}_2.trim.fastq.gz trimmed_fastq/${base}_2.un.trim.fastq.gz \
SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15
done
mkdir ref_genome
cd ref_genome
cp /tmp/ND_ICG_SEP_08/ecoli_rel606.fasta .
bwa index ecoli_rel606.fasta
#contents of ref_genome when completed successfully:
ls
#ecoli_rel606.fasta ecoli_rel606.fasta.ann ecoli_rel606.fasta.pac
#ecoli_rel606.fasta.amb ecoli_rel606.fasta.bwt ecoli_rel606.fasta.sa
```
Notation in above script: https://wiki.bash-hackers.org/syntax/expansion/cmdsubst
Basename command: https://www.computerhope.com/unix/ubasenam.htm
### Class break - resuming on 9/15
```
module load bio/2.0
mkdir -p results/sam_results/bam_results/vcf_results
curl -L -o sub.tar.gz https://ndownloader.figshare.com/files/14418248
tar xvf sub.tar.gz
mv sub trimmed_fastq_small
cd ../
# how to run bwa mem (actual alignment)
bwa mem ref_genome.fasta input_file_R1.fastq input_file_R2.fastq > output.sam
# how I ran bwa mem
bwa mem ref_genome/ecoli_rel606.fasta
head SRR2584866.aligned.sam #.sam is readable on the command line, while .bam is not (written in binary)
samtools view -S -b results/sam/SRR2584866.aligned.sam >results/bam/SRR2584866.aligned.bam
samtools sort -o results/bam/SRR2584866.aligned.sorted.bam results/bam/SRR2584866.aligned.bam
samtools flagstat results/bam/SRR2584866.aligned.sorted.bam
```
Now we can start calculating SNPS and finding their location
```
bcftools mpileup -0 b -o results/bcf/SRR2584866_raw_bcf -f data/ref_genome/ecoli_rel606.fasta results/bam/SRR2584866.aligned.sorted.bam
bcftools call -ploidy 1 -m -v -o results/bcf/SRR2584866_variants.vcf results/bcf/SRR2584866_raw.bcf
vcfutils.pl varFilter results/bcf/SRR2584866_variants.vcf > results/vcf/SRR2584866_final_variants.vcf
less results/vcf/SRR2584866_final_variants.vcf #look at results, use 'q' to exit less and return to command line
```
#### Use IGV to view SNPs
Open browser on local machine and log in to CRC at address crcfe01.crc.nd.edu (enter netid and password)
Start fastx session
Open terminal (in activites)
```
module load bio/2.0
igv
```
Use the IGV gui to:
-Load our reference genome file (ecoli_rel606.fasta) into IGV using the “Load Genomes from File…“ option under the “Genomes” pull-down menu.
-Load our BAM file (SRR2584866.aligned.sorted.bam) using the “Load from File…“ option under the “File” pull-down menu.
-Do the same with our VCF file (SRR2584866_final_variants.vcf).
Refer to the Data Carpentries tutorial for screenshots of what this should look like. You should be able to zoom in and out of the genome at interesting points (locations where there are large concentrations of SNPs)
Continuing to new lesson - [Variant Calling in Julia](https://hackmd.io/@NFpEogXySTuWExLvDQQHig/HkT3OdCND)
## External Links
Difference between FASTQ and FASTQ: https://compgenomr.github.io/book/fasta-and-fastq-formats.html
(also description of this and ASCII symbols in the data carpentry link)
Rsync: https://linux.die.net/man/1/rsync
What is indexing a genome?: http://blog.avadis-ngs.com/2012/04/elegant-exact-string-match-using-bwt-2/
-underlying algorithm is Burrows-Wheeler Transformation