# Trimming, mapping and variant calling from horse FastQ files
Sam's quickstart messy guide to trimming, mapping, and variant calling in the horse🤯

# Starting out
Login to Farm:
```
ssh -i /Users/Samantha/.ssh/id_rsa_farm vanburen@farm.hpc.ucdavis.edu
```
Allocate resources:
```
srun -p high2 --time=2:00:00 -A ctbrowngrp --nodes=1 --cpus-per-task 1 --mem 5GB --pty /bin/bash
```
Go to the directory where the files are:
```
cd /group/ctbrowngrp/finnolab_shared/fastq/FAANG/WGS
```
Copy them over to my directory:
```
cp AH1_R1.fastq.gz AH1_R2.fastq.gz AH2_R1.fastq.gz AH2_R2.fastq.gz AH3_R1.fastq.gz AH3_R2.fastq.gz AH4_R1.fastq.gz AH4_R2.fastq.gz /group/ctbrowngrp3/vanburen/horse/fastq
```
# First step, FastQC
Install FastQC: (in this location ```/group/ctbrowngrp3/vanburen/horse/fastq```)
```
module load mamba
mamba install fastqc
mamba create --name fastqcenvironment fastqc
mamba activate fastqcenvironment
```
Unzip them:
###### (in the future: fasta files I DO unzip, fastq files I shouldn't unzip). Fasta files are just the sequence and fastq contains quality information etc.
```
gunzip *.fastq.gz
```
Can start tmux environment if leaving the terminal
```
tmux new -s fastq
```
- can detach from tmux with cntl + b, then d. Can reattach with ```tmux attach```
Running FastQC on my Fastq files (from my folder with the files):
```
fastqc *.fastq
```
Check the files on rstudio server through on demand: https://ondemand.farm.hpc.ucdavis.edu/pun/sys/dashboard/
# Next step, trimming reads
Create trimming session:
```
tmux new -s trimming
```
- make sure to give yourself srun resources IN the tmux session. This script takes long time to run so I gave myself:
- ```srun -p high2 --time=48:00:00 -A ctbrowngrp --nodes=1 --cpus-per-task 2 --mem 20GB --pty /bin/bash```
Install a trimming program in a trimming environment:
```
module load mamba
mamba install flexbar
mamba create --name flexbarenv flexbar
mamba activate flexbarenv
```
Creating the trimming script:
```
touch trimscript
vim trimscript
```
- shift + i allows you to type in the script. To exit it, press 'esc' followed by ':' and type 'wq'
FastQC revealed that these adapter sequences are in it: Illumina Universal Adapter, Illumina Small RNA 3' and 5' Adapters, Nextera Transposase Sequence, Poly-A, and Poly-G
So I am making a custom file with them:
```
cat > custom_adapters.fa <<EOF
>Illumina Universal Adapter
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
>Illumina Small RNA 3' Adapter
TGGAATTCTCGGGTGCCAAGG
>Illumina Small RNA 5' Adapter
GTTCAGAGTTCTACAGTCCGACGATC
>Nextera Transposase Sequence
CTGTCTCTTATACACATCT
>Poly-A
AAAAAAAAAAAAAAAAAAAA
>Poly-G
GGGGGGGGGGGGGGGGGGGG
EOF
```
Trimming parameters, in the script called trimscript
```
#!/bin/bash
set -e
set -x
# Specify the directory containing the FASTQ files
fastq_dir="/home/vanburen/local/horse/fastq"
# Specify the path to the adapter sequences file
adapters="/home/vanburen/local/horse/fastq/custom_adapters.fa"
# Navigate to the directory containing the FASTQ files
cd "$fastq_dir"
# Loop through each R1 file and run Flexbar
for r1_file in *_R1.fastq; do
# Derive the sample prefix (e.g., AH1 from AH1_R1.fastq)
sample_prefix="${r1_file%_R1.fastq}"
# Identify the corresponding R2 file
r2_file="${sample_prefix}_R2.fastq"
# Check if R2 file exists
if [[ -f "$r2_file" ]]; then
# Run Flexbar for the pair
flexbar --reads "$r1_file" --reads2 "$r2_file" \
--zip-output GZ -t "${sample_prefix}_trimmed" \
-a "$adapters" -n 10 --min-read-length 36 \
-q TAIL -qf i1.8 -qt 20
else
echo "Matching R2 file for $r1_file not found."
fi
done
```
Give myself rights to the script: ```chmod 755 trimscript.sh```
Running the command:```bash trimscript.sh```
- Can also capture the output with ```bash trimscript.sh 2> out.txt``` and then followed by ```grep ^+ out.txt```. If you are getting errors, you can do ```bash trimscript.sh > out.txt 2> err.txt```
- note, that Flexbar automatically outputs a log file, so this isn't necessary for this command, however for some scripts it may be if you want an output log.
Can detach from tmux with cntl + b, then d. Can reattach with ```tmux attach``` or ```tmux attach -t trimming```
Alterinatively, if not leaving Farm, can be run as: ```nohup ./trimscript.sh &```
- Leave (but keep running) the script with control + c, view progress on the job with ```jobs```
After trimming, you want to FastQC the trimmed reads. Make sure to go back to fastqc environment with ```module load mamba``` and ```mamba activate fastqcenvironment```.
```
fastqc *trim.fq.gz
```
Again, can check those html images in Rstudio server or OnDemand.
# Next step, mapping!
Once each horse trimming is complete (since it is completing one at a time), I am making tmux environments for each horse and individually running this command for each.
Ie for AH1, ```tmux new -s mapping1```
Making sure to allocate myself resources within the tmux session:
```
srun -p high2 --time=99:00:00 -A ctbrowngrp --nodes=1 --cpus-per-task 5 --mem 25GB --pty /bin/bash
```
#### Create/activate the mapping environment:
```
module load mamba
mamba install -c bioconda bwa
mamba create --name mapenv bwa
mamba activate mapenv
```
Add Samtools:
```
mamba activate mapenv
mamba install -c bioconda samtools
```
**Now running the mapping, this is the command for each horse run in the terminal:**
```
bwa mem /group/ctbrowngrp/finnolab_shared/ref/UCSC/equCab3.fa /group/ctbrowngrp3/vanburen/horse/fastq/AH1/AH1_trimmed_1.fastq.gz /group/ctbrowngrp3/vanburen/horse/fastq/AH1/AH1_trimmed_2.fastq.gz | samtools view -o AH1_trimmed.bam
```
```
bwa mem /group/ctbrowngrp/finnolab_shared/ref/UCSC/equCab3.fa /group/ctbrowngrp3/vanburen/horse/fastq/AH2/AH2_trimmed_1.fastq.gz /group/ctbrowngrp3/vanburen/horse/fastq/AH2/AH2_trimmed_2.fastq.gz | samtools view -o AH2_trimmed.bam
```
```
bwa mem /group/ctbrowngrp/finnolab_shared/ref/UCSC/equCab3.fa /group/ctbrowngrp3/vanburen/horse/fastq/AH3/AH3_trimmed_1.fastq.gz /group/ctbrowngrp3/vanburen/horse/fastq/AH3/AH3_trimmed_2.fastq.gz | samtools view -o AH3_trimmed.bam
```
```
bwa mem /group/ctbrowngrp/finnolab_shared/ref/UCSC/equCab3.fa /group/ctbrowngrp3/vanburen/horse/fastq/AH4/AH4_trimmed_1.fastq.gz /group/ctbrowngrp3/vanburen/horse/fastq/AH4/AH4_trimmed_2.fastq.gz | samtools view -o AH4_trimmed.bam
```
***Be prepared to wait, mapping takes a long ass time***
Exit tmux sessions with control b + d. Reattach with, example, ```tmux attach -t mapping1```
- once complete, can stop the tmux session with ```tmux kill-session -t session_name```
# Next are the read adjustment steps!
### First, you will be sorting the reads:
Create a tmux session
```
tmux new -s reads
```
Allocate resources
```
srun -p high2 --time=5:00:00 -A ctbrowngrp --nodes=1 --cpus-per-task 2 --mem 20GB --pty /bin/bash
```
Create a reads enviornment with samtools and picard tools:
```
module load mamba
mamba install samtools
mamba create --name readsenv samtools
mamba activate readsenv
```
```
mamba activate readsenv
mamba install -c bioconda picard
```
Sort the reads for AH1:
```
samtools sort -o AH1_sorted.bam AH1_trimmed.bam
```
#### Create another tmux session and do it for your following horses, ie, for AH2:
```
tmux new -s reads2
```
```
srun -p high2 --time=5:00:00 -A ctbrowngrp --nodes=1 --cpus-per-task 2 --mem 20GB --pty /bin/bash
```
```
module load mamba
mamba activate readsenv
```
```
samtools sort -o AH2_sorted.bam AH2_trimmed.bam
```
### After sorting reads, you will need to index the bam file
Create a tmux environment with resource space and a samtools package:
```
tmux new -s index
```
```
srun -p high2 --time=2:00:00 -A ctbrowngrp --nodes=1 --cpus-per-task 1 --mem 5GB --pty /bin/bash
```
```
module load mamba
mamba activate readsenv
```
Now index the sorted bams: (takes about 10 min for each)
```
samtools index AH1_sorted.bam
```
```
samtools index AH2_sorted.bam
```
```
samtools index AH3_sorted.bam
```
```
samtools index AH4_sorted.bam
```
### Following sort and index, you need to add read group using picard tools:
Set up your tmux, space, and environment:
```
tmux new -s AH1
```
```
srun -p high2 --time=4:00:00 -A ctbrowngrp --nodes=1 --cpus-per-task 2 --mem 15GB --pty /bin/bash
```
```
module load mamba
mamba activate readsenv
```
The code that will add read groups:
```
picard AddOrReplaceReadGroups \
I=AH1_sorted.bam \
O=AH1_withRG.bam \
RGID=AH1 \
RGLB=library1 \
RGPL=ILLUMINA \
RGPU=unit1 \
RGSM=AH1
```
- **Repeat these steps above with your other samples, making sure to replace with the correct name.**
### Now you need to mark duplicates and index the reads:
Using picard tools:
```
tmux attach -t AH1
```
```
srun -p high2 --time=4:00:00 -A ctbrowngrp --nodes=1 --cpus-per-task 2 --mem 15GB --pty /bin/bash
```
```
module load mamba
mamba activate readsenv
```
Mark duplicates:
```
picard MarkDuplicates \
INPUT=AH1_withRG.bam \
OUTPUT=AH1_withMD.bam \
METRICS_FILE=metrics.txt \
REMOVE_DUPLICATES=false
```
Index:
```
samtools index /group/ctbrowngrp3/vanburen/horse/fastq/AH1/AH1_withMD.bam
```
Do the same steps for the other horses by adjusting the sample names.
### Following, you need to do BaseRecalibrator
First, need to download the variants from https://ftp.ensembl.org/pub/current_variation/vcf/equus_caballus/ to ```/home/vanburen/local/horse/variants```
```
wget https://ftp.ensembl.org/pub/current_variation/vcf/equus_caballus/equus_caballus.vcf.gz
# unzip it
gunzip equus_caballus.vcf.gz
```
```
wget https://ftp.ensembl.org/pub/current_variation/vcf/equus_caballus/equus_caballus.vcf.gz.csi
```
Need to change the vcf file to match the naming convention of the ref genome:
```
sed 's/^1\>/chr&/' equus_caballus.vcf > equus_caballus_renamed.vcf
```
Go to your tmux:
```
tmux attach -t AH1
```
Give yourself resources:
```
srun -p high2 --time=8:00:00 -A ctbrowngrp --nodes=1 --cpus-per-task 4 --mem 20GB --pty /bin/bash
```
Now, activate environment with gatk:
```
module load mamba
mamba install -c bioconda gatk4
mamba create -n gatkenv gatk4
mamba activate gatkenv
```
Copy over your genome:
```
cp /home/pengsc/reference/equcab3/UCSC/equCab3.fa /group/ctbrowngrp3/vanburen/horse/variants/
cp /home/pengsc/reference/equcab3/UCSC/equCab3.fa.fai /group/ctbrowngrp3/vanburen/horse/variants/
```
Generate your index files and dict:
```
mamba install htslib -c bioconda
tabix -p vcf equus_caballus.vcf.gz
```
```
gatk IndexFeatureFile -I /home/vanburen/local/horse/variants/equus_caballus_renamed.vcf
```
```
gatk CreateSequenceDictionary -R /home/pengsc/reference/equcab3/UCSC/equCab3.fa -O /group/ctbrowngrp3/vanburen/horse/variants/equCab3.dict
```
Generate a recalibration table for your BAM file:
```
gatk --java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true' BaseRecalibrator \
-I /group/ctbrowngrp3/vanburen/horse/fastq/AH1/AH1_withMD.bam \
-R /group/ctbrowngrp3/vanburen/horse/variants/equCab3.fa \
--known-sites /home/vanburen/local/horse/variants/equus_caballus_renamed.vcf \
-O /group/ctbrowngrp3/vanburen/horse/fastq/AH1/AH1_recal_table.table
```
Then apply BQSR:
```
gatk --java-options "-Xmx20g -XX:ParallelGCThreads=4" ApplyBQSR \
-R /group/ctbrowngrp3/vanburen/horse/variants/equCab3.fa \
-I /group/ctbrowngrp3/vanburen/horse/fastq/AH1/AH1_withMD.bam \
--bqsr-recal-file /group/ctbrowngrp3/vanburen/horse/fastq/AH1/AH1_recal_table.table \
-O /group/ctbrowngrp3/vanburen/horse/fastq/AH1/AH1_GATK.bam
```
And lastly,
```
gatk --java-options "-Xmx20g -XX:ParallelGCThreads=4" HaplotypeCaller --native-pair-hmm-threads 4 -R /group/ctbrowngrp3/vanburen/horse/variants/equCab3.fa -I AH1_GATK.bam -O AH1_GATK.g.vcf.gz -ERC GVCF -ploidy 2
```
- Allocate at least 60 hours for this one
- *Adjust for your other sample names.*
# Last step, variant calling!
#### Set up your tmux, environment, and resources:
Create tmux:
```
tmux new -s variantcalling
```
Give yourself resources:
```
srun -p high2 --time=30:00:00 -A ctbrowngrp --nodes=1 --cpus-per-task 4 --mem 25GB --pty /bin/bash
```
Activate environment:
```
module load mamba
mamba activate gatkenv
```
Use the fasta file index to generate your interval:
```
awk '{print $1":"1"-"$2}' /home/pengsc/reference/equcab3/UCSC/equCab3.fa.fai > whole_genome.intervals
```
I then removed unnamed chr in a new file called ```wholegenome.intervals```
We are now variant calling all of our samples together,
```
gatk --java-options "-Xmx20g -Xms20g -XX:ParallelGCThreads=4" GenomicsDBImport \
--genomicsdb-shared-posixfs-optimizations true \
-V /home/vanburen/local/horse/fastq/AH1/AH1_GATK.g.vcf.gz \
-V /home/vanburen/local/horse/fastq/AH2/AH2_GATK.g.vcf.gz \
--genomicsdb-workspace-path variantsAH \
--reference /home/pengsc/reference/equcab3/UCSC/equCab3.fa \
--tmp-dir /home/vanburen/local/dupa/ \
-L wholegenome.intervals
```
After that is complete,
```
gatk --java-options "-Xmx20g -Xms20g -XX:ParallelGCThreads=4" GenotypeGVCFs \
-R /home/pengsc/reference/equcab3/UCSC/equCab3.fa \
-V gendb://variantsAH \
-O AHvariants.vcf.gz \
--tmp-dir /home/vanburen/local/dupa2/
```
The final output of my variant calling is: ```/home/vanburen/local/horse/fastq/variantsAH/AHvariants.vcf.gz```
Variants can be viewed with: ```zcat AHvariants.vcf.gz | less```
**Whoop whoop! Now you have completed calling variants in your samples!**