# Trimming, mapping and variant calling from horse FastQ files Sam's quickstart messy guide to trimming, mapping, and variant calling in the horse🤯 ![Screen Shot 2024-03-03 at 2.49.24 PM](https://hackmd.io/_uploads/ByStAdz66.png) # Starting out Login to Farm: ``` ssh -i /Users/Samantha/.ssh/id_rsa_farm vanburen@farm.hpc.ucdavis.edu ``` Allocate resources: ``` srun -p high2 --time=2:00:00 -A ctbrowngrp --nodes=1 --cpus-per-task 1 --mem 5GB --pty /bin/bash ``` Go to the directory where the files are: ``` cd /group/ctbrowngrp/finnolab_shared/fastq/FAANG/WGS ``` Copy them over to my directory: ``` cp AH1_R1.fastq.gz AH1_R2.fastq.gz AH2_R1.fastq.gz AH2_R2.fastq.gz AH3_R1.fastq.gz AH3_R2.fastq.gz AH4_R1.fastq.gz AH4_R2.fastq.gz /group/ctbrowngrp3/vanburen/horse/fastq ``` # First step, FastQC Install FastQC: (in this location ```/group/ctbrowngrp3/vanburen/horse/fastq```) ``` module load mamba mamba install fastqc mamba create --name fastqcenvironment fastqc mamba activate fastqcenvironment ``` Unzip them: ###### (in the future: fasta files I DO unzip, fastq files I shouldn't unzip). Fasta files are just the sequence and fastq contains quality information etc. ``` gunzip *.fastq.gz ``` Can start tmux environment if leaving the terminal ``` tmux new -s fastq ``` - can detach from tmux with cntl + b, then d. Can reattach with ```tmux attach``` Running FastQC on my Fastq files (from my folder with the files): ``` fastqc *.fastq ``` Check the files on rstudio server through on demand: https://ondemand.farm.hpc.ucdavis.edu/pun/sys/dashboard/ # Next step, trimming reads Create trimming session: ``` tmux new -s trimming ``` - make sure to give yourself srun resources IN the tmux session. This script takes long time to run so I gave myself: - ```srun -p high2 --time=48:00:00 -A ctbrowngrp --nodes=1 --cpus-per-task 2 --mem 20GB --pty /bin/bash``` Install a trimming program in a trimming environment: ``` module load mamba mamba install flexbar mamba create --name flexbarenv flexbar mamba activate flexbarenv ``` Creating the trimming script: ``` touch trimscript vim trimscript ``` - shift + i allows you to type in the script. To exit it, press 'esc' followed by ':' and type 'wq' FastQC revealed that these adapter sequences are in it: Illumina Universal Adapter, Illumina Small RNA 3' and 5' Adapters, Nextera Transposase Sequence, Poly-A, and Poly-G So I am making a custom file with them: ``` cat > custom_adapters.fa <<EOF >Illumina Universal Adapter AGATCGGAAGAGCACACGTCTGAACTCCAGTCA >Illumina Small RNA 3' Adapter TGGAATTCTCGGGTGCCAAGG >Illumina Small RNA 5' Adapter GTTCAGAGTTCTACAGTCCGACGATC >Nextera Transposase Sequence CTGTCTCTTATACACATCT >Poly-A AAAAAAAAAAAAAAAAAAAA >Poly-G GGGGGGGGGGGGGGGGGGGG EOF ``` Trimming parameters, in the script called trimscript ``` #!/bin/bash set -e set -x # Specify the directory containing the FASTQ files fastq_dir="/home/vanburen/local/horse/fastq" # Specify the path to the adapter sequences file adapters="/home/vanburen/local/horse/fastq/custom_adapters.fa" # Navigate to the directory containing the FASTQ files cd "$fastq_dir" # Loop through each R1 file and run Flexbar for r1_file in *_R1.fastq; do # Derive the sample prefix (e.g., AH1 from AH1_R1.fastq) sample_prefix="${r1_file%_R1.fastq}" # Identify the corresponding R2 file r2_file="${sample_prefix}_R2.fastq" # Check if R2 file exists if [[ -f "$r2_file" ]]; then # Run Flexbar for the pair flexbar --reads "$r1_file" --reads2 "$r2_file" \ --zip-output GZ -t "${sample_prefix}_trimmed" \ -a "$adapters" -n 10 --min-read-length 36 \ -q TAIL -qf i1.8 -qt 20 else echo "Matching R2 file for $r1_file not found." fi done ``` Give myself rights to the script: ```chmod 755 trimscript.sh``` Running the command:```bash trimscript.sh``` - Can also capture the output with ```bash trimscript.sh 2> out.txt``` and then followed by ```grep ^+ out.txt```. If you are getting errors, you can do ```bash trimscript.sh > out.txt 2> err.txt``` - note, that Flexbar automatically outputs a log file, so this isn't necessary for this command, however for some scripts it may be if you want an output log. Can detach from tmux with cntl + b, then d. Can reattach with ```tmux attach``` or ```tmux attach -t trimming``` Alterinatively, if not leaving Farm, can be run as: ```nohup ./trimscript.sh &``` - Leave (but keep running) the script with control + c, view progress on the job with ```jobs``` After trimming, you want to FastQC the trimmed reads. Make sure to go back to fastqc environment with ```module load mamba``` and ```mamba activate fastqcenvironment```. ``` fastqc *trim.fq.gz ``` Again, can check those html images in Rstudio server or OnDemand. # Next step, mapping! Once each horse trimming is complete (since it is completing one at a time), I am making tmux environments for each horse and individually running this command for each. Ie for AH1, ```tmux new -s mapping1``` Making sure to allocate myself resources within the tmux session: ``` srun -p high2 --time=99:00:00 -A ctbrowngrp --nodes=1 --cpus-per-task 5 --mem 25GB --pty /bin/bash ``` #### Create/activate the mapping environment: ``` module load mamba mamba install -c bioconda bwa mamba create --name mapenv bwa mamba activate mapenv ``` Add Samtools: ``` mamba activate mapenv mamba install -c bioconda samtools ``` **Now running the mapping, this is the command for each horse run in the terminal:** ``` bwa mem /group/ctbrowngrp/finnolab_shared/ref/UCSC/equCab3.fa /group/ctbrowngrp3/vanburen/horse/fastq/AH1/AH1_trimmed_1.fastq.gz /group/ctbrowngrp3/vanburen/horse/fastq/AH1/AH1_trimmed_2.fastq.gz | samtools view -o AH1_trimmed.bam ``` ``` bwa mem /group/ctbrowngrp/finnolab_shared/ref/UCSC/equCab3.fa /group/ctbrowngrp3/vanburen/horse/fastq/AH2/AH2_trimmed_1.fastq.gz /group/ctbrowngrp3/vanburen/horse/fastq/AH2/AH2_trimmed_2.fastq.gz | samtools view -o AH2_trimmed.bam ``` ``` bwa mem /group/ctbrowngrp/finnolab_shared/ref/UCSC/equCab3.fa /group/ctbrowngrp3/vanburen/horse/fastq/AH3/AH3_trimmed_1.fastq.gz /group/ctbrowngrp3/vanburen/horse/fastq/AH3/AH3_trimmed_2.fastq.gz | samtools view -o AH3_trimmed.bam ``` ``` bwa mem /group/ctbrowngrp/finnolab_shared/ref/UCSC/equCab3.fa /group/ctbrowngrp3/vanburen/horse/fastq/AH4/AH4_trimmed_1.fastq.gz /group/ctbrowngrp3/vanburen/horse/fastq/AH4/AH4_trimmed_2.fastq.gz | samtools view -o AH4_trimmed.bam ``` ***Be prepared to wait, mapping takes a long ass time*** Exit tmux sessions with control b + d. Reattach with, example, ```tmux attach -t mapping1``` - once complete, can stop the tmux session with ```tmux kill-session -t session_name``` # Next are the read adjustment steps! ### First, you will be sorting the reads: Create a tmux session ``` tmux new -s reads ``` Allocate resources ``` srun -p high2 --time=5:00:00 -A ctbrowngrp --nodes=1 --cpus-per-task 2 --mem 20GB --pty /bin/bash ``` Create a reads enviornment with samtools and picard tools: ``` module load mamba mamba install samtools mamba create --name readsenv samtools mamba activate readsenv ``` ``` mamba activate readsenv mamba install -c bioconda picard ``` Sort the reads for AH1: ``` samtools sort -o AH1_sorted.bam AH1_trimmed.bam ``` #### Create another tmux session and do it for your following horses, ie, for AH2: ``` tmux new -s reads2 ``` ``` srun -p high2 --time=5:00:00 -A ctbrowngrp --nodes=1 --cpus-per-task 2 --mem 20GB --pty /bin/bash ``` ``` module load mamba mamba activate readsenv ``` ``` samtools sort -o AH2_sorted.bam AH2_trimmed.bam ``` ### After sorting reads, you will need to index the bam file Create a tmux environment with resource space and a samtools package: ``` tmux new -s index ``` ``` srun -p high2 --time=2:00:00 -A ctbrowngrp --nodes=1 --cpus-per-task 1 --mem 5GB --pty /bin/bash ``` ``` module load mamba mamba activate readsenv ``` Now index the sorted bams: (takes about 10 min for each) ``` samtools index AH1_sorted.bam ``` ``` samtools index AH2_sorted.bam ``` ``` samtools index AH3_sorted.bam ``` ``` samtools index AH4_sorted.bam ``` ### Following sort and index, you need to add read group using picard tools: Set up your tmux, space, and environment: ``` tmux new -s AH1 ``` ``` srun -p high2 --time=4:00:00 -A ctbrowngrp --nodes=1 --cpus-per-task 2 --mem 15GB --pty /bin/bash ``` ``` module load mamba mamba activate readsenv ``` The code that will add read groups: ``` picard AddOrReplaceReadGroups \ I=AH1_sorted.bam \ O=AH1_withRG.bam \ RGID=AH1 \ RGLB=library1 \ RGPL=ILLUMINA \ RGPU=unit1 \ RGSM=AH1 ``` - **Repeat these steps above with your other samples, making sure to replace with the correct name.** ### Now you need to mark duplicates and index the reads: Using picard tools: ``` tmux attach -t AH1 ``` ``` srun -p high2 --time=4:00:00 -A ctbrowngrp --nodes=1 --cpus-per-task 2 --mem 15GB --pty /bin/bash ``` ``` module load mamba mamba activate readsenv ``` Mark duplicates: ``` picard MarkDuplicates \ INPUT=AH1_withRG.bam \ OUTPUT=AH1_withMD.bam \ METRICS_FILE=metrics.txt \ REMOVE_DUPLICATES=false ``` Index: ``` samtools index /group/ctbrowngrp3/vanburen/horse/fastq/AH1/AH1_withMD.bam ``` Do the same steps for the other horses by adjusting the sample names. ### Following, you need to do BaseRecalibrator First, need to download the variants from https://ftp.ensembl.org/pub/current_variation/vcf/equus_caballus/ to ```/home/vanburen/local/horse/variants``` ``` wget https://ftp.ensembl.org/pub/current_variation/vcf/equus_caballus/equus_caballus.vcf.gz # unzip it gunzip equus_caballus.vcf.gz ``` ``` wget https://ftp.ensembl.org/pub/current_variation/vcf/equus_caballus/equus_caballus.vcf.gz.csi ``` Need to change the vcf file to match the naming convention of the ref genome: ``` sed 's/^1\>/chr&/' equus_caballus.vcf > equus_caballus_renamed.vcf ``` Go to your tmux: ``` tmux attach -t AH1 ``` Give yourself resources: ``` srun -p high2 --time=8:00:00 -A ctbrowngrp --nodes=1 --cpus-per-task 4 --mem 20GB --pty /bin/bash ``` Now, activate environment with gatk: ``` module load mamba mamba install -c bioconda gatk4 mamba create -n gatkenv gatk4 mamba activate gatkenv ``` Copy over your genome: ``` cp /home/pengsc/reference/equcab3/UCSC/equCab3.fa /group/ctbrowngrp3/vanburen/horse/variants/ cp /home/pengsc/reference/equcab3/UCSC/equCab3.fa.fai /group/ctbrowngrp3/vanburen/horse/variants/ ``` Generate your index files and dict: ``` mamba install htslib -c bioconda tabix -p vcf equus_caballus.vcf.gz ``` ``` gatk IndexFeatureFile -I /home/vanburen/local/horse/variants/equus_caballus_renamed.vcf ``` ``` gatk CreateSequenceDictionary -R /home/pengsc/reference/equcab3/UCSC/equCab3.fa -O /group/ctbrowngrp3/vanburen/horse/variants/equCab3.dict ``` Generate a recalibration table for your BAM file: ``` gatk --java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true' BaseRecalibrator \ -I /group/ctbrowngrp3/vanburen/horse/fastq/AH1/AH1_withMD.bam \ -R /group/ctbrowngrp3/vanburen/horse/variants/equCab3.fa \ --known-sites /home/vanburen/local/horse/variants/equus_caballus_renamed.vcf \ -O /group/ctbrowngrp3/vanburen/horse/fastq/AH1/AH1_recal_table.table ``` Then apply BQSR: ``` gatk --java-options "-Xmx20g -XX:ParallelGCThreads=4" ApplyBQSR \ -R /group/ctbrowngrp3/vanburen/horse/variants/equCab3.fa \ -I /group/ctbrowngrp3/vanburen/horse/fastq/AH1/AH1_withMD.bam \ --bqsr-recal-file /group/ctbrowngrp3/vanburen/horse/fastq/AH1/AH1_recal_table.table \ -O /group/ctbrowngrp3/vanburen/horse/fastq/AH1/AH1_GATK.bam ``` And lastly, ``` gatk --java-options "-Xmx20g -XX:ParallelGCThreads=4" HaplotypeCaller --native-pair-hmm-threads 4 -R /group/ctbrowngrp3/vanburen/horse/variants/equCab3.fa -I AH1_GATK.bam -O AH1_GATK.g.vcf.gz -ERC GVCF -ploidy 2 ``` - Allocate at least 60 hours for this one - *Adjust for your other sample names.* # Last step, variant calling! #### Set up your tmux, environment, and resources: Create tmux: ``` tmux new -s variantcalling ``` Give yourself resources: ``` srun -p high2 --time=30:00:00 -A ctbrowngrp --nodes=1 --cpus-per-task 4 --mem 25GB --pty /bin/bash ``` Activate environment: ``` module load mamba mamba activate gatkenv ``` Use the fasta file index to generate your interval: ``` awk '{print $1":"1"-"$2}' /home/pengsc/reference/equcab3/UCSC/equCab3.fa.fai > whole_genome.intervals ``` I then removed unnamed chr in a new file called ```wholegenome.intervals``` We are now variant calling all of our samples together, ``` gatk --java-options "-Xmx20g -Xms20g -XX:ParallelGCThreads=4" GenomicsDBImport \ --genomicsdb-shared-posixfs-optimizations true \ -V /home/vanburen/local/horse/fastq/AH1/AH1_GATK.g.vcf.gz \ -V /home/vanburen/local/horse/fastq/AH2/AH2_GATK.g.vcf.gz \ --genomicsdb-workspace-path variantsAH \ --reference /home/pengsc/reference/equcab3/UCSC/equCab3.fa \ --tmp-dir /home/vanburen/local/dupa/ \ -L wholegenome.intervals ``` After that is complete, ``` gatk --java-options "-Xmx20g -Xms20g -XX:ParallelGCThreads=4" GenotypeGVCFs \ -R /home/pengsc/reference/equcab3/UCSC/equCab3.fa \ -V gendb://variantsAH \ -O AHvariants.vcf.gz \ --tmp-dir /home/vanburen/local/dupa2/ ``` The final output of my variant calling is: ```/home/vanburen/local/horse/fastq/variantsAH/AHvariants.vcf.gz``` Variants can be viewed with: ```zcat AHvariants.vcf.gz | less``` **Whoop whoop! Now you have completed calling variants in your samples!**