---
title: "Brendan's DENU pt. 2 - Amplicon Processing"
breaks: FALSE
tags: RADSeq
---
Author: Brendan Connolly (p31939)
---
# Amplicon analysis - SNPS
Steps:
1.) FastQC
2.) Illumina adapter trimming and filtering out reads <30 bases long (Trimmomatic)
3.) Alignment with BWA-MEM, mapping to the ddRAD FASTA file (see minimap2 for long-read alignment)
4.) Primer trimming (Primerclip or TrimPrimers)
5.) On-target and coverage uniformity calculation using Picard tools
6.) Variant calling of your choice (GATK HaplotypeCaller for germline calls; LoFreq, Mutect, or VarDict for somatic or low-frequency variant calls)
This results in a VCF file.
- Establish the premise that genetic distance between moms tracks relatedness; closer distance = more closely related
- How inbred are individuals?
- Delete mom's DNA from the data set, leaving just Dad's, so we can work out pollen diversity
Goals:
1.) Relatedness
2.) Pollen diversity
3.) Extent of homozygosity (Ho)
Conduct t-tests between pollinator groups.
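Step 1 (FastQC) isn't shown elsewhere on this page; a minimal sketch, assuming FastQC is available in a conda environment (e.g. `conda install bioconda::fastqc`):
```bash=
#Run FastQC on all raw reads and collect the reports in one folder
mkdir -p fastqc_out
fastqc -t 4 -o fastqc_out *.fastq.gz
```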
## 2.) Trimmomatic
[Trimmomatic GitHub](https://github.com/usadellab/Trimmomatic/blob/main/README.md)
Most files coming straight from the sequencing facility have a few extra characters in their names. Clean up file names that look like samplename_S79_L001_R1_001.fastq.gz:
```bash=
#to remove the _S79 tag:
for filename in *fastq.gz;
do mv "$filename" "$(echo "$filename" | sed 's/_S[0-9]*_/_/')"; done
#to remove the _001 suffix:
for filename in *fastq.gz; do mv "$filename" "$(echo "$filename" | sed 's/_001.fastq.gz/.fastq.gz/')"; done
#to remove the lane tag (_L001 etc.):
for filename in *fastq.gz; do mv "$filename" "$(echo "$filename" | sed 's/_L00[0-9]_/_/')"; done
```
Trimmomatic filters out low-quality reads, removes adapter sequences, etc. The Trimmomatic conda environment needs to be activated before running it.
```bash=
conda activate trimmomatic
```
:::spoiler Trimmomatic Arguments
- `PE -phred33` runs Trimmomatic in paired-end mode and specifies the Phred+33 quality-score encoding used in the raw reads
- `input_R1.fastq.gz` and `input_R2.fastq.gz` are the 2 input files (the raw read files) that trimmomatic acts on
- `output_R1_paired.fastq.gz output_R1_unpaired.fastq.gz output_R2_paired.fastq.gz output_R2_unpaired.fastq.gz` are the 4 output files that are produced, 2 paired read files and 2 unpaired read files
- The remaining arguments inform Trimmomatic's decisions about keeping/removing sequences
:::
```bash=
#This loop appends one Trimmomatic command per sample pair to trimmomatic.sh
#Run it in the folder of renamed read files
for _R1_ in *_R1*; do
R2=${_R1_//_R1.fastq.gz/_R2.fastq.gz}
R1p=${_R1_//.fastq.gz/_paired.fastq.gz}
R1u=${_R1_//.fastq.gz/_unpaired.fastq.gz}
R2p=${R2//.fastq.gz/_paired.fastq.gz}
R2u=${R2//.fastq.gz/_unpaired.fastq.gz}
echo java -jar /software/trimmomatic/0.39/trimmomatic-0.39.jar PE -phred33 $_R1_ $R2 $R1p $R1u $R2p $R2u ILLUMINACLIP:/software/trimmomatic/0.39/adapters/TruSeq3-PE.fa:2:30:10:2:true LEADING:10 TRAILING:10 SLIDINGWINDOW:4:20 MINLEN:40 >> trimmomatic.sh
done
```
This should be run through Slurm. Create a Slurm submission script using nano and name it `slurm_trimmomatic.sh`:
```
#!/bin/bash
#SBATCH --account=p31939
#SBATCH --partition=normal
#SBATCH --time=24:00:00
#SBATCH --mem=1GB
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --job-name=trimmomatic
#SBATCH --output=trim.out
#SBATCH --error=trimm.error
#SBATCH --mail-type=ALL
#SBATCH --mail-user=brendanconnolly2023@u.northwestern.edu
bash ./trimmomatic.sh
```
Trimmomatic cannot utilize more than 1GB of memory, so requesting more does not make it run faster. I had ~250 files (~168GB of data), which took a little over 14 hours to run.
To run it, use the sbatch command:
```bash=
sbatch slurm_trimmomatic.sh
```
After Trimmomatic runs you will have the original files, R1 and R2 paired files, and R1 and R2 unpaired files.
Examples:
HE1_R1.fastq.gz
HE1_R2.fastq.gz
HE1_R1_paired.fastq.gz
HE1_R2_paired.fastq.gz
HE1_R1_unpaired.fastq.gz
HE1_R2_unpaired.fastq.gz
For HybPiper you only need the R1_paired.fastq.gz, R2_paired.fastq.gz, and a concatenated unpaired file.
To concatenate the two unpaired files into one
```bash=
for file in *R1_unpaired*
do
file2=${file//_R1_/_R2_} #matching R2 unpaired file
file3=${file//R1_unpaired/unpaired} #combined output name
cat $file $file2 > $file3
done
```
Take a moment to organize your files. Make a new directory ./raw_reads and, within it, a directory ./unconcatenated_unpaired_reads, then move the originals into them.
Now you should only have these three files per sample:
HE1_R1_paired.fastq.gz
HE1_R2_paired.fastq.gz
HE1_unpaired.fastq.gz
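A sketch of that cleanup (directory layout as described above):
```bash=
#Stash the unconcatenated unpaired files, then the original raw reads
mkdir -p raw_reads/unconcatenated_unpaired_reads
mv *_R1_unpaired.fastq.gz *_R2_unpaired.fastq.gz raw_reads/unconcatenated_unpaired_reads/
mv *_R1.fastq.gz *_R2.fastq.gz raw_reads/
```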
## 3.) Alignment - BWA
BWA allows us to align our reads to a reference. We're going to align reads to the FASTA file of the 96 loci that we created primers for.
First, let's install BWA and Samtools:
```
conda create --name BWA
conda activate BWA
conda install bioconda::bwa
conda install bioconda::samtools
```
### Indexing
```
bwa index Final96_loci_sequences.fa
```
### Alignment
Testing on a single sample
```
bwa mem -M -t 4 Final96_loci_sequences.fa 69-F6-BU-3_R1_paired.fastq.gz 69-F6-BU-3_R2_paired.fastq.gz > 69-F6-BU-3.sam
#Looking at how successful mapping was
samtools flagstat 69-F6-BU-3.sam
#Turning sam into bam (more storage efficient)
samtools view 69-F6-BU-3.sam -b -o 69-F6-BU-3.bam
#Sorting bam file
samtools sort 69-F6-BU-3.bam -o 69-F6-BU-3_sort.bam
```
Looks good; let's try it on all samples now. Make the following script called align_sort.sh and a new folder called ./align, then run the script through Slurm.
This takes a while to run: I split my 266 samples into groups of ~26 by moving them into separate directories, then ran the script below on each directory so that BWA would finish faster.
BWA also allows multiple threads (Quest allows up to 28); change -t to 12 and ntasks-per-node to 12.
```
#!/bin/bash
#SBATCH --account=p31939 ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=normal ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time=48:00:00 ## Required: How long will the job need to run (remember different partitions have restrictions on this parameter)
#SBATCH --nodes=1 ## how many computers/nodes do you need (no default)
#SBATCH --ntasks-per-node=12 ## how many cpus or processors do you need on per computer/node (default value 1)
#SBATCH --mem-per-cpu=6G ## how much RAM do you need per computer/node (this affects your FairShare score so be careful to not ask for more than you need))
#SBATCH --job-name=bwa_align ## When you run squeue -u
#SBATCH --mail-type=ALL ## BEGIN, END, FAIL or ALL
#SBATCH --mail-user=brendanconnolly2023@u.northwestern.edu
#SBATCH --error=bwa_align.error # Enter a error log file name
#SBATCH --output=bwa_align.out # Enter your out.log file name
module purge
eval "$(conda shell.bash hook)"
conda activate /home/bjc8165/.conda/envs/BWA
INDS=($(for i in ./*R1_paired.fastq.gz; do echo $(basename ${i%_R*}); done))
for IND in ${INDS[@]};
do
# declare variables
FORWARD=./${IND}_R1_paired.fastq.gz
REVERSE=./${IND}_R2_paired.fastq.gz
OUTPUT=./align/${IND}_sort.bam
# then align and sort
echo "Aligning $IND with bwa"
bwa mem -M -t 12 Final96_loci_sequences.fa $FORWARD $REVERSE | samtools view -b | \
samtools sort -T ${IND} > $OUTPUT
done
```
## 4.) PrimerClip
Primerclip is used to further remove low-quality bases and PCR primer sequences from trimmed and aligned reads: https://github.com/swiftbiosciences/primerclip
We need a masterfile of the loci of interest and PCR primers; IDT sent me one directly for my loci. The Stack Exchange post linked below has more info on the masterfile.
:::spoiler Format of masterfile is a .txt with 12 columns:
1) Chromosome
2) Target Start
3) Target Stop
4) Target Name (changing this affects everything downstream in counting)
5) 5' Primer Start
6) 5' Primer Stop
7) 5' Primer Name
8) 3' Primer Start
9) 3' Primer Stop
10) 3' Primer Name
11) 5' Primer Sequence
12) 3' Primer Sequence
:::
A BEDPE file can also be used. https://bioinformatics.stackexchange.com/questions/5489/what-is-the-primer-masterfile-format
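A purely hypothetical example row to show the layout (tab-separated; locus names, coordinates, and primer sequences are illustrative only, not from this project):
```
#hypothetical values - columns 1-12 as listed above
locus1	120	320	locus1_target	100	119	locus1_F	321	340	locus1_R	ACGTACGTACGTACGTACGT	TGCATGCATGCATGCATGCA
```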
Primerclip expects input files to be .sam, so we have to turn the .bam files back into .sam, then name-sort each; then we can run Primerclip.
Before running, let's make a new conda env called primerclip:
```
conda create -n primerclip
conda activate primerclip
conda install bioconda::primerclip
conda install bioconda/label/cf201901::samtools
#Ran into samtools error: "Samtools shared library libcrypto.so.1.0.0 not found". Found this solution where you replace libcrypto.so.3 with libcrypto.so.1.0.0
cd /home/bjc8165/miniconda3/envs/primerclip/lib
ln -s libcrypto.so.3 libcrypto.so.1.0.0
```
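The bam-to-sam conversion isn't in the script below, so here's a minimal sketch (run in the directory of *_sort.bam files from the alignment step):
```
#Convert each sorted BAM back to SAM (keeping the header) for Primerclip
for f in *_sort.bam; do
samtools view -h "$f" -o "${f%.bam}.sam"
done
```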
Make the following script for running Primerclip - PrimerClip.sh.
The .sam files are huge and 1TB of space on Quest was not enough storage, so I transferred everything to Curie and ran the script below.
```
#!/bin/bash
#Load Conda Environment
source /home/bjc8165/miniconda3/bin/activate primerclip
INDS=($(for i in ./*_sort.sam; do echo $(basename ${i%_sort*}); done)) #declares our individual names
for IND in ${INDS[@]};
do
samtools sort -n ${IND}_sort.sam -o ${IND}_namesort.sam;
primerclip Primer_Sequences_DelphiniumAmplicon.txt ${IND}_namesort.sam ${IND}_clipped.sam;
samtools view -b ${IND}_clipped.sam -o ${IND}_clipped.bam;
samtools sort ${IND}_clipped.bam -o ${IND}_clipped_sort.bam;
done
```
Let's remove the intermediary files and make md5sum files of everything here for when we transfer it back to Quest:
```
#Remove intermediary name-sorted files
rm *name* *clipped.sam *clipped.bam
#Organize Files
mv *clipped_sort.bam /home/bjc8165/clipped
mv *.log /home/bjc8165/clipped
#Generate md5sum files compiling md5 hashes for everything in a directory
#(run the first in the directory of name-sorted SAMs, the second in the clipped directory)
find -type f -exec md5sum '{}' \; > curie_samsorted_md5sum.txt
find -type f -exec md5sum '{}' \; > curie_samclipped_md5sum.txt
#Run in powershell to transfer md5 files to local
scp bjc8165@10.2.0.53:/home/bjc8165/PrimerClip/curie_samsorted_md5sum.txt C:/Users/Brendan/Downloads
scp bjc8165@10.2.0.53:/home/bjc8165/clipped/curie_samclipped_md5sum.txt C:/Users/Brendan/Downloads
```
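After copying everything back to Quest, the transfer can be verified against the recorded hashes (a sketch; run from the directory that mirrors the one hashed on Curie):
```
#Verify each file against its recorded md5 hash; mismatches are reported
md5sum -c curie_samclipped_md5sum.txt
```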
:::spoiler TrimAl - MAY NOT NEED TO DO
When sequences are aligned, a lot of gaps are introduced, especially towards the beginnings/ends of sequences. This is because all sequences must end up the same length, which is at minimum the length of the longest sequence in the file. We can remove base-pair positions that are mostly gaps using trimAl; use -gt 0.5 to remove positions where more than 50% of sequences have a gap. Note I installed trimAl in my conda environment instead of using modules on Quest.
Before running trimAl, create a conda environment and install it. First create the environment:
```
conda create --name trimal
#Then activate the environment
conda activate trimal
#Then download trimal
conda install trimal
```
Now you can run the script below
```
#!/bin/bash
#SBATCH --account=p32038
#SBATCH --partition=short
#SBATCH --time=00:30:00
#SBATCH --mem=2G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --job-name=lob_trim
#SBATCH --error=trim.log
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephangirard2024@u.northwestern.edu
for i in *.fasta; do
trimal -in $i -out ${i//aligned/trim} -gt 0.5
done
```
:::
## 5.) On-target and coverage uniformity calculation using Picard tools
We first need to remove duplicate reads from the dataset to avoid PCR duplicates and technical duplicates, which inflate our sequencing depth and give us false confidence in the genotype calls. We can use Picard Tools to do this: https://broadinstitute.github.io/picard/ OR GATK, which has Picard built in: https://github.com/broadinstitute/gatk
```
conda create -n GATK
conda activate GATK
conda install bioconda::gatk4
conda install bioconda::picard
module load bamtools
```
Calculating depth of coverage at each locus can be done with Picard or Bedtools. First we'll use Bedtools, as it's easier to use:
```
###############################################
##Use bedtools to check coverage of each locus##
###############################################
#We need to first turn our fasta file into a .dict file
picard CreateSequenceDictionary R=Final96_loci_sequences.fa O=Final96_loci_sequences.dict
#Then Index our fasta file
samtools faidx Final96_loci_sequences.fa
#Coverage with bedtools
#Make a new interval file with loci names pulled from a bed file - The coordinates for the target loci here are 1 to the end of the loci as we are also interested in variant sites on either side of the original SNP
samtools view -H 108-109_NF1_clipped_sort.bam | grep SN | cut -f 2,3 | sed 's/SN://g' | sed 's/LN://g' | awk '{print $1 "\t1\t" $2}' > bedtools_interval.bed
#Index each bam file
samtools index 109-F4-BU-3_clipped_sort.bam
#Checking coverage with bedtools
bedtools coverage -a bedtools_interval.bed -b 109-F4-BU-3_clipped_sort.bam
#For loop to do everything at once:
module load bedtools
INDS=($(for i in ./*_clipped_sort.bam; do echo $(basename ${i%_clipped*}); done)) #declares our individual names
for IND in ${INDS[@]};
do
OUTPUT=./coverage_logs/${IND}_coverage_log
samtools index ${IND}_clipped_sort.bam;
bedtools coverage -a bedtools_interval.bed -b ${IND}_clipped_sort.bam > $OUTPUT;
done
##################################################
##Merging all samples into one and checking coverage##
##################################################
samtools merge -n allfiles.bam *.bam
###############################################################
### Work in Progress: Extra code, if you're using PicardTools##
###############################################################
#We need to make a interval_list that has loci name and position of loci
#We only need first three columns from the sequence.txt file
cut -f 1-3 Primer_Sequences_DelphiniumAmplicon.txt > Primer_Sequences_DelphiniumAmplicon.bed
picard BedToIntervalList -I Primer_Sequences_DelphiniumAmplicon.bed -O interval_list -SD Final96_loci_sequences.fa
#Remove headers from interval file, Picard needs headers, but Bamtools does not
#grep -v "^@" test.interval_list > no_headers.interval_list
#Duplicate the interval list and call it the target list
cp test.interval_list test.target_list
picard CollectTargetedPcrMetrics -I 108-109_NF1_clipped_sort.bam -O test_pcr_metrics.txt -AI test.interval_list -TI test.target_list
```
Picard lets you calculate a ton of statistics - https://broadinstitute.github.io/picard/picard-metric-definitions.html#TargetedPcrMetrics . Here are some that might be useful from the Targeted PCR Metrics section:
```
CUSTOM_AMPLICON_SET #The name of the amplicon set used in this metrics collection run
PF_UNIQUE_READS #The number of PF_READS that were not marked as sample or optical duplicates.
ON_AMPLICON_BASES #The number of PF_BASES_ALIGNED that mapped to an amplified region of the genome.
NEAR_AMPLICON_BASES #The number of PF_BASES_ALIGNED that mapped to within a fixed interval of an amplified region, but not on a baited region.
PCT_AMPLIFIED_BASES #The fraction of PF_BASES_ALIGNED that mapped to or near an amplicon, (ON_AMPLICON_BASES + NEAR_AMPLICON_BASES)/PF_BASES_ALIGNED.
MEAN_AMPLICON_COVERAGE #The mean read coverage of all amplicon regions in the experiment.
MEAN_TARGET_COVERAGE #The mean read coverage of all target regions in an experiment.
MEDIAN_TARGET_COVERAGE #The median coverage of reads that mapped to target regions of an experiment.
FOLD_ENRICHMENT #The fold by which the amplicon region has been amplified above genomic background.
PCT_EXC_DUPE #The fraction of aligned bases that were filtered out because they were in reads marked as duplicates.
#To check a bam for formatting problems before collecting metrics:
gatk ValidateSamFile -I input.bam -MODE SUMMARY
```
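To eyeball these for one sample, the header row and values row of the Picard metrics file can be paired up (a sketch, assuming the test_pcr_metrics.txt produced above; the column names sit on the line starting with CUSTOM_AMPLICON_SET and the values on the next line):
```
#Print metric_name<TAB>value pairs from the CollectTargetedPcrMetrics output
grep -A1 "^CUSTOM_AMPLICON_SET" test_pcr_metrics.txt | \
awk -F'\t' 'NR==1 {for (i=1; i<=NF; i++) h[i]=$i} NR==2 {for (i=1; i<=NF; i++) print h[i] "\t" $i}'
```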
Removing duplicates with Picard MarkDuplicates (`inds` is a text file listing one sample basename per line):
```
cat inds | parallel --verbose -j 10 \
java -Xmx1g -jar /home/scripts/picard.jar \
MarkDuplicates REMOVE_DUPLICATES=true \
ASSUME_SORTED=true VALIDATION_STRINGENCY=SILENT \
MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 \
INPUT={}.bam \
OUTPUT={}.rmd.bam \
METRICS_FILE={}.rmd.bam.metrics
# Now we need to index all bam files again and that's it!
for f in *.rmd.bam; do samtools index $f; done
```
## 6.) Variant Calling with STACKS - Code works but threw away too many loci
:::spoiler Variant Calling - STACKS ref_map.pl
```
#Loading stacks as a new conda env
conda create -n STACKS
conda activate STACKS
conda install bioconda::stacks
```
Navigate to the folder with the .bam files, then run:
```
ref_map.pl --samples ./ --popmap ./pop_map.txt -s clipped.sort --out-path ./stacks_r0.2 -X "populations:-r 0.2" -X "populations:--fasta-loci" -X "populations:--fasta-samples" -X "populations:--vcf" -X "populations:--genepop" -X "populations:--fstats" -X "populations:--hwe"
```
Gstacks fails when you try to run it on all samples. I had to make a population map **with just the 80 largest samples for it to run**. Apparently this is a common problem with samples with low read depth.
The only loci that Stacks output were SNPs in the microsatellite sites. This isn't helpful, since we analyze those separately below. Let's try GATK.
:::
## 6.) Variant calling - GATK
Scripts to run the GATK HaplotypeCaller. Execute each script in the directory with the .bam files.
#### Setup for GATk
```
# Create conda env for GATk4
conda create -n GATk4
conda activate GATk4
conda install -c bioconda gatk4
conda install bioconda::samtools
#Make directory for output files
mkdir GATk
```
#### Optional: Remove microsatellite reads before processing - faster
Name the script rm_micro.sh:
```
#!/bin/bash
module purge
eval "$(conda shell.bash hook)"
conda activate /home/bjc8165/.conda/envs/GATk4
#Removes any reads mapped to loci whose names start with K, which all our microsat loci do
INDS=($(for i in ./*sort.bam; do echo $(basename ${i%_sort*}); done))
for IND in ${INDS[@]};
do
# declare variables
INP=${IND}_sort.bam
OUTPUT=${IND}_nomicro.bam
samtools view -h $INP | awk '$3 !~ /^K/' | samtools view -Sb > $OUTPUT
done
# Sorts each bam file.
INDS=($(for i in ./*nomicro.bam; do echo $(basename ${i%_nomicro.bam}); done))
for IND in ${INDS[@]};
do
# declare variables
INP=${IND}_nomicro.bam
OUTPUT=${IND}_nomicro_sort.bam
samtools sort $INP -o $OUTPUT
done
rm *nomicro.bam
```
#### Add sample names to @RG tags - rename.sh
```
#!/bin/bash
module purge
eval "$(conda shell.bash hook)"
conda activate /home/bjc8165/.conda/envs/GATk4
INDS=($(for i in ./*nomicro_sort.bam; do echo $(basename ${i%_nomicro_sort*}); done))
for IND in ${INDS[@]};
do
# declare variables
INP=${IND}_nomicro_sort.bam
OUTPUT=${IND}_rename.bam
gatk AddOrReplaceReadGroups \
-I $INP \
-O $OUTPUT \
-RGID $IND \
-RGLB 1 \
-RGPL ILLUMINA \
-RGPU ${IND}_1 \
-RGSM $IND
done
```
#### Index the renamed files - index.sh
```
#!/bin/bash
module purge
eval "$(conda shell.bash hook)"
conda activate /home/bjc8165/.conda/envs/GATk4
INDS=($(for i in ./*rename.bam; do echo $(basename ${i%_r*}); done))
for IND in ${INDS[@]};
do
# declare variables
INP=./${IND}_rename.bam
echo "Indexing $IND with samtools"
samtools index $INP
done
```
#### Run GATK HaplotypeCaller - gatk_haplo.sh
```
#!/bin/bash
module purge
eval "$(conda shell.bash hook)"
conda activate /home/bjc8165/.conda/envs/GATk4
# Define variables
INDS=($(for i in ./*rename.bam; do echo $(basename ${i%_rename.bam*}); done)) #declares our individual names
for IND in ${INDS[@]}; do
gatk HaplotypeCaller \
-R Final96_loci_sequences.fa \
-I ${IND}_rename.bam \
-O ./GATk/${IND}.g.vcf.gz \
-ERC GVCF
done
```
#### Combine gVCFs - gatk_combine.sh
```
#!/bin/bash
module purge
eval "$(conda shell.bash hook)"
conda activate /home/bjc8165/.conda/envs/GATk4
# Initialize the GVCF list
GVCF_LIST=""
# Loop through all GVCF files in the directory
for GVCF in ./GATk/*.g.vcf.gz; do
GVCF_LIST+=" --variant ${GVCF}"
done
gatk CombineGVCFs \
-R Final96_loci_sequences.fa \
$GVCF_LIST \
-O ./GATk/combined.g.vcf.gz
```
#### Genotype the combined gVCF - gatk_geno_gvcf.sh
```
#!/bin/bash
module purge
eval "$(conda shell.bash hook)"
conda activate /home/bjc8165/.conda/envs/GATk4
gatk --java-options "-Xmx8g" GenotypeGVCFs \
-R Final96_loci_sequences.fa \
-V ./GATk/combined.g.vcf.gz \
-O ./GATk/final.vcf.gz
```
#### To run all scripts at once, do this
```
#!/bin/bash
#SBATCH --account=p31939 ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=normal ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time=16:00:00 ## Required: How long will the job need to run (remember different partitions have restrictions on this parameter)
#SBATCH --nodes=1 ## how many computers/nodes do you need (no default)
#SBATCH --ntasks-per-node=1 ## how many cpus or processors do you need on per computer/node (default value 1)
#SBATCH --mem-per-cpu=4G ## how much RAM do you need per computer/node (this affects your FairShare score so be careful to not ask for more than you need)
#SBATCH --mail-type=ALL ## BEGIN, END, FAIL or ALL
#SBATCH --mail-user=brendanconnolly2023@u.northwestern.edu
#SBATCH --error=gatk_all_noprimerclip.error # Enter a error log file name
#SBATCH --output=gatk_all_noprimerclip.out # Enter your out.log file name
bash ./rm_micro.sh
bash ./rename.sh
bash ./index.sh
bash ./gatk_haplo.sh
bash ./gatk_combine.sh
bash ./gatk_geno_gvcf.sh
```
## 7.) VCF Filtering
```
conda create -n vcftools
conda activate vcftools
conda install bioconda::vcftools
cd ./GATk
gunzip final.vcf.gz
mkdir vcftools_stats
```
#### Stats on unfiltered VCF to get VCF filters
```
#!/bin/bash
module purge
eval "$(conda shell.bash hook)"
conda activate /home/bjc8165/.conda/envs/vcftools
OUT=./vcftools_stats/final
#calculate allele frequency - exclude any sites with more than 2 alleles
vcftools --vcf final.vcf --freq2 --out $OUT --max-alleles 2
#Calculate mean depth per individual
vcftools --vcf final.vcf --depth --out $OUT
#Calculate mean depth per site
vcftools --vcf final.vcf --site-mean-depth --out $OUT
#calculate site quality
vcftools --vcf final.vcf --site-quality --out $OUT
#Calculate proportion of missing data per individual
vcftools --vcf final.vcf --missing-indv --out $OUT
#Calculate proportion of missing data per site
vcftools --vcf final.vcf --missing-site --out $OUT
#Calculate heterozygosity and inbreeding coefficient per individual
vcftools --vcf final.vcf --het --out $OUT
```
#### Analyze stats in R
Download each file and place it in your R working directory (the scripts below use the tidyverse).
##### R Script: Stats across variants
```
library(tidyverse) #provides read_delim, dplyr, and ggplot2 used below
##########variant quality
#This is Phred-encoded - a Phred score of 30 = a 1 in 1000 chance that a SNP call is erroneous
#Overall quality is very high. Let's just set the filter to 30
var_qual <- read_delim("./SNP_OverallStats/final.lqual", delim = "\t",
col_names = c("chr", "pos", "qual"), skip = 1)
a <- ggplot(var_qual, aes(qual)) +
geom_density(fill = "dodgerblue1", colour = "black", alpha = 0.3)+
scale_x_continuous(limits = c(0, 1000))
a + theme_light()
##########Variant mean depth
#This is the mean of the read depth across all individuals - it is for both alleles at a position
#Mean = 9.39. Median = 0.72. Max = 251. Most are between 0 and 3 dp
# As the outliers show, some regions clearly have extremely high coverage, likely reflecting mapping/assembly errors and paralogous or repetitive regions. We should exclude them. A good rule of thumb is the mean depth x 2 - so in this case we could set our maximum depth at ~20x.
# Many sites have such a low read coverage, we likely can't be too confident in these calls. Let's follow a good rule of thumb with a minimum depth of 10x
var_depth <- read_delim("./SNP_OverallStats/final.ldepth.mean", delim = "\t",
col_names = c("chr", "pos", "mean_depth", "var_depth"), skip = 1)
summary(var_depth$mean_depth)
b <- ggplot(var_depth, aes(mean_depth)) +
geom_density(fill = "dodgerblue1", colour = "black", alpha = 0.3) +
scale_x_continuous(limits = c(0, 115))
b + theme_light()
#########Variant missingness - REVISIT
# the proportion of missingness at each variant. This is a measure of how many individuals lack a genotype at a call site
#Lots of missing data. Mean = 87% missing, median = 99% missing
var_miss <- read_delim("./SNP_OverallStats/final.lmiss", delim = "\t",
col_names = c("chr", "pos", "nchr", "nfiltered", "nmiss", "fmiss"), skip = 1)
summary(var_miss$fmiss)
c <- ggplot(var_miss, aes(fmiss)) + geom_density(fill = "dodgerblue1", colour = "black", alpha = 0.3)
c + theme_light()
########## Minor allele frequency - REVISIT
# mean=.166, median = 0.063
# 266 samples = 532 possible alleles at a site
# The average MAF therefore means an allele appears in ~88 allele copies (.166 * 532)
# A MAF of 0.1 then means an allele appears in ~53 allele copies
#It is best practice to produce one dataset with a good MAF threshold and keep another without any MAF filtering at all.
#For now however, we will set our MAF to 0.1
var_freq <- read_delim("./SNP_OverallStats/final.frq", delim = "\t",
col_names = c("chr", "pos", "nalleles", "nchr", "a1", "a2"), skip = 1)
var_freq$maf <- var_freq %>%
dplyr::select(a1, a2) %>%
apply(1, function(z) min(z)) #Find minor allele freq.
summary(var_freq$maf)
d <- ggplot(var_freq, aes(maf)) + geom_density(fill = "dodgerblue1", colour = "black", alpha = 0.3)
d + theme_light()
```
##### Stats across samples
```
#######Mean depth per individual
#mean = 9.38, median = 5.36, max = 68.25
#There are around 3 samples that look like large outliers we may want to exclude - really high read depth can bias results
#8_Flav, 102-F2-GH-37, 109-F4-BU-3
ind_depth <- read_delim("./SNP_OverallStats/final.idepth", delim = "\t",
col_names = c("ind", "nsites", "depth"), skip = 1)
summary(ind_depth$depth)
e <- ggplot(ind_depth, aes(depth)) +
geom_histogram(fill = "dodgerblue1", colour = "black", alpha = 0.3, bins = 100)
e + theme_light()
########Proportion of missing data per individual
#mean = 87.2%, median = 89.2%, min = 53.0%
#No surprise, but we have a lot of missing data that we should filter out.
ind_miss <- read_delim("./SNP_OverallStats/final.imiss", delim = "\t",
col_names = c("ind", "ndata", "nfiltered", "nmiss", "fmiss"), skip = 1)
summary(ind_miss$fmiss)
f <- ggplot(ind_miss, aes(fmiss)) +
geom_histogram(fill = "dodgerblue1", colour = "black", alpha = 0.3, bins = 100)
f + theme_light()
##########Heterozygosity and inbreeding coefficient per individual
#A couple samples have really negative Fis, which we should consider removing - indicative of allelic dropout
#Several also have really high Fis of 1.0, which we should consider removing too.
ind_het <- read_delim("./SNP_OverallStats/final.het", delim = "\t",
col_names = c("ind","ho", "he", "nsites", "f"), skip = 1)
g <- ggplot(ind_het, aes(f)) +
geom_histogram(fill = "dodgerblue1", colour = "black", alpha = 0.3, bins = 50)
g + theme_light()
```
#### VCF Filters summary:
**--remove-indels** - remove all indels (SNPs only)
**--remove top 25% of samples with missing data**
**--maf 0.05** - set minor allele frequency - 0.05 here
**--max-missing 0.5** - set minimum non-missing data. A little counterintuitive - 0 is totally missing, 1 is none missing. Here 0.5 means we will tolerate 50% missing data for a site. LOWERING THIS INCREASES SITES
**--minQ 30** - The minimum quality score required for a site to pass our filtering threshold.
**--min-meanDP 5** - the minimum mean depth for a site. This is what's filtering out a lot of SITES. 10 IDEAL.
**--max-meanDP 80** - the maximum mean depth for a site. 80 might be a bit high. 40 filters out everything
**--minDP 5** - the minimum depth allowed for a genotype - any individual failing this threshold is marked as having a missing genotype. 10 would be ideal.
**--maxDP 80** - the maximum depth allowed for a genotype - any individual failing this threshold is marked as having a missing genotype. 40 would be better
**Let's also remove the 25% of samples with most missing data**
:::spoiler rm.ind
```
92-F1-GH-6
125B-F13-BU-6
69-F6-BU-7
92-F1-GH-7
70-71_NF1
125B-F13-BU-5
3-F6-GH-13
52-F5-BU-7
65-F3-BU-4
69-F7-BU-2
92-F1-GH-4
125B-F13-BU-7
70B-F2-GH-20
70B-F2-GH-23
173_Control
70A-F1-GH-17
102-F2-GH-9
109-F3-BU-4
109-F3-BU-7
68-F3-BU-9
69-F6-BU-1
125B-F13-BU-8
170_Control
101-F3-BU-1
125B-F13-BU-3
186_Control
45-F2-GH-1
45-F2-GH-17
60_NF1
70B-F2-GH-24
84_Appo
100-104_NF1
101-F1-GH-10
109-F4-BU-5
44-F2-BU-3
52-F3-GH-8
52-F5-BU-5
60-F7-BU-8
64-F6-BU-5
65-F3-BU-3
70A-F1-GH-4
184_Control
3-F8-BU-2
52-F4-GH-17
60-F5-GH-25
60-F7-BU-1
60-F7-BU-6
69-F7-BU-7
70B-F2-GH-19
91-F1-GH-14
92-F2-BU-2
101-F1-GH-6
102_Appo
109-F4-BU-2
126_Flav
28-F8-BU-2
3-F8-GH-4
4-F7-BU-1
60-F6-BU-1
65-F2-GH-27
91-F1-GH-13
69-F6-BU-5
83-84_NF1
101-F1-GH-13
109-F3-BU-5
34_Appo_Flav
52-F5-BU-6
109-F3-BU-2
185_Control
104-F3-BU-3
91-F2-BU-6
102-F2-GH-20
104-F3-BU-5
66B-F6-BU-8
100_Appo
92-F1-GH-6
8_Flav
102-F2-GH-37
109-F4-BU-3
```
:::
Applying filters
```
#Remove the samples with a lot of missing data
vcftools --vcf final.vcf --remove rm.ind --recode --recode-INFO-all --out subset
vcftools --vcf subset.recode.vcf --remove-indels --maf 0.05 --max-missing 0.50 --minQ 30 --min-meanDP 5 --max-meanDP 80 --minDP 5 --maxDP 40 --recode --recode-INFO-all --out maf.05_missing0.50_minq30_minDP5_maxmeanDP80_noprimerclip.subset
```
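A quick sanity check before moving to R (a sketch): count the variant sites that survived filtering by counting non-header lines in the recoded VCF.
```
#Count surviving sites (vcftools --out PREFIX writes PREFIX.recode.vcf)
grep -vc "^#" maf.05_missing0.50_minq30_minDP5_maxmeanDP80_noprimerclip.subset.recode.vcf
```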
Download the .vcf and analyze in R. The above filter results in 129 samples across 3 SNPs at 1 locus. Shows flavifrons is greater than Appo.
## Trying GTseq.pl pipeline
Obtained from Campbell et al. 2014: https://onlinelibrary.wiley.com/doi/10.1111/1755-0998.12357
```
#Reusing the repeatseq conda env (created in the RepeatSeq section below)
conda activate repeatseq
conda install git
conda install conda-forge::cmake
conda install bioconda::seqtk #seqtk is needed for the read merging below
#Clone gtseq in chosen directory
git clone https://github.com/GTseq/GTseq-Pipeline.git
#Merge forward and reverse reads
gunzip *.gz
seqtk mergepe *R1.fastq *R2.fastq > combined.fastq
perl /projects/p31939/DENU_Amplicon/snp_gtseq/GTseq-Pipeline/GTseq_Genotyper.pl LocusInfo.csv combined.fastq > combined.genos
```
# Amplicon analysis: Microsats
Paper with pipeline (Pimentel et al. 2018): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5855144/pdf/fgene-09-00073.pdf
Paper with R analysis pipeline (Perry et al. 2024): https://onlinelibrary.wiley.com/doi/10.1111/1755-0998.13933
## 1.) Alignment of only microsat reads
### Bowtie2 - local alignment
Let's try a local alignment, which uses soft-clipping and is a less conservative alignment than end-to-end. We're trying this because the microsat sequences we are aligning to come from other Delphinium taxa, and local alignments may better handle indels (i.e., microsatellite alleles).
#### Additional resources for reference
Biostar post on local alignments: https://www.biostars.org/p/369638/
Post on dealing with soft-clipping: https://www.biostars.org/p/130374/
Pipeline for re-aligning with BWA instead of bowtie: https://github.com/adaptivegenome/repeatseq
Tutorials on variant calling with samtools: https://angus.readthedocs.io/en/2017/variant-calling.html
https://samtools.sourceforge.net/mpileup.shtml
https://cloud.wikis.utexas.edu/wiki/spaces/bioiteam/pages/47729784/Variant+calling+using+SAMtools
https://wurmlab.com/genomicscourse/2016-SIB/practicals/population_genetics/map_call
#### Creating index
First have a fasta file with just the microsat regions
```
conda create -n bowtie2
conda activate bowtie2
conda install bioconda::bowtie2
conda install bioconda::samtools
conda install bioconda::bcftools
bowtie2-build ./micro_sequences.fa micro_sequences
```
#### Performing local alignment
```
#create output directory
mkdir bowtie_local
```
Script to run for alignment - note that samples < 1GB run pretty fast, while larger samples can take much longer. I split my samples into subset directories and ran this script in each folder for efficiency.
```
#!/bin/bash
#SBATCH --account=p31939 ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=normal ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time=8:00:00 ## Required: How long will the job need to run (remember different partitions have restrictions on this parameter)
#SBATCH --nodes=1 ## how many computers/nodes do you need (no default)
#SBATCH --ntasks-per-node=2 ## how many cpus or processors do you need on per computer/node (default value 1)
#SBATCH --mem-per-cpu=8G ## how much RAM do you need per computer/node (this affects your FairShare score so be careful to not ask for more than you need)
#SBATCH --job-name=bowtie_align1 ## When you run squeue -u
#SBATCH --mail-type=ALL ## BEGIN, END, FAIL or ALL
#SBATCH --mail-user=brendanconnolly2023@u.northwestern.edu
#SBATCH --error=bwa_align1.error # Enter a error log file name
#SBATCH --output=bwa_align1.out # Enter your out.log file name
module purge
eval "$(conda shell.bash hook)"
conda activate /home/bjc8165/.conda/envs/bowtie2
INDS=($(for i in ./*R1_paired.fastq.gz; do echo $(basename ${i%_R*}); done))
for IND in ${INDS[@]};
do
# declare variables
FORWARD=./${IND}_R1_paired.fastq.gz
REVERSE=./${IND}_R2_paired.fastq.gz
OUTPUT=./bowtie_local/${IND}_sort.bam
# then align and sort
echo "Aligning $IND with bowtie2"
bowtie2 -p 2 -q --very-sensitive-local -x micro_sequences -1 $FORWARD -2 $REVERSE --rg-id $IND --rg SM:$IND --rg LB:1 --rg PU:${IND}_1 --rg PL:ILLUMINA | samtools view -b | samtools sort -T ${IND} > $OUTPUT
done
```
### Marking Duplicate Reads
Downstream programs seem to expect this. The command also creates an index file.
```
#Installing GATk4
conda create -n GATk4
conda activate GATk4
conda install -c bioconda gatk4
conda deactivate
conda activate GATk4
cd bowtie_local
mkdir dup_marked
#Marking Duplicates
#!/bin/bash
eval "$(conda shell.bash hook)"
conda activate /home/bjc8165/.conda/envs/GATk4
INDS=($(for i in ./*sort.bam; do echo $(basename ${i%_sort.bam}); done))
for IND in ${INDS[@]};
do
# declare variables
INP=${IND}_sort.bam
OUTPUT=./dup_marked/${IND}_marked.bam
# then align and sort
echo "Marking Duplicates in $IND"
gatk --java-options "-Xmx8G" MarkDuplicates --MAX_OPTICAL_DUPLICATE_SET_SIZE -1 --CREATE_INDEX true -I $INP -O $OUTPUT -M ${IND}_dup_metrics.txt
done
```
**The two largest files run into core-dumping errors when marking duplicates** (102-F2-GH-37_sort.bam, 8_Flav_sort.bam). Not sure how to fix; these files were dropped.
## 2.) Local re-alignment around indels
Microsatellite length variants are essentially indels, which can cause errors with microsat variant calling. We want to identify variants in/around each microsat region and then re-align.
### Index files
Make sure you have the fasta and its indexes in this folder.
#### SKIP: Index samples
```
#Indexing all samples
conda deactivate
conda activate bowtie2
INDS=($(for i in ./*realigned.bam; do echo $(basename ${i%_r*}); done))
for IND in ${INDS[@]};
do
# declare variables
INP=./${IND}_realigned.bam
echo "Indexing $IND with samtools"
samtools index $INP
done
```
#### Index fasta
```
conda activate bowtie2
samtools faidx micro_sequences.fa
```
### Samtools Variant Calling - want VCF of only indels
I need to do more research on which filters to apply, but this base code at least gets you a VCF of variant sites.
Notes here: https://www.biostars.org/p/9466154/
```
bcftools mpileup -Ou -d 8000 -x -B -h 500 -m 3 -f micro_sequences.fa *.bam | bcftools call -V snps -mv -Ov -o calls_-d8000_-B_-h500_-m3.vcf
#Shows how many indels were found
grep -c "INDEL" calls_-d8000_-B_-h500_-m3.vcf
```
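The re-alignment step below takes a `-known` VCF of indels only; a sketch of extracting just the indel records with bcftools (file names here follow the parameter-tagged naming used above):
```
#Keep only indel records for RealignerTargetCreator/IndelRealigner
bcftools view -v indels calls_-d8000_-B_-h500_-m3.vcf -Ov -o calls_-d8000_-B_-h500_-m3_INDEL.vcf
```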
### Realigning reads around indels
GATK has RealignerTargetCreator and IndelRealigner to handle this. Note that these tools were removed from GATK versions after 3.6, so we'll install 3.6 for these tools and GATK4 for subsequent tools.
Tutorial:
https://github.com/broadinstitute/gatk-docs/blob/master/gatk3-tutorials/(howto)_Perform_local_realignment_around_indels.md
#### Setup
```
#Installing GATk3.6
conda create -n GATk
conda activate GATk
conda install -c bioconda gatk=3.6
conda deactivate
#Make a .dict file of reference fasta
conda activate GATk4
gatk CreateSequenceDictionary -R micro_sequences.fa
conda deactivate
```
#### Running Realigner Target Creator
:::spoiler script
```
#!/bin/bash
#SBATCH --account=p31939 ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=short ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time=1:30:00 ## Required: How long will the job need to run (remember different partitions have restrictions on this parameter)
#SBATCH --nodes=1 ## how many computers/nodes do you need (no default)
#SBATCH --ntasks-per-node=1 ## how many cpus or processors do you need on per computer/node (default value 1)
#SBATCH --mem-per-cpu=2G ## how much RAM do you need per computer/node (this affects your FairShare score so be careful to not ask for more than you need)$
#SBATCH --mail-type=ALL ## BEGIN, END, FAIL or ALL
#SBATCH --mail-user=brendanconnolly2023@u.northwestern.edu
#SBATCH --error=IndelTargetCreator_d4000_B_x_h500_m3.error # Enter a error log file name
#SBATCH --output=IndelTargetCreator_d4000_B_x_h500_m3.out # Enter your out.log file name
eval "$(conda shell.bash hook)"
conda activate /home/bjc8165/.conda/envs/GATk
gatk3 -T RealignerTargetCreator -R micro_sequences.fa -known calls_-d4000_-B_-x_-h500_-m3_INDEL.vcf \
-I 109-F4-BU-3_marked.bam \
-I 68-F3-BU-2_marked.bam \
-I 34A-F1-GH-9_marked.bam \
-I 69-F7-BU-3_marked.bam \
-I 69-F7-BU-4_marked.bam \
-I 52-F5-BU-3_marked.bam \
-I 91-F2-BU-3_marked.bam \
-I 4-F7-BU-1_marked.bam \
-I 60-F7-BU-2_marked.bam \
-I 68-F3-BU-5_marked.bam \
-I 60-F7-BU-1_marked.bam \
-I 4_Flav_marked.bam \
-I 92-F2-BU-1_marked.bam \
-I 65-F3-BU-2_marked.bam \
-I 68-F3-BU-4_marked.bam \
-I 29_Appo_marked.bam \
-I 69_Flav_marked.bam \
-I 3-F8-BU-2_marked.bam \
-I 69-F6-BU-4_marked.bam \
-I 69-F7-BU-2_marked.bam \
-I 69-F6-BU-3_marked.bam \
-I 60-F6-BU-1_marked.bam \
-I 60-F6-BU-3_marked.bam \
-I 69-F6-BU-1_marked.bam \
-I 68-F3-BU-3_marked.bam \
-I 64-F6-BU-4_marked.bam \
-I 70A_Appo_marked.bam \
-I 65_Flav_marked.bam \
-I 70B_Appo_marked.bam \
-I 3-F8-BU-1_marked.bam \
-I 67_Flav_marked.bam \
-I 44-F1-GH-45_marked.bam \
-I 66_Flav_marked.bam \
-I 68_Flav_marked.bam \
-I 52_Appo_marked.bam \
-I 45_Appo_marked.bam \
-I 69-F7-BU-1_marked.bam \
-I 91_Appo_marked.bam \
-I 86_Appo_marked.bam \
-I 1_Flav_marked.bam \
-I 92_Appo_marked.bam \
-I 28_Appo_marked.bam \
-I 109-F4-BU-2_marked.bam \
-I 101-F3-BU-2_marked.bam \
-I 92-F2-BU-2_marked.bam \
-I 102-F2-GH-7_marked.bam \
-I 63_Flav_marked.bam \
-I 91-F2-BU-4_marked.bam \
-I 104_Appo_marked.bam \
-I 125B-F13-BU-2_marked.bam \
-I 108_Appo_marked.bam \
-I 2_Flav_marked.bam \
-I 28-F10-BU-1_marked.bam \
-I 109_Appo_marked.bam \
-I 3_Flav_marked.bam \
-I 71_Appo_marked.bam \
-I 44-F1-GH-35_marked.bam \
-I 29-F4-GH-29_marked.bam \
-I 52-F3-GH-17_marked.bam \
-I 71-F2-GH-16_marked.bam \
-I 65-F3-BU-1_marked.bam \
-I 52-F5-BU-2_marked.bam \
-I 66B-F6-BU-1_marked.bam \
-I 3-F6-GH-3_marked.bam \
-I 69-F6-BU-2_marked.bam \
-I 91-F1-GH-19_marked.bam \
-I 71-F2-GH-21_marked.bam \
-I 71-F2-GH-25_marked.bam \
-I 45-F2-GH-29_marked.bam \
-I 64_Flav_marked.bam \
-I 52-F5-BU-4_marked.bam \
-I 60-F6-BU-2_marked.bam \
-I 52-F3-GH-11_marked.bam \
-I 44-F2-BU-5_marked.bam \
-I 125A_Flav_marked.bam \
-I 68-F3-BU-7_marked.bam \
-I 28-F8-BU-1_marked.bam \
-I 65-F3-BU-4_marked.bam \
-I 60-F6-BU-4_marked.bam \
-I 65-F3-BU-3_marked.bam \
-I 45-F2-GH-17_marked.bam \
-I 101-F1-GH-6_marked.bam \
-I 64-F6-BU-1_marked.bam \
-I 29-F4-GH-13_marked.bam \
-I 126_Flav_marked.bam \
-I 44-F2-BU-6_marked.bam \
-I 45-F2-GH-22_marked.bam \
-I 69-F5-GH-25_marked.bam \
-I 60-F7-BU-4_marked.bam \
-I 125B_Flav_marked.bam \
-I 4-F7-BU-2_marked.bam \
-I 52-F3-GH-21_marked.bam \
-I 44-F2-BU-4_marked.bam \
-I 4-F5-GH-3_marked.bam \
-I 44_Appo_marked.bam \
-I 3-F8-BU-3_marked.bam \
-I 60_Flav_marked.bam \
-I 125B-F11-GH-19_marked.bam \
-I 45-F2-GH-7_marked.bam \
-I 64-F6-BU-3_marked.bam \
-I 104-F3-BU-1_marked.bam \
-I 103_Appo_marked.bam \
-I 71-F2-GH-28_marked.bam \
-I 70A-F1-GH-2_marked.bam \
-I 125B-F12-BU-1_marked.bam \
-I 44-F2-BU-3_marked.bam \
-I 28-F8-BU-3_marked.bam \
-I 102-F2-GH-16_marked.bam \
-I 66-69_NF1_marked.bam \
-I 109-F4-BU-1_marked.bam \
-I 125B-F10-GH-3_marked.bam \
-I 66B-F6-BU-3_marked.bam \
-I 70B-F2-GH-4_marked.bam \
-I 101-F1-GH-20_marked.bam \
-I 44-45_NF1_marked.bam \
-I 70A-F1-GH-5_marked.bam \
-I 125B-F11-GH-3_marked.bam \
-I 91-F2-BU-9_marked.bam \
-I 165_Control_marked.bam \
-I 44-F1-GH-17_marked.bam \
-I 66B-F6-BU-7_marked.bam \
-I 60-F6-BU-5_marked.bam \
-I 60-F6-BU-7_marked.bam \
-I 69-F7-BU-6_marked.bam \
-I 70A-F1-GH-36_marked.bam \
-I 92-F2-GH-15_marked.bam \
-I 66B-F6-BU-2_marked.bam \
-I 28-F8-BU-4_marked.bam \
-I 69-F6-BU-6_marked.bam \
-I 175_Control_marked.bam \
-I 125B-F11-GH-24_marked.bam \
-I 92-F1-GH-5_marked.bam \
-I 68-F3-BU-6_marked.bam \
-I 125B-F11-GH-20_marked.bam \
-I 104-F3-BU-7_marked.bam \
-I 45-F2-GH-2_marked.bam \
-I 188_Control_marked.bam \
-I 69-F7-BU-5_marked.bam \
-I 70A-F1-GH-7_marked.bam \
-I 3-F6-GH-4_marked.bam \
-I 109-F3-BU-3_marked.bam \
-I 44-F1-GH-30_marked.bam \
-I 104-F2-GH-8_marked.bam \
-I 92-F1-GH-3_marked.bam \
-I 125B-F11-GH-8_marked.bam \
-I 60-F7-BU-9_marked.bam \
-I 183_Control_marked.bam \
-I 1-F7-BU-10_marked.bam \
-I 52-F5-BU-1_marked.bam \
-I 108-109_NF1_marked.bam \
-I 187_Control_marked.bam \
-I 44-F1-GH-18_marked.bam \
-I 109-F4-BU-4_marked.bam \
-I 66B-F6-BU-6_marked.bam \
-I 91-92_NF1_marked.bam \
-I 125-126_NF1_marked.bam \
-I 104-F3-BU-6_marked.bam \
-I 44-F1-GH-21_marked.bam \
-I 44-F2-GH-7_marked.bam \
-I 109-F3-BU-1_marked.bam \
-I 84_Appo_marked.bam \
-I 125B-F12-BU-2_marked.bam \
-I 3-F6-GH-12_marked.bam \
-I 60-F6-BU-6_marked.bam \
-I 91-F1-GH-14_marked.bam \
-I 109-F3-BU-7_marked.bam \
-I 52-F3-GH-15_marked.bam \
-I 92-F1-GH-2_marked.bam \
-I 168_Control_marked.bam \
-I 127_Flav_marked.bam \
-I 28-29_NF1_marked.bam \
-I 29-F4-GH-1_marked.bam \
-I 70B-F2-GH-1_marked.bam \
-I 109-F3-BU-2_marked.bam \
-I 104-F3-BU-8_marked.bam \
-I 125B-F13-BU-8_marked.bam \
-I 45-F2-GH-33_marked.bam \
-I 63-65_NF1_marked.bam \
-I 162_Control_marked.bam \
-I 44-F1-GH-29_marked.bam \
-I 91-F2-BU-5_marked.bam \
-I 172_Control_marked.bam \
-I 70B-F2-GH-23_marked.bam \
-I 60_NF1_marked.bam \
-I 60-F7-BU-6_marked.bam \
-I 8_NF1_marked.bam \
-I 52-F5-BU-5_marked.bam \
-I 86_NF1_marked.bam \
-I 102_Appo_marked.bam \
-I 101-F3-BU-1_marked.bam \
-I 102-F2-GH-9_marked.bam \
-I 184_Control_marked.bam \
-I 104-F3-BU-3_marked.bam \
-I 100-104_NF1_marked.bam \
-I 60-F7-BU-8_marked.bam \
-I 101-F1-GH-23_marked.bam \
-I 109-F3-BU-4_marked.bam \
-I 70B-F3-GH-1_marked.bam \
-I 60-F5-GH-25_marked.bam \
-I 104-F2-GH-44_marked.bam \
-I 104-F3-BU-2_marked.bam \
-I 70A-F1-GH-1_marked.bam \
-I 28-F8-BU-2_marked.bam \
-I 101_Appo_marked.bam \
-I 52_NF1_marked.bam \
-I 125B-F13-BU-7_marked.bam \
-I 68-F3-BU-9_marked.bam \
-I 66B-F7-BU-1_marked.bam \
-I 45-F2-GH-1_marked.bam \
-I 65-F2-GH-27_marked.bam \
-I 2_3_141_142_NF1_marked.bam \
-I 101-F1-GH-13_marked.bam \
-I 70B-F2-GH-19_marked.bam \
-I 91-F1-GH-13_marked.bam \
-I 70A-F1-GH-4_marked.bam \
-I 181_Control_marked.bam \
-I 92-F1-GH-8_marked.bam \
-I 34_NF1_marked.bam \
-I 102-F2-GH-17_marked.bam \
-I 3-F8-GH-4_marked.bam \
-I 7_Flav_marked.bam \
-I 126-F13-GH-11_marked.bam \
-I 101-F1-GH-10_marked.bam \
-I 69-F6-BU-5_marked.bam \
-I 60-F7-BU-5_marked.bam \
-I 186_Control_marked.bam \
-I 70B-F2-GH-24_marked.bam \
-I 28-F9-BU-1_marked.bam \
-I 173_Control_marked.bam \
-I 104-F2-GH-7_marked.bam \
-I 109-F4-BU-5_marked.bam \
-I 70B-F2-GH-33_marked.bam \
-I 52-F4-GH-17_marked.bam \
-I 64-F6-BU-5_marked.bam \
-I 3-F6-GH-13_marked.bam \
-I 69-F7-BU-7_marked.bam \
-I 170_Control_marked.bam \
-I 104-F3-BU-5_marked.bam \
-I 2-F2-GH-7_marked.bam \
-I 66B-F6-BU-4_marked.bam \
-I 69-F6-BU-7_marked.bam \
-I 52-F3-GH-8_marked.bam \
-I 92-F1-GH-7_marked.bam \
-I 70B-F2-GH-20_marked.bam \
-I 185_Control_marked.bam \
-I 91-F2-BU-6_marked.bam \
-I 125B-F11-GH-16_marked.bam \
-I 91-F2-BU-7_marked.bam \
-I 52-F5-BU-6_marked.bam \
-I 125B-F13-BU-1_marked.bam \
-I 66B-F6-BU-8_marked.bam \
-I 109-F3-BU-5_marked.bam \
-I 83-84_NF1_marked.bam \
-I 102-F2-GH-20_marked.bam \
-I 100_Appo_marked.bam \
-I 70-71_NF1_marked.bam \
-I 92-F1-GH-4_marked.bam \
-I 92-F1-GH-6_marked.bam \
-I 125B-F13-BU-6_marked.bam \
-I 125B-F13-BU-5_marked.bam \
-I 34_Appo_Flav_marked.bam \
-I 52-F5-BU-7_marked.bam \
-I 70A-F1-GH-17_marked.bam \
-o realignertargetcreator_d4000_B_x_h500_m3_INDEL.intervals
```
:::
```
#make new directory for re-aligned files
mkdir indel_realigned_d8000_B_x_h500_m3
```
#### Realign reads using IndelRealigner
```
#!/bin/bash
#SBATCH --account=p31939 ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=normal ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time=12:00:00 ## Required: How long will the job need to run (remember different partitions have restrictions on this parameter)
#SBATCH --nodes=1 ## how many computers/nodes do you need (no default)
#SBATCH --ntasks-per-node=1 ## how many cpus or processors do you need on per computer/node (default value 1)
#SBATCH --mem-per-cpu=10G ## how much RAM do you need per computer/node (this affects your FairShare score so be careful to not ask for more than you need)
#SBATCH --mail-type=ALL ## BEGIN, END, FAIL or ALL
#SBATCH --mail-user=brendanconnolly2023@u.northwestern.edu
#SBATCH --error=indel_realigned_d8000_B_x_h500_m3.error # Enter a error log file name
#SBATCH --output=indel_realigned_d8000_B_x_h500_m3.out # Enter your out.log file name
eval "$(conda shell.bash hook)"
conda activate /home/bjc8165/.conda/envs/GATk
INDS=($(for i in ./*marked.bam; do echo $(basename ${i%_m*}); done))
for IND in ${INDS[@]};
do
# declare variables
INP=./${IND}_marked.bam
OUTPUT=./indel_realigned_d8000_B_x_h500_m3/${IND}_realigned.bam
echo "Realigning $IND with gatk3"
gatk3 -Xmx8G -T IndelRealigner \
-R micro_sequences.fa \
-targetIntervals realignertargetcreator_d8000_B_x_h500_m3.intervals \
-known calls_-d8000_-B_-x_-h500_-m3_INDEL.vcf \
-I $INP \
-o $OUTPUT
done
```
The following files were truncated during realignment. Not sure how to fix; IndelRealigner has options like -maxReadsInMemory and -Xmx, but I can't get it to run with -maxReadsInMemory.
```
4-F7-BU-1_*
52-F5-BU-3_*
60-F7-BU-1_*
60-F7-BU-2_*
68-F3-BU-2_*
68-F3-BU-5_*
69-F7-BU-3_*
69-F7-BU-4_*
```
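To spot truncated BAMs without waiting for a downstream crash, samtools quickcheck can be run over the realigned files (a sketch):
```
#Prints the names of any BAMs failing basic integrity checks (e.g. a missing EOF block)
samtools quickcheck -v ./indel_realigned_d8000_B_x_h500_m3/*_realigned.bam
```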
## 3.) Variant Calling - RepeatSeq
Package: https://github.com/adaptivegenome/repeatseq
Citation: Highnam, G., Franck, C., Martin, A., Stephens, C., Puthige, A., and Mittelman, D. (2012). Accurate human microsatellite genotypes from high-throughput resequencing data using informed error profiles. Nucleic Acids Res. 41:e32. doi: 10.1093/nar/gks981. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3592458/pdf/gks981.pdf
### Installing RepeatSeq
```
#Creating conda environments
conda create -n repeatseq
conda activate repeatseq
conda install git
conda install conda-forge::cmake
#Clone repeatseq in chosen directory
git clone https://github.com/adaptivegenome/repeatseq
cd repeatseq
git clone https://github.com/pezmaster31/bamtools.git
cd bamtools/
git checkout 9cfa70b
cd ..
git clone https://github.com/ekg/fastahack.git
mkdir bamtools/build
cd bamtools/build/
cmake ..
make
#build repeatseq
cd ../..
make
```
### Running RepeatSeq
First we need to make a .regions file, which is tab-separated with the first column being "lociName:coordinates" and the second being "_RepeatMotif".
Example:
```
KX377703.1:86-107 _AG
KX377702.1:208-224 _CA
KJ733933.1:149-162 _TC
KJ733931.1:150-164 _CT
KJ733930.1:160-182 _AC
KT282132.1:45-61 _CA
KT282133.1:67-83 _GT
KT282134.1:41-57 _GT
KT282135.1:68-86 _AG
KT282136.1:47-63 _CT
KT282137.1:87-103 _AG
KT282138.1:78-95 _TC
KT282139.1:136-152 _AG
KT282140.1:250-269 _TG
KT282141.1:49-84 _GA
```
#### RepeatSeq Prep
```
#Set LD_LIBRARY_PATH to bamtools folder
export LD_LIBRARY_PATH="/projects/p31939/DENU_Amplicon/micro_analysis/test_RG/bowtie_local/repeatseq/bamtools/lib"
#Copy Fasta File to folder with re-aligned BAMs
cp micro_sequences.* ./indel_realigned
#Change index file names
INDS=($(for i in ./*realigned.bai; do echo $(basename ${i%_r*}); done))
for IND in ${INDS[@]};
do
# declare variables
INP=./${IND}_realigned.bai
OUTPUT=${IND}_realigned.bam.bai
mv $INP $OUTPUT
done
```
#### Run repeatseq
This script requires reads to have **EXACT matches for the 8bp before and after the repeat sequence**:
```
#!/bin/bash
#SBATCH --account=p31939 ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=short ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time=4:00:00 ## Required: How long will the job need to run (remember different partitions have restrictions on this parameter)
#SBATCH --nodes=1 ## how many computers/nodes do you need (no default)
#SBATCH --ntasks-per-node=1 ## how many cpus or processors do you need on per computer/node (default value 1)
#SBATCH --mem-per-cpu=15G ## how much RAM do you need per computer/node (this affects your FairShare score so be careful to not ask for more than you need)
#SBATCH --mail-type=ALL ## BEGIN, END, FAIL or ALL
#SBATCH --mail-user=brendanconnolly2023@u.northwestern.edu
#SBATCH --error=repeatseq_-1flanking_8before_8after.error # Enter a error log file name
#SBATCH --output=repeatseq_-1flanking_8before_8after.out # Enter your out.log file name
export LD_LIBRARY_PATH="/projects/p31939/DENU_Amplicon/micro_analysis/test_RG/bowtie_local/repeatseq/bamtools/lib"
INDS=($(for i in ./*realigned.bam; do echo $(basename ${i%_r*}); done))
for IND in ${INDS[@]};
do
# declare variables
INP=./${IND}_realigned.bam
echo "RepeatSeq on $IND"
/projects/p31939/DENU_Amplicon/micro_analysis/test_RG/bowtie_local/repeatseq/repeatseq -L 8 -R 8 -calls -repeatseq $INP micro_sequences.fa exact.regions
done
#reset LD_LIBRARY_PATH when done
unset LD_LIBRARY_PATH
```
## 4.) Variant Filtering
The .repeatseq file contains all reads that matched a repeat motif, but we need to do some manual filtering to turn this into usable data.
### Filtering we want to do (Pimentel et. al. 2018)
1.) Eliminate everything but the 2 most common alleles per locus per sample
2.) Remove alleles with < 10 reads
3.) Genotype samples: if the minor allele is less than 20% of reads, the sample is homozygous
4.) Remove loci and samples with > 50% missing data (a sketch for the sample side follows step 4e below)
### 4a.) Create file summarizing alleles and their coverage
:::spoiler extract_allele_coverage_pt1.sh script
```
#!/bin/bash
# Check if at least one file name is provided as an argument
if [ $# -eq 0 ]; then
echo "Usage: $0 <repeatseq_file1> [repeatseq_file2 ...]"
exit 1
fi
# Function to process a single file
process_file() {
local repeatseq_file=$1
local output_file="${repeatseq_file%.repeatseq}_allele_coverage_summary.txt"
# Initialize or clear the output file
> "$output_file"
# Function to extract allele information and coverage from the header line
extract_allele_coverage() {
local header_line="$1"
local start_line="$2"
# Extract loci name (first character string after ~)
local loci_name=$(echo "$header_line" | awk '{print $1}' | sed 's/~//')
# Extract genotype (GT field)
local genotype_info=$(echo "$header_line" | grep -oP 'GT:\K[^ ]+')
# Extract allele information (A field)
local allele_info=$(echo "$header_line" | grep -oP 'A:\K[^C]+')
# Print loci name, genotype, and allele information
echo "Loci: $loci_name" >> "$output_file"
echo "Genotype: $genotype_info" >> "$output_file"
# Process each allele and its coverage
echo "$allele_info" | tr ' ' '\n' | while IFS= read -r allele; do
# Extract allele and coverage from the format "allele[coverage]"
local allele_name=$(echo "$allele" | grep -oP '^[0-9]+')
local coverage=$(echo "$allele" | grep -oP '\[\K[0-9]+(?=\])')
if [ -z "$coverage" ]; then
# Coverage value is missing, calculate it
coverage=$(awk -v start="$start_line" 'NR > start && /^~/{exit} NR > start {count++} END{print count-1}' "$repeatseq_file")
fi
if [ -n "$allele_name" ]; then
echo "Allele: $allele_name, Coverage: $coverage" >> "$output_file"
fi
done
echo "" >> "$output_file" # Add a blank line for separation
}
# Track line number and process each record in the file
local line_number=0
while IFS= read -r line; do
((line_number++))
if [[ "$line" =~ ^~ ]]; then
# Process header line
header_line="${line#~}"
extract_allele_coverage "$header_line" "$line_number"
elif [[ "$line" =~ ^T|^CCG|^GTG ]]; then
# Process read lines
continue
else
# Skip non-header and non-read lines (e.g., reference sequence)
continue
fi
done < "$repeatseq_file"
echo "Allele coverage has been extracted to $output_file"
}
# Loop through each provided file and process it
for repeatseq_file in "$@"; do
if [ ! -f "$repeatseq_file" ]; then
echo "File not found: $repeatseq_file"
continue
fi
process_file "$repeatseq_file"
done
```
:::
```
#Make script executable
chmod +x ./extract_allele_coverage_pt1.sh
#Execute it on all files in directory
./extract_allele_coverage_pt1.sh *.repeatseq
```
### 4b.) Keep only the two most common alleles per locus per sample
:::spoiler filter_2alleles_pt2.sh
```
#!/bin/bash
# Function to process the file and filter alleles
process_file() {
local input_file="$1"
local base_name=$(basename "$input_file" "_realigned.bam.L8.R8_allele_coverage_summary.txt")
local output_file="${base_name}.filtered_2alleles.txt"
# Initialize the output file
> "$output_file"
local in_loci_section=0
local allele_count=0
# Read the input file line by line
while IFS= read -r line; do
if [[ $line == Loci:* ]]; then
# If in a loci section, reset the allele count
if [[ $in_loci_section -eq 1 ]]; then
true
fi
# Start a new loci section
in_loci_section=1
allele_count=0
echo "$line" >> "$output_file"
elif [[ $line == Genotype:* ]]; then
# Write Genotype line to the output file
echo "$line" >> "$output_file"
elif [[ $line == Allele:* ]]; then
if [[ $in_loci_section -eq 1 ]]; then
# Only write the first two alleles
if [[ $allele_count -lt 2 ]]; then
echo "$line" >> "$output_file"
fi
((allele_count++))
fi
else
# Write other lines (if any) to the output file
echo "$line" >> "$output_file"
fi
done < "$input_file"
echo "Filtered file created: $output_file"
}
# Check if at least one argument is provided
if [ "$#" -lt 1 ]; then
echo "Usage: $0 pattern.txt"
exit 1
fi
# Process each file matching the pattern
for file in "$@"; do
for f in $file; do
if [ -f "$f" ]; then
process_file "$f"
else
echo "File $f not found!"
fi
done
done
```
:::
```
#Make script executable
chmod +x ./filter_2alleles_pt2.sh
#Execute it on all files in directory
./filter_2alleles_pt2.sh *_allele_coverage_summary.txt
```
### 4c.) Remove alleles with coverage less than 10
:::spoiler filter_alleles_10count_pt3.sh
```
#!/bin/bash
# Function to filter alleles with coverage of 10 or more
filter_coverage() {
local input_file="$1"
local base_name=$(basename "$input_file" ".filtered_2alleles.txt")
local output_file="${base_name}.filtered_2alleles_10count.txt"
# Initialize the output file
> "$output_file"
local in_loci_section=0
local allele_count=0
# Read the input file line by line
while IFS= read -r line; do
if [[ $line == Loci:* ]]; then
# Start a new loci section
in_loci_section=1
allele_count=0
echo "$line" >> "$output_file"
elif [[ $line == Genotype:* ]]; then
# Write Genotype line to the output file
echo "$line" >> "$output_file"
elif [[ $line == Allele:* ]]; then
if [[ $in_loci_section -eq 1 ]]; then
# Extract Coverage value
local coverage=$(echo "$line" | awk -F"Coverage: " '{print $2}' | awk '{print $1}')
# Only write alleles with coverage of 10 or more
if [ "$coverage" -ge 10 ]; then
echo "$line" >> "$output_file"
fi
fi
else
# Write other lines (if any) to the output file
echo "$line" >> "$output_file"
fi
done < "$input_file"
echo "Coverage filtered file created: $output_file"
}
# Check if at least one argument is provided
if [ "$#" -lt 1 ]; then
echo "Usage: $0 pattern.txt"
exit 1
fi
# Process each file matching the pattern
for file in "$@"; do
for f in $file; do
if [ -f "$f" ]; then
filter_coverage "$f"
else
echo "File $f not found!"
fi
done
done
```
:::
```
#Make script executable
chmod +x ./filter_alleles_10count_pt3.sh
#Execute it on all files in directory
./filter_alleles_10count_pt3.sh *.filtered_2alleles.txt
```
### 4d.) Remove alleles that have <20% of total coverage - i.e. mark as homozygous
:::spoiler filter_alleles_20percent_pt4.sh
```
#!/bin/bash
# Iterate over each *.filtered_2alleles_10count.txt file in the directory
for input_file in *.filtered_2alleles_10count.txt; do
# Define a temporary output file
output_file="${input_file%.txt}_20%.txt"
# Initialize variables
current_loci=""
total_coverage=0
coverage_threshold=0
allele_lines=()
# Process the input file
while IFS= read -r line; do
if [[ "$line" =~ ^Loci: ]]; then
# When encountering a new locus, process the previous locus if it exists
if [[ -n "$current_loci" ]]; then
# Write the locus and genotype to the output file
echo "$current_loci" >> "$output_file"
echo "$genotype_line" >> "$output_file"
# Filter and write allele lines based on the coverage threshold
for allele_line in "${allele_lines[@]}"; do
coverage=$(echo "$allele_line" | awk -F', Coverage: ' '{print $2}')
if [[ "$coverage" -ge "$coverage_threshold" ]]; then
echo "$allele_line" >> "$output_file"
fi
done
# Reset variables for the next locus
total_coverage=0
coverage_threshold=0
allele_lines=()
fi
# Update current locus
current_loci=$(echo "$line" | sed 's/Loci: //')
genotype_line=""
elif [[ "$line" =~ ^Genotype: ]]; then
# Store genotype line
genotype_line="$line"
elif [[ "$line" =~ ^Allele: ]]; then
# Process allele line to extract coverage
allele_lines+=("$line")
coverage=$(echo "$line" | awk -F', Coverage: ' '{print $2}')
total_coverage=$((total_coverage + coverage))
elif [[ -z "$line" ]]; then
# Handle empty lines (end of a locus block)
if [[ -n "$current_loci" ]]; then
# Calculate coverage threshold (20%)
coverage_threshold=$((total_coverage * 20 / 100))
# Write the locus and genotype to the output file
echo "$current_loci" >> "$output_file"
echo "$genotype_line" >> "$output_file"
# Filter and write allele lines based on the coverage threshold
for allele_line in "${allele_lines[@]}"; do
coverage=$(echo "$allele_line" | awk -F', Coverage: ' '{print $2}')
if [[ "$coverage" -ge "$coverage_threshold" ]]; then
echo "$allele_line" >> "$output_file"
fi
done
# Reset variables for the next locus
current_loci=""
total_coverage=0
coverage_threshold=0
allele_lines=()
fi
fi
done < "$input_file"
# Handle the last locus in the file if not followed by an empty line
if [[ -n "$current_loci" ]]; then
coverage_threshold=$((total_coverage * 20 / 100))
echo "$current_loci" >> "$output_file"
echo "$genotype_line" >> "$output_file"
for allele_line in "${allele_lines[@]}"; do
coverage=$(echo "$allele_line" | awk -F', Coverage: ' '{print $2}')
if [[ "$coverage" -ge "$coverage_threshold" ]]; then
echo "$allele_line" >> "$output_file"
fi
done
fi
done
```
:::
```
#Make script executable
chmod +x ./filter_alleles_20percent_pt4.sh
#Execute it on all files in directory
./filter_alleles_20percent_pt4.sh
```
Note that the 20% cut-off uses bash integer arithmetic, so it rounds down: a locus with total coverage 97 gets a threshold of 19 (97 × 20 / 100, floored), and any allele whose coverage falls below that is dropped, leaving the locus effectively homozygous.
### 4e.) Change the final genotype
:::spoiler filter_alleles_changegeno_pt5.sh
```
#!/bin/bash
# First pass over each matching file; this produces a temporary file that still contains duplicate genotype rows
for file in *.filtered_2alleles_10count_20%.txt; do
# Create a new file to store the processed data
output_file="${file%.txt}_updated.txt"
# Initialize variables
prev_line=""
current_line=""
last_line=""
# Process each line of the input file
while IFS= read -r current_line; do
# Reset the numeric allele variables for the current line
first_numeric=""
second_numeric=""
# Check if the line is a loci row
if [[ $current_line != "Allele:"* && $current_line != "Genotype:"* ]]; then
# If the previous line is a genotype row, then print it as "NA:NA"
if [[ $prev_line == "Genotype:"* ]]; then
echo "Genotype: NA:NA" >> "$output_file"
echo "Loci: $current_line" >> "$output_file"
else #if previous row was an allele row, then just print the loci
echo "Loci: $current_line" >> "$output_file"
fi
# Check if current row is an allele row
elif [[ $current_line == "Allele:"* ]]; then
# Extract the alleles using grep
first_numeric=$(echo "$current_line" | grep -oE '[0-9]+' | head -n 1)
second_numeric=$(echo "$prev_line" | grep -oE '[0-9]+' | head -n 1)
# If the previous row was also an allele row, the locus is heterozygous
if [[ $prev_line == "Allele:"* ]]; then
echo "Genotype: $second_numeric:$first_numeric" >> "$output_file"
else
echo "Genotype: $first_numeric:$first_numeric" >> "$output_file"
fi
fi
# Update the previous line to the current line for the next iteration
prev_line="$current_line"
done < "$file"
#Define last_line as the last line of the file
last_line=$(tail -n 1 "$file")
#Print last line of input file if it starts with Genotype: (i.e. missing allele)
if [[ $last_line == "Genotype:"* ]]; then
echo "Genotype: NA:NA" >> "$output_file"
fi
echo "Updated file saved as $output_file"
done
#Second pass: clean up the temporary file, keeping only the final genotype row for each locus
for file in *_updated.txt; do
# Create a new file to store the processed data
base_name=$(basename "$file" ".filtered_2alleles_10count_20%_updated.txt")
output_file2="${base_name}_finalgeno.txt"
# Initialize variables
prev_line=""
current_line=""
# Process each line of the input file
while IFS= read -r current_line; do
#Check if line is a loci line
if [[ $current_line != "Allele:"* && $current_line != "Genotype:"* ]]; then
echo "$prev_line" >> "$output_file2"
echo "$current_line" >> "$output_file2"
fi
# Update the previous line to the current line for the next iteration
prev_line="$current_line"
done < "$file"
# Print the last line of file into output
tail -n 1 "$file" >> "$output_file2"
echo "Updated file saved as $output_file2"
done
#Remove the temporary *_updated.txt files
rm *updated.txt
```
:::
```
#Make script executable
chmod +x ./filter_alleles_changegeno_pt5.sh
#Execute it; the script itself globs for *.filtered_2alleles_10count_20%.txt
./filter_alleles_changegeno_pt5.sh
```
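To make the genotype rewrite concrete, here is a hypothetical locus block (invented values) and what pt5 turns it into:
```
# Input block (output of 4d):
Loci: locus_0042
Genotype: 0/1
Allele: 210, Coverage: 55
Allele: 214, Coverage: 40

# Output in *_finalgeno.txt:
Loci: locus_0042
Genotype: 210:214
```
A locus with a single surviving allele becomes `Genotype: 210:210`, and a locus with no surviving alleles becomes `Genotype: NA:NA`.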
### 4f.) Output the final .csv
:::spoiler create_csv.sh
```
#!/bin/bash
# Define output CSV file
output_csv="combined_genotypes.csv"
# Temporary files
temp_loci_file=$(mktemp)
temp_genotypes_file=$(mktemp)
# Initialize the output CSV file
echo "" > "$output_csv"
# Collect all unique loci
for txt_file in *_finalgeno.txt; do
awk '
BEGIN { FS=": "; }
$1 == "Loci" {
loci = $2;
if (loci !~ /^[[:space:]]*$/) {
loci_list[loci] = "";
}
}
END {
for (loci in loci_list) {
print loci;
}
}
' "$txt_file" >> "$temp_loci_file"
done
# Create unique list of loci
sort -u "$temp_loci_file" > "$temp_loci_file.sorted"
loci_list=$(cat "$temp_loci_file.sorted")
# Add loci as headers to the output CSV
echo -n "Sample," > "$output_csv"
echo "$loci_list" | tr '\n' ',' >> "$output_csv"
echo "" >> "$output_csv"
# Debugging: Print the list of loci headers
echo "Loci headers: $(cat "$temp_loci_file.sorted")" >&2
# Populate the CSV with genotypes
for txt_file in *_finalgeno.txt; do
sample_name=$(basename "$txt_file" _finalgeno.txt)
# Initialize an array for genotypes
genotype_row=()
# Debugging: Print current sample being processed
echo "Processing sample: $sample_name" >&2
# Loop through each loci to find genotypes
while IFS=, read -r loci; do
genotype=$(awk -F, -v loci="$loci" '
BEGIN { FS=": "; }
$1 == "Loci" {
current_loci = $2;
}
$1 == "Genotype" && current_loci == loci {
print $2;
exit;
}
END {
# print "NA";
}
' "$txt_file")
# Debugging: Print the locus and genotype found
echo "Locus: $loci, Genotype: $genotype" >&2
genotype_row+=("$genotype")
done < "$temp_loci_file.sorted"
# Debugging: Print the genotype row for the current sample
echo "Genotype row for $sample_name: ${genotype_row[@]}" >&2
# Append the sample row to the CSV
echo -n "$sample_name," >> "$output_csv"
echo "${genotype_row[@]}" | tr ' ' ',' >> "$output_csv"
done
# Clean up temporary files
rm -f "$temp_loci_file" "$temp_loci_file.sorted"
```
:::
```
#Make script executable
chmod +x ./create_csv.sh
#Execute it on all files in directory
./create_csv.sh
```
Download the .csv file and analyze it in R.
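Before pulling the file into R, a quick sanity check on the cluster can catch misaligned columns. A minimal sketch, assuming the combined_genotypes.csv produced above:
```
#Count header fields (Sample + one per locus)
head -n 1 combined_genotypes.csv | tr ',' '\n' | sed '/^$/d' | wc -l
#Count sample rows
tail -n +2 combined_genotypes.csv | wc -l
```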
## Notes 8/18
### Things to look into:
1.) bcftools mpileup -d 8000: What should -d be set to? Should I lower it? What about other filters? Resource: https://www.biostars.org/p/9466154/
2.) bcftools mpileup: Do I want to specify a search region (i.e. before the repeat motif) or leave this blank? Maybe try it on a large subset of samples.
3.) bcftools mpileup: Should I just search for indels here? Yes! Want to filter the .VCF for just indels (see the sketch after this list).
4.) RealignerTargetCreator: Do I want to run this with all files? Gut says yes.
5.) IndelRealigner: some loci have too many reads. How to deal with that?
6.) Repeatseq .regions file: Include more microsats? Mess with the motifs.
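A minimal sketch of what items 1–3 could look like on one sample; ref.fa and sample_realigned.bam are placeholder names, and the -d value and region are exactly the things still to be decided:
```
#Pile up and call with an explicit max depth; add -r CHROM:FROM-TO to restrict the search region
bcftools mpileup -d 8000 -f ref.fa sample_realigned.bam | \
bcftools call -mv -Ov -o sample.vcf
#Keep only indels (item 3)
bcftools view -v indels sample.vcf -o sample.indels.vcf
```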
## PopGen Analysis in R
### R Tutorials
https://tomjenkins.netlify.app/tutorials/r-popgen-getting-started/
https://julie-hopper.com/2018/12/08/population-genetics-part-iii-data-analysis-data-wrangling-interpretation-and-implications/
:::spoiler IGNORE: How to merge vcfs
### Preparing VCF for R import
First we need to merge all the VCF files (Repeatseq produces one per individual) into a single VCF.
#### Installing vcftools
```
conda create -n vcftools
conda activate vcftools
conda install bioconda::vcftools
conda install bioconda::samtools
conda install bioconda::tabix
```
#### Merging VCF
Merging requires bgzipped and tabix-indexed VCF files as input (e.g. `bgzip file.vcf; tabix -p vcf file.vcf.gz`).
##### Compressing
```
conda activate bowtie2
INDS=($(for i in ./*realigned.bam.vcf; do echo $(basename ${i%_r*}); done))
for IND in ${INDS[@]};
do
# declare variables
INP=./${IND}_realigned.bam.vcf
OUT=./${IND}.vcf.gz
echo "Compressing $IND"
bgzip -c $INP > $OUT ; tabix -p vcf $OUT
done
```
##### Putting sample name in vcf
```
conda activate bowtie2
INDS=($(for i in ./*vcf.gz; do echo $(basename ${i%.v*}); done))
for IND in ${INDS[@]};
do
# declare variables
INP=./${IND}.vcf.gz
OUT=./${IND}.renamed.vcf.gz
echo "renaming sample $IND"
echo $IND > ${IND}.txt
bcftools reheader -s ${IND}.txt $INP -o $OUT
rm ${IND}.txt
done
```
##### Indexing
```
INDS=($(for i in ./*renamed.vcf.gz; do echo $(basename ${i%.r*}); done))
for IND in ${INDS[@]};
do
# declare variables
INP=./${IND}.renamed.vcf.gz
echo "indexing $IND"
tabix -p vcf $INP
done
```
#### Merging
Don't include empty VCFs in the following command. Use `ls -lhS` to spot empty VCFs by their small file size.
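A more programmatic alternative (a sketch, not what was actually run): skip VCFs with zero records and hand bcftools merge a file list instead of typing every name as below:
```
#Collect only VCFs that actually contain variant records
> merge_list.txt
for f in *.renamed.vcf.gz; do
if [ "$(bcftools view -H "$f" | wc -l)" -gt 0 ]; then
echo "$f" >> merge_list.txt
else
echo "Skipping empty $f" >&2
fi
done
#Merge everything in the list
bcftools merge -l merge_list.txt -o merged.vcf
```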
```
bcftools merge 109-F4-BU-4.renamed.vcf.gz \
66B-F6-BU-6.renamed.vcf.gz \
68-F3-BU-6.renamed.vcf.gz \
69-F7-BU-5.renamed.vcf.gz \
91-F1-GH-14.renamed.vcf.gz \
109-F3-BU-3.renamed.vcf.gz \
70A-F1-GH-7.renamed.vcf.gz \
44-F1-GH-30.renamed.vcf.gz \
44-F1-GH-18.renamed.vcf.gz \
84_Appo.renamed.vcf.gz \
104-F3-BU-7.renamed.vcf.gz \
125B-F12-BU-1.renamed.vcf.gz \
29-F4-GH-1.renamed.vcf.gz \
28-F8-BU-1.renamed.vcf.gz \
109-F3-BU-1.renamed.vcf.gz \
104-F2-GH-8.renamed.vcf.gz \
65-F3-BU-4.renamed.vcf.gz \
125-126_NF1.renamed.vcf.gz \
63-65_NF1.renamed.vcf.gz \
125B-F12-BU-2.renamed.vcf.gz \
125B_Flav.renamed.vcf.gz \
44-F2-BU-4.renamed.vcf.gz \
69-F6-BU-6.renamed.vcf.gz \
125B-F11-GH-24.renamed.vcf.gz \
91-92_NF1.renamed.vcf.gz \
92-F1-GH-2.renamed.vcf.gz \
1-F7-BU-10.renamed.vcf.gz \
45-F2-GH-22.renamed.vcf.gz \
70A-F1-GH-36.renamed.vcf.gz \
69-F5-GH-25.renamed.vcf.gz \
92-F1-GH-5.renamed.vcf.gz \
104-F2-GH-44.renamed.vcf.gz \
28-29_NF1.renamed.vcf.gz \
125B-F13-BU-3.renamed.vcf.gz \
29-F4-GH-13.renamed.vcf.gz \
44-F2-GH-7.renamed.vcf.gz \
66B-F6-BU-7.renamed.vcf.gz \
44-F1-GH-21.renamed.vcf.gz \
60-F7-BU-9.renamed.vcf.gz \
70B-F2-GH-1.renamed.vcf.gz \
60-F6-BU-6.renamed.vcf.gz \
28-F8-BU-4.renamed.vcf.gz \
60-F6-BU-5.renamed.vcf.gz \
64-F6-BU-3.renamed.vcf.gz \
104-F3-BU-6.renamed.vcf.gz \
101-F1-GH-10.renamed.vcf.gz \
52-F3-GH-15.renamed.vcf.gz \
108-109_NF1.renamed.vcf.gz \
92-F1-GH-3.renamed.vcf.gz \
109-F3-BU-2.renamed.vcf.gz \
188_Control.renamed.vcf.gz \
64-F6-BU-1.renamed.vcf.gz \
3-F6-GH-12.renamed.vcf.gz \
44-F2-BU-5.renamed.vcf.gz \
60-F5-GH-25.renamed.vcf.gz \
127_Flav.renamed.vcf.gz \
71-F2-GH-28.renamed.vcf.gz \
7_Flav.renamed.vcf.gz \
168_Control.renamed.vcf.gz \
44-F2-BU-3.renamed.vcf.gz \
109-F3-BU-4.renamed.vcf.gz \
70A-F1-GH-1.renamed.vcf.gz \
172_Control.renamed.vcf.gz \
60-F6-BU-7.renamed.vcf.gz \
69-F7-BU-6.renamed.vcf.gz \
45-F2-GH-17.renamed.vcf.gz \
109-F3-BU-7.renamed.vcf.gz \
52-F5-BU-4.renamed.vcf.gz \
3-F8-GH-4.renamed.vcf.gz \
66B-F6-BU-3.renamed.vcf.gz \
100-104_NF1.renamed.vcf.gz \
45-F2-GH-29.renamed.vcf.gz \
45-F2-GH-7.renamed.vcf.gz \
44-F1-GH-17.renamed.vcf.gz \
104-F3-BU-3.renamed.vcf.gz \
125B-F11-GH-8.renamed.vcf.gz \
125B-F13-BU-1.renamed.vcf.gz \
92-F1-GH-8.renamed.vcf.gz \
66B-F6-BU-2.renamed.vcf.gz \
187_Control.renamed.vcf.gz \
186_Control.renamed.vcf.gz \
102-F2-GH-20.renamed.vcf.gz \
52-F3-GH-8.renamed.vcf.gz \
60-F7-BU-8.renamed.vcf.gz \
68-F3-BU-9.renamed.vcf.gz \
125B-F13-BU-7.renamed.vcf.gz \
175_Control.renamed.vcf.gz \
101-F1-GH-13.renamed.vcf.gz \
125B-F11-GH-20.renamed.vcf.gz \
126-F13-GH-11.renamed.vcf.gz \
183_Control.renamed.vcf.gz \
101-F1-GH-23.renamed.vcf.gz \
70B-F2-GH-23.renamed.vcf.gz \
2_3_141_142_NF1.renamed.vcf.gz \
65-F2-GH-27.renamed.vcf.gz \
104-F3-BU-8.renamed.vcf.gz \
28-F8-BU-3.renamed.vcf.gz \
91-F2-BU-5.renamed.vcf.gz \
70B-F2-GH-33.renamed.vcf.gz \
102_Appo.renamed.vcf.gz \
3-F6-GH-13.renamed.vcf.gz \
34_Appo_Flav.renamed.vcf.gz \
102-F2-GH-17.renamed.vcf.gz \
162_Control.renamed.vcf.gz \
181_Control.renamed.vcf.gz \
45-F2-GH-1.renamed.vcf.gz \
52-F5-BU-5.renamed.vcf.gz \
3-F6-GH-4.renamed.vcf.gz \
66B-F7-BU-1.renamed.vcf.gz \
109-F3-BU-5.renamed.vcf.gz \
125B-F13-BU-8.renamed.vcf.gz \
173_Control.renamed.vcf.gz \
34_NF1.renamed.vcf.gz \
66-69_NF1.renamed.vcf.gz \
66B-F6-BU-8.renamed.vcf.gz \
44-F1-GH-29.renamed.vcf.gz \
52-F4-GH-17.renamed.vcf.gz \
126_Flav.renamed.vcf.gz \
70B-F3-GH-1.renamed.vcf.gz \
2-F2-GH-7.renamed.vcf.gz \
91-F2-BU-9.renamed.vcf.gz \
71-F2-GH-25.renamed.vcf.gz \
91-F1-GH-19.renamed.vcf.gz \
101-F3-BU-1.renamed.vcf.gz \
170_Control.renamed.vcf.gz \
4-F7-BU-2.renamed.vcf.gz \
60-F7-BU-5.renamed.vcf.gz \
65-F3-BU-3.renamed.vcf.gz \
104-F2-GH-7.renamed.vcf.gz \
60_Flav.renamed.vcf.gz \
52-F3-GH-17.renamed.vcf.gz \
103_Appo.renamed.vcf.gz \
71-F2-GH-16.renamed.vcf.gz \
101-F1-GH-6.renamed.vcf.gz \
70B-F2-GH-4.renamed.vcf.gz \
104-F3-BU-1.renamed.vcf.gz \
104-F3-BU-5.renamed.vcf.gz \
52_NF1.renamed.vcf.gz \
44-F2-BU-6.renamed.vcf.gz \
70A-F1-GH-5.renamed.vcf.gz \
125B-F11-GH-16.renamed.vcf.gz \
165_Control.renamed.vcf.gz \
66_Flav.renamed.vcf.gz \
28-F10-BU-1.renamed.vcf.gz \
3-F6-GH-3.renamed.vcf.gz \
45-F2-GH-2.renamed.vcf.gz \
52-F3-GH-21.renamed.vcf.gz \
45-F2-GH-33.renamed.vcf.gz \
60_NF1.renamed.vcf.gz \
91-F2-BU-4.renamed.vcf.gz \
8_NF1.renamed.vcf.gz \
70-71_NF1.renamed.vcf.gz \
125B-F13-BU-6.renamed.vcf.gz \
109-F4-BU-3.renamed.vcf.gz \
69-F6-BU-4.renamed.vcf.gz \
92-F2-GH-15.renamed.vcf.gz \
102-F2-GH-9.renamed.vcf.gz \
44-F1-GH-35.renamed.vcf.gz \
64-F6-BU-5.renamed.vcf.gz \
109-F4-BU-1.renamed.vcf.gz \
185_Control.renamed.vcf.gz \
52-F5-BU-7.renamed.vcf.gz \
184_Control.renamed.vcf.gz \
125B-F10-GH-3.renamed.vcf.gz \
100_Appo.renamed.vcf.gz \
125B-F13-BU-5.renamed.vcf.gz \
52-F5-BU-6.renamed.vcf.gz \
70A-F1-GH-4.renamed.vcf.gz \
101_Appo.renamed.vcf.gz \
28-F8-BU-2.renamed.vcf.gz \
91-F2-BU-6.renamed.vcf.gz \
92-F1-GH-6.renamed.vcf.gz \
102-F2-GH-16.renamed.vcf.gz \
101-F1-GH-20.renamed.vcf.gz \
70B-F2-GH-20.renamed.vcf.gz \
91-F2-BU-7.renamed.vcf.gz \
69-F6-BU-7.renamed.vcf.gz \
70B-F2-GH-24.renamed.vcf.gz \
91_Appo.renamed.vcf.gz \
65-F3-BU-1.renamed.vcf.gz \
69-F7-BU-7.renamed.vcf.gz \
109-F4-BU-5.renamed.vcf.gz \
125B-F11-GH-3.renamed.vcf.gz \
3-F8-BU-3.renamed.vcf.gz \
60-F6-BU-4.renamed.vcf.gz \
63_Flav.renamed.vcf.gz \
91-F1-GH-13.renamed.vcf.gz \
92-F1-GH-4.renamed.vcf.gz \
102-F2-GH-7.renamed.vcf.gz \
28-F9-BU-1.renamed.vcf.gz \
70B-F2-GH-19.renamed.vcf.gz \
44-45_NF1.renamed.vcf.gz \
60-F7-BU-6.renamed.vcf.gz \
52-F5-BU-1.renamed.vcf.gz \
83-84_NF1.renamed.vcf.gz \
92-F1-GH-7.renamed.vcf.gz \
71_Appo.renamed.vcf.gz \
66B-F6-BU-4.renamed.vcf.gz \
70A-F1-GH-2.renamed.vcf.gz \
91-F2-BU-3.renamed.vcf.gz \
70A_Appo.renamed.vcf.gz \
34A-F1-GH-9.renamed.vcf.gz \
69-F6-BU-5.renamed.vcf.gz \
52-F3-GH-11.renamed.vcf.gz \
60-F6-BU-3.renamed.vcf.gz \
125B-F13-BU-2.renamed.vcf.gz \
52-F5-BU-2.renamed.vcf.gz \
67_Flav.renamed.vcf.gz \
-o merged.vcf
# If you want to compress it (bgzip replaces merged.vcf with merged.vcf.gz)
bgzip merged.vcf
```
:::
### R: VCF -> GENind -> GENpop
We can now download the merged VCF file, import it into R, and use a few packages to turn it into genind/genpop objects.
##### Load Packages
```
library(adegenet)
library(poppr)
library(dplyr)
library(hierfstat)
library(reshape2)
library(ggplot2)
library(RColorBrewer)
library(scales)
library(vcfR)
```
##### Turn into Genind
When importing, you may want to try `convertNA()`, which converts missing data from "." to NA. See the vcfR manual, p. 48: https://cran.r-project.org/web/packages/vcfR/vcfR.pdf
From the vcfR documentation: for users of poppr, if you wish to use vcfR2genind(), it is strongly recommended to use it with the option return.alleles = TRUE. This is because the poppr package accommodates mixed-ploidy data by interpreting "0" alleles in genind objects as NULL alleles in both poppr::poppr.amova() and poppr::locus_table().
```
#Import VCF
d8000_merged.vcf <- read.vcfR("merged.vcf", checkFile = TRUE,
                              check_keys = TRUE,
                              verbose = TRUE)
# Turn into genind
d8000_gid <- vcfR2genind(d8000_merged.vcf, sep = "[|/]", return.alleles = TRUE)
```
##### Assigning populations
Make a tab-delimited .txt file with the sample name in one column and the population in the other. Example:
```
109-F4-BU-4 Bombus appositus
66B-F6-BU-6 Bombus flavifrons
68-F3-BU-6 Bombus flavifrons
69-F7-BU-5 Bombus flavifrons
91-F1-GH-14 Bombus appositus
109-F3-BU-3 Bombus appositus
70A-F1-GH-7 Bombus appositus
44-F1-GH-30 Bombus appositus
44-F1-GH-18 Bombus appositus
84_Appo parent
```
```
#remove ".realigned.vcf.gz" from sample names in Excel
=LEFT(I2, LEN(I2) -15)
#To match pollinator to sample number
=INDEX(PollinatorRange,MATCH(Sample,SampleRange,0))
=INDEX(K$2:K$265,MATCH(Z1,J$2:J$265,0))
```