---
title: "Brendan's DENU pt. 2 - Amplicon Processing"
breaks: FALSE
tags: RADSeq
---
Author: Brendan Connolly (p31939)
---
# Amplicon analysis - SNPS
Steps:
1.) FastQC
2.) Illumina adapter trimming and filtering out reads <30 bases long (Trimmomatic)
3.) Alignment with BWA-MEM, mapping to the ddRAD FASTA file (see minimap2 for long-read alignment)
4.) Primer trimming (Primerclip or TrimPrimers)
5.) On-target and coverage uniformity calculation using Picard tools
6.) Variant calling of your choice (GATK HaplotypeCaller for germline calls; LoFreq, Mutect, or VarDict for somatic or low-frequency variant calls)
This results in a VCF file.
- Establish the premise that genetic distance between moms tracks relatedness; closer distance = more closely related
- How inbred are individuals?
- Delete mom's DNA from the data set, leaving just Dad's, so we can work out pollen diversity
Goals:
1.) Relatedness
2.) Pollen diversity
3.) Extent of homozygosity (Ho)
Conduct t-tests between pollinator groups.
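Step 1 (FastQC) isn't shown elsewhere on this page; a minimal sketch, assuming FastQC is available in a conda environment (e.g. `conda install bioconda::fastqc`):
```bash=
#Run FastQC on all raw reads and collect the reports in one folder
mkdir -p fastqc_out
fastqc -t 4 -o fastqc_out *.fastq.gz
```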
## 2.) Trimmomatic
[Trimmomatic GitHub](https://github.com/usadellab/Trimmomatic/blob/main/README.md)
Most files coming straight from the sequencing facility have a few extra characters in their names. Clean up file names that look like samplename_S79_L001_R1_001.fastq.gz:
```bash=
#to remove the _S79 tag:
for filename in *fastq.gz;
do mv "$filename" "$(echo "$filename" | sed 's/_S[0-9]*_/_/')"; done
#to remove the _001 suffix:
for filename in *fastq.gz; do mv "$filename" "$(echo "$filename" | sed 's/_001.fastq.gz/.fastq.gz/')"; done
#to remove the lane tag (_L001 etc.):
for filename in *fastq.gz; do mv "$filename" "$(echo "$filename" | sed 's/_L00[0-9]_/_/')"; done
```
Trimmomatic filters out low-quality reads, removes adapter sequences, etc. The Trimmomatic conda environment needs to be activated before running it.
```bash=
conda activate trimmomatic
```
:::spoiler Trimmomatic Arguments
- `PE -phred33` runs Trimmomatic in paired-end mode and specifies the Phred+33 quality-score encoding used in the raw reads
- `input_R1.fastq.gz` and `input_R2.fastq.gz` are the 2 input files (the raw read files) that trimmomatic acts on
- `output_R1_paired.fastq.gz output_R1_unpaired.fastq.gz output_R2_paired.fastq.gz output_R2_unpaired.fastq.gz` are the 4 output files that are produced, 2 paired read files and 2 unpaired read files
- The remaining arguments inform Trimmomatic's decisions about keeping/removing sequences
:::
```bash=
#This loop appends one Trimmomatic command per sample pair to trimmomatic.sh
#Run it in the folder of renamed read files
for _R1_ in *_R1*; do
R2=${_R1_//_R1.fastq.gz/_R2.fastq.gz}
R1p=${_R1_//.fastq.gz/_paired.fastq.gz}
R1u=${_R1_//.fastq.gz/_unpaired.fastq.gz}
R2p=${R2//.fastq.gz/_paired.fastq.gz}
R2u=${R2//.fastq.gz/_unpaired.fastq.gz}
echo java -jar /software/trimmomatic/0.39/trimmomatic-0.39.jar PE -phred33 $_R1_ $R2 $R1p $R1u $R2p $R2u ILLUMINACLIP:/software/trimmomatic/0.39/adapters/TruSeq3-PE.fa:2:30:10:2:true LEADING:10 TRAILING:10 SLIDINGWINDOW:4:20 MINLEN:40 >> trimmomatic.sh
done
```
This should be run through Slurm. Create a Slurm submission script using nano and name it `slurm_trimmomatic.sh`:
```
#!/bin/bash
#SBATCH --account=p31939
#SBATCH --partition=normal
#SBATCH --time=24:00:00
#SBATCH --mem=1GB
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --job-name=trimmomatic
#SBATCH --output=trim.out
#SBATCH --error=trimm.error
#SBATCH --mail-type=ALL
#SBATCH --mail-user=brendanconnolly2023@u.northwestern.edu
bash ./trimmomatic.sh
```
Trimmomatic cannot utilize more than 1GB of memory, so requesting more does not make it run faster. I had ~250 files (~168GB of data), which took a little over 14 hours to run.
To run it, use the sbatch command:
```bash=
sbatch slurm_trimmomatic.sh
```
After Trimmomatic runs you will have the original files, R1 and R2 paired files, and R1 and R2 unpaired files.
Examples:
HE1_R1.fastq.gz
HE1_R2.fastq.gz
HE1_R1_paired.fastq.gz
HE1_R2_paired.fastq.gz
HE1_R1_unpaired.fastq.gz
HE1_R2_unpaired.fastq.gz
For HybPiper you only need the R1_paired.fastq.gz, R2_paired.fastq.gz, and a concatenated unpaired file.
To concatenate the two unpaired files into one
```bash=
for file in *R1_unpaired*
do
file2=${file//_R1_/_R2_} #matching R2 unpaired file
file3=${file//R1_unpaired/unpaired} #combined output name
cat $file $file2 > $file3
done
```
Take a moment to organize your files. Make a new directory ./raw_reads and, within it, a directory ./unconcatenated_unpaired_reads, then move the originals into them.
Now you should only have these three files per sample:
HE1_R1_paired.fastq.gz
HE1_R2_paired.fastq.gz
HE1_unpaired.fastq.gz
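A sketch of that cleanup (directory layout as described above):
```bash=
#Stash the unconcatenated unpaired files, then the original raw reads
mkdir -p raw_reads/unconcatenated_unpaired_reads
mv *_R1_unpaired.fastq.gz *_R2_unpaired.fastq.gz raw_reads/unconcatenated_unpaired_reads/
mv *_R1.fastq.gz *_R2.fastq.gz raw_reads/
```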
## 3.) Alignment - BWA
BWA allows us to align our reads to a reference. We're going to align reads to the FASTA file of the 96 loci that we created primers for.
First, let's install BWA and Samtools:
```
conda create --name BWA
conda activate BWA
conda install bioconda::bwa
conda install bioconda::samtools
```
### Indexing
```
bwa index Final96_loci_sequences.fa
```
### Alignment
Testing on a single sample
```
bwa mem -M -t 4 Final96_loci_sequences.fa 69-F6-BU-3_R1_paired.fastq.gz 69-F6-BU-3_R2_paired.fastq.gz > 69-F6-BU-3.sam
#Looking at how successful mapping was
samtools flagstat 69-F6-BU-3.sam
#Turning sam into bam (more storage efficient)
samtools view 69-F6-BU-3.sam -b -o 69-F6-BU-3.bam
#Sorting bam file
samtools sort 69-F6-BU-3.bam -o 69-F6-BU-3_sort.bam
```
Looks good; let's try it on all samples now. Make the following script called align_sort.sh and a new folder called ./align, then run the script through Slurm.
This takes a while to run: I split my 266 samples into groups of ~26 by moving them into separate directories, then ran the script below on each directory so that BWA would finish faster.
BWA also allows multiple threads (Quest allows up to 28); change -t to 12 and ntasks-per-node to 12.
```
#!/bin/bash
#SBATCH --account=p31939 ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=normal ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time=48:00:00 ## Required: How long will the job need to run (remember different partitions have restrictions on this parameter)
#SBATCH --nodes=1 ## how many computers/nodes do you need (no default)
#SBATCH --ntasks-per-node=12 ## how many cpus or processors do you need on per computer/node (default value 1)
#SBATCH --mem-per-cpu=6G ## how much RAM do you need per computer/node (this affects your FairShare score so be careful to not ask for more than you need))
#SBATCH --job-name=bwa_align ## When you run squeue -u
#SBATCH --mail-type=ALL ## BEGIN, END, FAIL or ALL
#SBATCH --mail-user=brendanconnolly2023@u.northwestern.edu
#SBATCH --error=bwa_align.error # Enter a error log file name
#SBATCH --output=bwa_align.out # Enter your out.log file name
module purge
eval "$(conda shell.bash hook)"
conda activate /home/bjc8165/.conda/envs/BWA
INDS=($(for i in ./*R1_paired.fastq.gz; do echo $(basename ${i%_R*}); done))
for IND in ${INDS[@]};
do
# declare variables
FORWARD=./${IND}_R1_paired.fastq.gz
REVERSE=./${IND}_R2_paired.fastq.gz
OUTPUT=./align/${IND}_sort.bam
# then align and sort
echo "Aligning $IND with bwa"
bwa mem -M -t 12 Final96_loci_sequences.fa $FORWARD $REVERSE | samtools view -b | \
samtools sort -T ${IND} > $OUTPUT
done
```
## 4.) PrimerClip
Primerclip is used to further remove low-quality bases and PCR primer sequences from trimmed and aligned reads: https://github.com/swiftbiosciences/primerclip
We need a masterfile of the loci of interest and PCR primers; IDT sent me one directly for my loci. The Stack Exchange post linked below has more info on the masterfile.
:::spoiler Format of masterfile is a .txt with 12 columns:
1) Chromosome
2) Target Start
3) Target Stop
4) Target Name (changing this affects everything downstream in counting)
5) 5' Primer Start
6) 5' Primer Stop
7) 5' Primer Name
8) 3' Primer Start
9) 3' Primer Stop
10) 3' Primer Name
11) 5' Primer Sequence
12) 3' Primer Sequence
:::
A BEDPE file can also be used. https://bioinformatics.stackexchange.com/questions/5489/what-is-the-primer-masterfile-format
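A purely hypothetical example row to show the layout (tab-separated; locus names, coordinates, and primer sequences are illustrative only, not from this project):
```
#hypothetical values - columns 1-12 as listed above
locus1	120	320	locus1_target	100	119	locus1_F	321	340	locus1_R	ACGTACGTACGTACGTACGT	TGCATGCATGCATGCATGCA
```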
Primerclip expects input files to be .sam, so we have to turn the .bam files back into .sam, then name-sort each; then we can run Primerclip.
Before running, let's make a new conda env called primerclip:
```
conda create -n primerclip
conda activate primerclip
conda install bioconda::primerclip
conda install bioconda/label/cf201901::samtools
#Ran into samtools error: "Samtools shared library libcrypto.so.1.0.0 not found". Found this solution where you replace libcrypto.so.3 with libcrypto.so.1.0.0
cd /home/bjc8165/miniconda3/envs/primerclip/lib
ln -s libcrypto.so.3 libcrypto.so.1.0.0
```
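The bam-to-sam conversion isn't in the script below, so here's a minimal sketch (run in the directory of *_sort.bam files from the alignment step):
```
#Convert each sorted BAM back to SAM (keeping the header) for Primerclip
for f in *_sort.bam; do
samtools view -h "$f" -o "${f%.bam}.sam"
done
```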
Make the following script for running Primerclip - PrimerClip.sh.
The .sam files are huge and 1TB of space on Quest was not enough storage, so I transferred everything to Curie and ran the script below.
```
#!/bin/bash
#Load Conda Environment
source /home/bjc8165/miniconda3/bin/activate primerclip
INDS=($(for i in ./*_sort.sam; do echo $(basename ${i%_sort*}); done)) #declares our individual names
for IND in ${INDS[@]};
do
samtools sort -n ${IND}_sort.sam -o ${IND}_namesort.sam;
primerclip Primer_Sequences_DelphiniumAmplicon.txt ${IND}_namesort.sam ${IND}_clipped.sam;
samtools view -b ${IND}_clipped.sam -o ${IND}_clipped.bam;
samtools sort ${IND}_clipped.bam -o ${IND}_clipped_sort.bam;
done
```
Let's remove the intermediary files and make md5sum files of everything here for when we transfer it back to Quest:
```
#Remove intermediary name-sorted files
rm *name* *clipped.sam *clipped.bam
#Organize Files
mv *clipped_sort.bam /home/bjc8165/clipped
mv *.log /home/bjc8165/clipped
#Generate md5sum files compiling md5 hashes for everything in a directory
#(run the first in the directory of name-sorted SAMs, the second in the clipped directory)
find -type f -exec md5sum '{}' \; > curie_samsorted_md5sum.txt
find -type f -exec md5sum '{}' \; > curie_samclipped_md5sum.txt
#Run in powershell to transfer md5 files to local
scp bjc8165@10.2.0.53:/home/bjc8165/PrimerClip/curie_samsorted_md5sum.txt C:/Users/Brendan/Downloads
scp bjc8165@10.2.0.53:/home/bjc8165/clipped/curie_samclipped_md5sum.txt C:/Users/Brendan/Downloads
```
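After copying everything back to Quest, the transfer can be verified against the recorded hashes (a sketch; run from the directory that mirrors the one hashed on Curie):
```
#Verify each file against its recorded md5 hash; mismatches are reported
md5sum -c curie_samclipped_md5sum.txt
```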
:::spoiler TrimAl - MAY NOT NEED TO DO
When sequences are aligned, a lot of gaps are introduced, especially towards the beginnings/ends of sequences. This is because all sequences must end up the same length, which is at minimum the length of the longest sequence in the file. We can remove base-pair positions that are mostly gaps using trimAl; use -gt 0.5 to remove positions where more than 50% of sequences have a gap. Note I installed trimAl in my conda environment instead of using modules on Quest.
Before running trimAl, create a conda environment and install it. First create the environment:
```
conda create --name trimal
#Then activate the environment
conda activate trimal
#Then download trimal
conda install trimal
```
Now you can run the script below
```
#!/bin/bash
#SBATCH --account=p32038
#SBATCH --partition=short
#SBATCH --time=00:30:00
#SBATCH --mem=2G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --job-name=lob_trim
#SBATCH --error=trim.log
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephangirard2024@u.northwestern.edu
for i in *.fasta; do
trimal -in $i -out ${i//aligned/trim} -gt 0.5
done
```
:::
## 5.) On-target and coverage uniformity calculation using Picard tools
We first need to remove duplicate reads from the dataset to avoid PCR duplicates and technical duplicates, which inflate our sequencing depth and give us false confidence in the genotype calls. We can use Picard Tools to do this: https://broadinstitute.github.io/picard/ OR GATK, which has Picard built in: https://github.com/broadinstitute/gatk
```
conda create -n GATK
conda activate GATK
conda install bioconda::gatk4
conda install bioconda::picard
module load bamtools
```
Calculating depth of coverage at each locus can be done with Picard or Bedtools. First we'll use Bedtools, as it's easier to use:
```
###############################################
##Use bedtools to check coverage of each locus##
###############################################
#We need to first turn our fasta file into a .dict file
picard CreateSequenceDictionary R=Final96_loci_sequences.fa O=Final96_loci_sequences.dict
#Then Index our fasta file
samtools faidx Final96_loci_sequences.fa
#Coverage with bedtools
#Make a new interval file with loci names pulled from a bed file - The coordinates for the target loci here are 1 to the end of the loci as we are also interested in variant sites on either side of the original SNP
samtools view -H 108-109_NF1_clipped_sort.bam | grep SN | cut -f 2,3 | sed 's/SN://g' | sed 's/LN://g' | awk '{print $1 "\t1\t" $2}' > bedtools_interval.bed
#Index each bam file
samtools index 109-F4-BU-3_clipped_sort.bam
#Checking coverage with bedtools
bedtools coverage -a bedtools_interval.bed -b 109-F4-BU-3_clipped_sort.bam
#For loop to do everything at once:
module load bedtools
INDS=($(for i in ./*_clipped_sort.bam; do echo $(basename ${i%_clipped*}); done)) #declares our individual names
for IND in ${INDS[@]};
do
OUTPUT=./coverage_logs/${IND}_coverage_log
samtools index ${IND}_clipped_sort.bam;
bedtools coverage -a bedtools_interval.bed -b ${IND}_clipped_sort.bam > $OUTPUT;
done
##################################################
##Merging all samples into one and checking coverage##
##################################################
samtools merge -n allfiles.bam *.bam
###############################################################
### Work in Progress: Extra code, if you're using PicardTools##
###############################################################
#We need to make a interval_list that has loci name and position of loci
#We only need first three columns from the sequence.txt file
cut -f 1-3 Primer_Sequences_DelphiniumAmplicon.txt > Primer_Sequences_DelphiniumAmplicon.bed
picard BedToIntervalList -I Primer_Sequences_DelphiniumAmplicon.bed -O interval_list -SD Final96_loci_sequences.fa
#Remove headers from interval file, Picard needs headers, but Bamtools does not
#grep -v "^@" test.interval_list > no_headers.interval_list
#Duplicate the interval list and call it the target list
cp test.interval_list test.target_list
picard CollectTargetedPcrMetrics -I 108-109_NF1_clipped_sort.bam -O test_pcr_metrics.txt -AI test.interval_list -TI test.target_list
```
Picard lets you calculate a ton of statistics - https://broadinstitute.github.io/picard/picard-metric-definitions.html#TargetedPcrMetrics . Here are some that might be useful from the Targeted PCR Metrics section:
```
CUSTOM_AMPLICON_SET #The name of the amplicon set used in this metrics collection run
PF_UNIQUE_READS #The number of PF_READS that were not marked as sample or optical duplicates.
ON_AMPLICON_BASES #The number of PF_BASES_ALIGNED that mapped to an amplified region of the genome.
NEAR_AMPLICON_BASES #The number of PF_BASES_ALIGNED that mapped to within a fixed interval of an amplified region, but not on a baited region.
PCT_AMPLIFIED_BASES #The fraction of PF_BASES_ALIGNED that mapped to or near an amplicon, (ON_AMPLICON_BASES + NEAR_AMPLICON_BASES)/PF_BASES_ALIGNED.
MEAN_AMPLICON_COVERAGE #The mean read coverage of all amplicon regions in the experiment.
MEAN_TARGET_COVERAGE #The mean read coverage of all target regions in an experiment.
MEDIAN_TARGET_COVERAGE #The median coverage of reads that mapped to target regions of an experiment.
FOLD_ENRICHMENT #The fold by which the amplicon region has been amplified above genomic background.
PCT_EXC_DUPE #The fraction of aligned bases that were filtered out because they were in reads marked as duplicates.
#To check a bam for formatting problems before collecting metrics:
gatk ValidateSamFile -I input.bam -MODE SUMMARY
```
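To eyeball these for one sample, the header row and values row of the Picard metrics file can be paired up (a sketch, assuming the test_pcr_metrics.txt produced above; the column names sit on the line starting with CUSTOM_AMPLICON_SET and the values on the next line):
```
#Print metric_name<TAB>value pairs from the CollectTargetedPcrMetrics output
grep -A1 "^CUSTOM_AMPLICON_SET" test_pcr_metrics.txt | \
awk -F'\t' 'NR==1 {for (i=1; i<=NF; i++) h[i]=$i} NR==2 {for (i=1; i<=NF; i++) print h[i] "\t" $i}'
```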
Removing duplicates with Picard MarkDuplicates (`inds` is a text file listing one sample basename per line):
```
cat inds | parallel --verbose -j 10 \
java -Xmx1g -jar /home/scripts/picard.jar \
MarkDuplicates REMOVE_DUPLICATES=true \
ASSUME_SORTED=true VALIDATION_STRINGENCY=SILENT \
MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 \
INPUT={}.bam \
OUTPUT={}.rmd.bam \
METRICS_FILE={}.rmd.bam.metrics
# Now we need to index all bam files again and that's it!
for f in *.rmd.bam; do samtools index $f; done
```
## 6.) Variant Calling with STACKS - Code works but threw away too many loci
:::spoiler Variant Calling - STACKS ref_map.pl
```
#Loading stacks as a new conda env
conda create -n STACKS
conda activate STACKS
conda install bioconda::stacks
```
Navigate to the folder with the .bam files, then run:
```
ref_map.pl --samples ./ --popmap ./pop_map.txt -s clipped.sort --out-path ./stacks_r0.2 -X "populations:-r 0.2" -X "populations:--fasta-loci" -X "populations:--fasta-samples" -X "populations:--vcf" -X "populations:--genepop" -X "populations:--fstats" -X "populations:--hwe"
```
Gstacks fails when you try to run it on all samples. I had to make a population map **with just the 80 largest samples for it to run**. Apparently this is a common problem with samples with low read depth.
The only loci that Stacks output were SNPs in the microsatellite sites. This isn't helpful, since we analyze those separately below. Let's try GATK.
:::
## 6.) Variant calling - GATK
Scripts to run the GATK HaplotypeCaller. Execute each script in the directory with the .bam files.
#### Setup for GATk
```
# Create conda env for GATk4
conda create -n GATk4
conda activate GATk4
conda install -c bioconda gatk4
conda install bioconda::samtools
#Make directory for output files
mkdir GATk
```
#### Optional: Remove microsatellite reads before processing - faster
Name the script rm_micro.sh:
```
#!/bin/bash
module purge
eval "$(conda shell.bash hook)"
conda activate /home/bjc8165/.conda/envs/GATk4
#Removes any reads mapped to loci whose names start with K, which all our microsat loci do
INDS=($(for i in ./*sort.bam; do echo $(basename ${i%_sort*}); done))
for IND in ${INDS[@]};
do
# declare variables
INP=${IND}_sort.bam
OUTPUT=${IND}_nomicro.bam
samtools view -h $INP | awk '$3 !~ /^K/' | samtools view -Sb > $OUTPUT
done
# Sorts each bam file.
INDS=($(for i in ./*nomicro.bam; do echo $(basename ${i%_nomicro.bam}); done))
for IND in ${INDS[@]};
do
# declare variables
INP=${IND}_nomicro.bam
OUTPUT=${IND}_nomicro_sort.bam
samtools sort $INP -o $OUTPUT
done
rm *nomicro.bam
```
#### Add sample names to @RG tags - rename.sh
```
#!/bin/bash
module purge
eval "$(conda shell.bash hook)"
conda activate /home/bjc8165/.conda/envs/GATk4
INDS=($(for i in ./*nomicro_sort.bam; do echo $(basename ${i%_nomicro_sort*}); done))
for IND in ${INDS[@]};
do
# declare variables
INP=${IND}_nomicro_sort.bam
OUTPUT=${IND}_rename.bam
gatk AddOrReplaceReadGroups \
-I $INP \
-O $OUTPUT \
-RGID $IND \
-RGLB 1 \
-RGPL ILLUMINA \
-RGPU ${IND}_1 \
-RGSM $IND
done
```
#### Index the renamed files - index.sh
```
#!/bin/bash
module purge
eval "$(conda shell.bash hook)"
conda activate /home/bjc8165/.conda/envs/GATk4
INDS=($(for i in ./*rename.bam; do echo $(basename ${i%_r*}); done))
for IND in ${INDS[@]};
do
# declare variables
INP=./${IND}_rename.bam
echo "Indexing $IND with samtools"
samtools index $INP
done
```
#### Run GATK HaplotypeCaller - gatk_haplo.sh
```
#!/bin/bash
module purge
eval "$(conda shell.bash hook)"
conda activate /home/bjc8165/.conda/envs/GATk4
# Define variables
INDS=($(for i in ./*rename.bam; do echo $(basename ${i%_rename.bam*}); done)) #declares our individual names
for IND in ${INDS[@]}; do
gatk HaplotypeCaller \
-R Final96_loci_sequences.fa \
-I ${IND}_rename.bam \
-O ./GATk/${IND}.g.vcf.gz \
-ERC GVCF
done
```
#### Combine gVCFs - gatk_combine.sh
```
#!/bin/bash
module purge
eval "$(conda shell.bash hook)"
conda activate /home/bjc8165/.conda/envs/GATk4
# Initialize the GVCF list
GVCF_LIST=""
# Loop through all GVCF files in the directory
for GVCF in ./GATk/*.g.vcf.gz; do
GVCF_LIST+=" --variant ${GVCF}"
done
gatk CombineGVCFs \
-R Final96_loci_sequences.fa \
$GVCF_LIST \
-O ./GATk/combined.g.vcf.gz
```
#### Genotype the combined gVCF - gatk_geno_gvcf.sh
```
#!/bin/bash
module purge
eval "$(conda shell.bash hook)"
conda activate /home/bjc8165/.conda/envs/GATk4
gatk --java-options "-Xmx8g" GenotypeGVCFs \
-R Final96_loci_sequences.fa \
-V ./GATk/combined.g.vcf.gz \
-O ./GATk/final.vcf.gz
```
#### To run all scripts at once, do this
```
#!/bin/bash
#SBATCH --account=p31939 ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=normal ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time=16:00:00 ## Required: How long will the job need to run (remember different partitions have restrictions on this parameter)
#SBATCH --nodes=1 ## how many computers/nodes do you need (no default)
#SBATCH --ntasks-per-node=1 ## how many cpus or processors do you need on per computer/node (default value 1)
#SBATCH --mem-per-cpu=4G ## how much RAM do you need per computer/node (this affects your FairShare score so be careful to not ask for more than you need)
#SBATCH --mail-type=ALL ## BEGIN, END, FAIL or ALL
#SBATCH --mail-user=brendanconnolly2023@u.northwestern.edu
#SBATCH --error=gatk_all_noprimerclip.error # Enter a error log file name
#SBATCH --output=gatk_all_noprimerclip.out # Enter your out.log file name
bash ./rm_micro.sh
bash ./rename.sh
bash ./index.sh
bash ./gatk_haplo.sh
bash ./gatk_combine.sh
bash ./gatk_geno_gvcf.sh
```
## 7.) VCF Filtering
```
conda create -n vcftools
conda activate vcftools
conda install bioconda::vcftools
cd ./GATk
gunzip final.vcf.gz
mkdir vcftools_stats
```
#### Stats on unfiltered VCF to get VCF filters
```
#!/bin/bash
module purge
eval "$(conda shell.bash hook)"
conda activate /home/bjc8165/.conda/envs/vcftools
OUT=./vcftools_stats/final
#calculate allele frequency - exclude any sites with more than 2 alleles
vcftools --vcf final.vcf --freq2 --out $OUT --max-alleles 2
#Calculate mean depth per individual
vcftools --vcf final.vcf --depth --out $OUT
#Calculate mean depth per site
vcftools --vcf final.vcf --site-mean-depth --out $OUT
#calculate site quality
vcftools --vcf final.vcf --site-quality --out $OUT
#Calculate proportion of missing data per individual
vcftools --vcf final.vcf --missing-indv --out $OUT
#Calculate proportion of missing data per site
vcftools --vcf final.vcf --missing-site --out $OUT
#Calculate heterozygosity and inbreeding coefficient per individual
vcftools --vcf final.vcf --het --out $OUT
```
#### Analyze stats in R
Download each file and place it in your R working directory (the scripts below use the tidyverse).
##### R Script: Stats across variants
```
library(tidyverse) #provides read_delim, dplyr, and ggplot2 used below
##########variant quality
#This is Phred-encoded - a Phred score of 30 = a 1 in 1000 chance that a SNP call is erroneous
#Overall quality is very high. Let's just set the filter to 30
var_qual <- read_delim("./SNP_OverallStats/final.lqual", delim = "\t",
col_names = c("chr", "pos", "qual"), skip = 1)
a <- ggplot(var_qual, aes(qual)) +
geom_density(fill = "dodgerblue1", colour = "black", alpha = 0.3)+
scale_x_continuous(limits = c(0, 1000))
a + theme_light()
##########Variant mean depth
#This is the mean of the read depth across all individuals - it is for both alleles at a position
#Mean = 9.39. Median = 0.72. Max = 251. Most are between 0 and 3 dp
# As the outliers show, some regions clearly have extremely high coverage, likely reflecting mapping/assembly errors and paralogous or repetitive regions. We should exclude them. A good rule of thumb is the mean depth x 2 - so in this case we could set our maximum depth at ~20x.
# Many sites have such a low read coverage, we likely can't be too confident in these calls. Let's follow a good rule of thumb with a minimum depth of 10x
var_depth <- read_delim("./SNP_OverallStats/final.ldepth.mean", delim = "\t",
col_names = c("chr", "pos", "mean_depth", "var_depth"), skip = 1)
summary(var_depth$mean_depth)
b <- ggplot(var_depth, aes(mean_depth)) +
geom_density(fill = "dodgerblue1", colour = "black", alpha = 0.3) +
scale_x_continuous(limits = c(0, 115))
b + theme_light()
#########Variant missingness - REVISIT
# the proportion of missingness at each variant. This is a measure of how many individuals lack a genotype at a call site
#Lots of missing data. Mean = 87% missing, median = 99% missing
var_miss <- read_delim("./SNP_OverallStats/final.lmiss", delim = "\t",
col_names = c("chr", "pos", "nchr", "nfiltered", "nmiss", "fmiss"), skip = 1)
summary(var_miss$fmiss)
c <- ggplot(var_miss, aes(fmiss)) + geom_density(fill = "dodgerblue1", colour = "black", alpha = 0.3)
c + theme_light()
########## Minor allele frequency - REVISIT
# mean=.166, median = 0.063
# 266 samples = 532 possible alleles at a site
# The average MAF therefore means an allele appears in ~88 allele copies (.166 * 532)
# A MAF of 0.1 then means an allele appears in ~53 allele copies
#It is best practice to produce one dataset with a good MAF threshold and keep another without any MAF filtering at all.
#For now however, we will set our MAF to 0.1
var_freq <- read_delim("./SNP_OverallStats/final.frq", delim = "\t",
col_names = c("chr", "pos", "nalleles", "nchr", "a1", "a2"), skip = 1)
var_freq$maf <- var_freq %>%
dplyr::select(a1, a2) %>%
apply(1, function(z) min(z)) #Find minor allele freq.
summary(var_freq$maf)
d <- ggplot(var_freq, aes(maf)) + geom_density(fill = "dodgerblue1", colour = "black", alpha = 0.3)
d + theme_light()
```
##### Stats across samples
```
#######Mean depth per individual
#mean = 9.38, median = 5.36, max = 68.25
#There are around 3 samples that look like large outliers we may want to exclude - really high read depth can bias results
#8_Flav, 102-F2-GH-37, 109-F4-BU-3
ind_depth <- read_delim("./SNP_OverallStats/final.idepth", delim = "\t",
col_names = c("ind", "nsites", "depth"), skip = 1)
summary(ind_depth$depth)
e <- ggplot(ind_depth, aes(depth)) +
geom_histogram(fill = "dodgerblue1", colour = "black", alpha = 0.3, bins = 100)
e + theme_light()
########Proportion of missing data per individual
#mean = 87.2%, median = 89.2%, min = 53.0%
#No surprise, but we have a lot of missing data that we should filter out.
ind_miss <- read_delim("./SNP_OverallStats/final.imiss", delim = "\t",
col_names = c("ind", "ndata", "nfiltered", "nmiss", "fmiss"), skip = 1)
summary(ind_miss$fmiss)
f <- ggplot(ind_miss, aes(fmiss)) +
geom_histogram(fill = "dodgerblue1", colour = "black", alpha = 0.3, bins = 100)
f + theme_light()
##########Heterozygosity and inbreeding coefficient per individual
#A couple samples have really negative Fis, which we should consider removing - indicative of allelic dropout
#Several also have really high Fis of 1.0, which we should consider removing too.
ind_het <- read_delim("./SNP_OverallStats/final.het", delim = "\t",
col_names = c("ind","ho", "he", "nsites", "f"), skip = 1)
g <- ggplot(ind_het, aes(f)) +
geom_histogram(fill = "dodgerblue1", colour = "black", alpha = 0.3, bins = 50)
g + theme_light()
```
#### VCF Filters summary:
**--remove-indels** - remove all indels (SNPs only)
**--remove top 25% of samples with missing data**
**--maf 0.05** - set minor allele frequency - 0.05 here
**--max-missing 0.5** - set minimum non-missing data. A little counterintuitive - 0 is totally missing, 1 is none missing. Here 0.5 means we will tolerate 50% missing data for a site. LOWERING THIS INCREASES SITES
**--minQ 30** - The minimum quality score required for a site to pass our filtering threshold.
**--min-meanDP 5** - the minimum mean depth for a site. This is what's filtering out a lot of SITES. 10 IDEAL.
**--max-meanDP 80** - the maximum mean depth for a site. 80 might be a bit high. 40 filters out everything
**--minDP 5** - the minimum depth allowed for a genotype - any individual failing this threshold is marked as having a missing genotype. 10 would be ideal.
**--maxDP 80** - the maximum depth allowed for a genotype - any individual failing this threshold is marked as having a missing genotype. 40 would be better
**Let's also remove the 25% of samples with most missing data**
:::spoiler rm.ind
```
92-F1-GH-6
125B-F13-BU-6
69-F6-BU-7
92-F1-GH-7
70-71_NF1
125B-F13-BU-5
3-F6-GH-13
52-F5-BU-7
65-F3-BU-4
69-F7-BU-2
92-F1-GH-4
125B-F13-BU-7
70B-F2-GH-20
70B-F2-GH-23
173_Control
70A-F1-GH-17
102-F2-GH-9
109-F3-BU-4
109-F3-BU-7
68-F3-BU-9
69-F6-BU-1
125B-F13-BU-8
170_Control
101-F3-BU-1
125B-F13-BU-3
186_Control
45-F2-GH-1
45-F2-GH-17
60_NF1
70B-F2-GH-24
84_Appo
100-104_NF1
101-F1-GH-10
109-F4-BU-5
44-F2-BU-3
52-F3-GH-8
52-F5-BU-5
60-F7-BU-8
64-F6-BU-5
65-F3-BU-3
70A-F1-GH-4
184_Control
3-F8-BU-2
52-F4-GH-17
60-F5-GH-25
60-F7-BU-1
60-F7-BU-6
69-F7-BU-7
70B-F2-GH-19
91-F1-GH-14
92-F2-BU-2
101-F1-GH-6
102_Appo
109-F4-BU-2
126_Flav
28-F8-BU-2
3-F8-GH-4
4-F7-BU-1
60-F6-BU-1
65-F2-GH-27
91-F1-GH-13
69-F6-BU-5
83-84_NF1
101-F1-GH-13
109-F3-BU-5
34_Appo_Flav
52-F5-BU-6
109-F3-BU-2
185_Control
104-F3-BU-3
91-F2-BU-6
102-F2-GH-20
104-F3-BU-5
66B-F6-BU-8
100_Appo
92-F1-GH-6
8_Flav
102-F2-GH-37
109-F4-BU-3
```
:::
Applying filters
```
#Remove the samples with a lot of missing data
vcftools --vcf final.vcf --remove rm.ind --recode --recode-INFO-all --out subset
vcftools --vcf subset.recode.vcf --remove-indels --maf 0.05 --max-missing 0.50 --minQ 30 --min-meanDP 5 --max-meanDP 80 --minDP 5 --maxDP 40 --recode --recode-INFO-all --out maf.05_missing0.50_minq30_minDP5_maxmeanDP80_noprimerclip.subset
```
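A quick sanity check before moving to R (a sketch): count the variant sites that survived filtering by counting non-header lines in the recoded VCF.
```
#Count surviving sites (vcftools --out PREFIX writes PREFIX.recode.vcf)
grep -vc "^#" maf.05_missing0.50_minq30_minDP5_maxmeanDP80_noprimerclip.subset.recode.vcf
```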
Download the .vcf and analyze in R. The above filter results in 129 samples across 3 SNPs at 1 locus. Shows flavifrons is greater than Appo.
## Trying GTseq.pl pipeline
Obtained from Campbell et al. 2014: https://onlinelibrary.wiley.com/doi/10.1111/1755-0998.12357
```
#Reusing the repeatseq conda env (created in the RepeatSeq section below)
conda activate repeatseq
conda install git
conda install conda-forge::cmake
conda install bioconda::seqtk #seqtk is needed for the read merging below
#Clone gtseq in chosen directory
git clone https://github.com/GTseq/GTseq-Pipeline.git
#Merge forward and reverse reads
gunzip *.gz
seqtk mergepe *R1.fastq *R2.fastq > combined.fastq
perl /projects/p31939/DENU_Amplicon/snp_gtseq/GTseq-Pipeline/GTseq_Genotyper.pl LocusInfo.csv combined.fastq > combined.genos
```
# Amplicon analysis: Microsats
Paper with pipeline (Pimentel et al. 2018): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5855144/pdf/fgene-09-00073.pdf
Paper with R analysis pipeline (Perry et al. 2024): https://onlinelibrary.wiley.com/doi/10.1111/1755-0998.13933
## 1.) Alignment of only microsat reads
### Bowtie2 - local alignment
Let's try a local alignment, which uses soft-clipping and is a less conservative alignment than end-to-end. We're trying this because the microsat sequences we are aligning to come from other Delphinium taxa, and local alignments may better handle indels (i.e., microsatellite alleles).
#### Additional resources for reference
Biostar post on local alignments: https://www.biostars.org/p/369638/
Post on dealing with soft-clipping: https://www.biostars.org/p/130374/
Pipeline for re-aligning with BWA instead of bowtie: https://github.com/adaptivegenome/repeatseq
Tutorials on variant calling with samtools: https://angus.readthedocs.io/en/2017/variant-calling.html
https://samtools.sourceforge.net/mpileup.shtml
https://cloud.wikis.utexas.edu/wiki/spaces/bioiteam/pages/47729784/Variant+calling+using+SAMtools
https://wurmlab.com/genomicscourse/2016-SIB/practicals/population_genetics/map_call
#### Creating index
First have a fasta file with just the microsat regions
```
conda create -n bowtie2
conda activate bowtie2
conda install bioconda::bowtie2
conda install bioconda::samtools
conda install bioconda::bcftools
bowtie2-build ./micro_sequences.fa micro_sequences
```
#### Performing local alignment
```
#create output directory
mkdir bowtie_local
```
Script to run for alignment - note that samples < 1GB run pretty fast, while larger samples can take much longer. I split my samples into subset directories and ran this script in each folder for efficiency.
```
#!/bin/bash
#SBATCH --account=p31939 ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=normal ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time=8:00:00 ## Required: How long will the job need to run (remember different partitions have restrictions on this parameter)
#SBATCH --nodes=1 ## how many computers/nodes do you need (no default)
#SBATCH --ntasks-per-node=2 ## how many cpus or processors do you need on per computer/node (default value 1)
#SBATCH --mem-per-cpu=8G ## how much RAM do you need per computer/node (this affects your FairShare score so be careful to not ask for more than you need)
#SBATCH --job-name=bowtie_align1 ## When you run squeue -u
#SBATCH --mail-type=ALL ## BEGIN, END, FAIL or ALL
#SBATCH --mail-user=brendanconnolly2023@u.northwestern.edu
#SBATCH --error=bwa_align1.error # Enter a error log file name
#SBATCH --output=bwa_align1.out # Enter your out.log file name
module purge
eval "$(conda shell.bash hook)"
conda activate /home/bjc8165/.conda/envs/bowtie2
INDS=($(for i in ./*R1_paired.fastq.gz; do echo $(basename ${i%_R*}); done))
for IND in ${INDS[@]};
do
# declare variables
FORWARD=./${IND}_R1_paired.fastq.gz
REVERSE=./${IND}_R2_paired.fastq.gz
OUTPUT=./bowtie_local/${IND}_sort.bam
# then align and sort
echo "Aligning $IND with bowtie2"
bowtie2 -p 2 -q --very-sensitive-local -x micro_sequences -1 $FORWARD -2 $REVERSE --rg-id $IND --rg SM:$IND --rg LB:1 --rg PU:${IND}_1 --rg PL:ILLUMINA | samtools view -b | samtools sort -T ${IND} > $OUTPUT
done
```
### Marking Duplicate Reads
Downstream programs seem to expect this. The command also creates an index file.
```
#Installing GATk4
conda create -n GATk4
conda activate GATk4
conda install -c bioconda gatk4
conda deactivate
conda activate GATk4
cd bowtie_local
mkdir dup_marked
#Marking Duplicates
#!/bin/bash
eval "$(conda shell.bash hook)"
conda activate /home/bjc8165/.conda/envs/GATk4
INDS=($(for i in ./*sort.bam; do echo $(basename ${i%_sort.bam}); done))
for IND in ${INDS[@]};
do
# declare variables
INP=${IND}_sort.bam
OUTPUT=./dup_marked/${IND}_marked.bam
# then align and sort
echo "Marking Duplicates in $IND"
gatk --java-options "-Xmx8G" MarkDuplicates --MAX_OPTICAL_DUPLICATE_SET_SIZE -1 --CREATE_INDEX true -I $INP -O $OUTPUT -M ${IND}_dup_metrics.txt
done
```
**The two largest files run into core-dumping errors when marking duplicates** (102-F2-GH-37_sort.bam, 8_Flav_sort.bam). Not sure how to fix; these files were dropped.
## 2.) Local re-alignment around indels
Microsatellite length variants are essentially indels, which can cause errors with microsat variant calling. We want to identify variants in/around each microsat region and then re-align.
### Index files
Make sure you have the fasta and its indexes in this folder.
#### SKIP: Index samples
```
#Indexing all samples
conda deactivate
conda activate bowtie2
INDS=($(for i in ./*realigned.bam; do echo $(basename ${i%_r*}); done))
for IND in ${INDS[@]};
do
# declare variables
INP=./${IND}_realigned.bam
echo "Indexing $IND with samtools"
samtools index $INP
done
```
#### Index fasta
```
conda activate bowtie2
samtools faidx micro_sequences.fa
```
### Samtools Variant Calling - want VCF of only indels
I need to do more research on which filters to apply, but this base code at least gets you a VCF of variant sites.
Notes here: https://www.biostars.org/p/9466154/
```
bcftools mpileup -Ou -d 8000 -x -B -h 500 -m 3 -f micro_sequences.fa *.bam | bcftools call -V snps -mv -Ov -o calls_-d8000_-B_-h500_-m3.vcf
#Shows how many indels were found
grep -c "INDEL" calls_-d8000_-B_-h500_-m3.vcf
```
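The re-alignment step below takes a `-known` VCF of indels only; a sketch of extracting just the indel records with bcftools (file names here follow the parameter-tagged naming used above):
```
#Keep only indel records for RealignerTargetCreator/IndelRealigner
bcftools view -v indels calls_-d8000_-B_-h500_-m3.vcf -Ov -o calls_-d8000_-B_-h500_-m3_INDEL.vcf
```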
### Realigning reads around indels
GATK has RealignerTargetCreator and IndelRealigner to handle this. Note that these tools were removed from GATK versions after 3.6, so we'll install 3.6 for these tools and GATK4 for subsequent tools.
Tutorial:
https://github.com/broadinstitute/gatk-docs/blob/master/gatk3-tutorials/(howto)_Perform_local_realignment_around_indels.md
#### Setup
```
#Installing GATk3.6
conda create -n GATk
conda activate GATk
conda install -c bioconda gatk=3.6
conda deactivate
#Make a .dict file of reference fasta
conda activate GATk4
gatk CreateSequenceDictionary -R micro_sequences.fa
conda deactivate
```
#### Running Realigner Target Creator
:::spoiler script
```
#!/bin/bash
#SBATCH --account=p31939 ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=short ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time=1:30:00 ## Required: How long will the job need to run (remember different partitions have restrictions on this parameter)
#SBATCH --nodes=1 ## how many computers/nodes do you need (no default)
#SBATCH --ntasks-per-node=1 ## how many cpus or processors do you need on per computer/node (default value 1)
#SBATCH --mem-per-cpu=2G ## how much RAM do you need per computer/node (this affects your FairShare score so be careful to not ask for more than you need)$
#SBATCH --mail-type=ALL ## BEGIN, END, FAIL or ALL
#SBATCH --mail-user=brendanconnolly2023@u.northwestern.edu
#SBATCH --error=IndelTargetCreator_d4000_B_x_h500_m3.error # Enter a error log file name
#SBATCH --output=IndelTargetCreator_d4000_B_x_h500_m3.out # Enter your out.log file name
eval "$(conda shell.bash hook)"
conda activate /home/bjc8165/.conda/envs/GATk
gatk3 -T RealignerTargetCreator -R micro_sequences.fa -known calls_-d4000_-B_-x_-h500_-m3_INDEL.vcf \
-I 109-F4-BU-3_marked.bam \
-I 68-F3-BU-2_marked.bam \
-I 34A-F1-GH-9_marked.bam \
-I 69-F7-BU-3_marked.bam \
-I 69-F7-BU-4_marked.bam \
-I 52-F5-BU-3_marked.bam \
-I 91-F2-BU-3_marked.bam \
-I 4-F7-BU-1_marked.bam \
-I 60-F7-BU-2_marked.bam \
-I 68-F3-BU-5_marked.bam \
-I 60-F7-BU-1_marked.bam \
-I 4_Flav_marked.bam \
-I 92-F2-BU-1_marked.bam \
-I 65-F3-BU-2_marked.bam \
-I 68-F3-BU-4_marked.bam \
-I 29_Appo_marked.bam \
-I 69_Flav_marked.bam \
-I 3-F8-BU-2_marked.bam \
-I 69-F6-BU-4_marked.bam \
-I 69-F7-BU-2_marked.bam \
-I 69-F6-BU-3_marked.bam \
-I 60-F6-BU-1_marked.bam \
-I 60-F6-BU-3_marked.bam \
-I 69-F6-BU-1_marked.bam \
-I 68-F3-BU-3_marked.bam \
-I 64-F6-BU-4_marked.bam \
-I 70A_Appo_marked.bam \
-I 65_Flav_marked.bam \
-I 70B_Appo_marked.bam \
-I 3-F8-BU-1_marked.bam \
-I 67_Flav_marked.bam \
-I 44-F1-GH-45_marked.bam \
-I 66_Flav_marked.bam \
-I 68_Flav_marked.bam \
-I 52_Appo_marked.bam \
-I 45_Appo_marked.bam \
-I 69-F7-BU-1_marked.bam \
-I 91_Appo_marked.bam \
-I 86_Appo_marked.bam \
-I 1_Flav_marked.bam \
-I 92_Appo_marked.bam \
-I 28_Appo_marked.bam \
-I 109-F4-BU-2_marked.bam \
-I 101-F3-BU-2_marked.bam \
-I 92-F2-BU-2_marked.bam \
-I 102-F2-GH-7_marked.bam \
-I 63_Flav_marked.bam \
-I 91-F2-BU-4_marked.bam \
-I 104_Appo_marked.bam \
-I 125B-F13-BU-2_marked.bam \
-I 108_Appo_marked.bam \
-I 2_Flav_marked.bam \
-I 28-F10-BU-1_marked.bam \
-I 109_Appo_marked.bam \
-I 3_Flav_marked.bam \
-I 71_Appo_marked.bam \
-I 44-F1-GH-35_marked.bam \
-I 29-F4-GH-29_marked.bam \
-I 52-F3-GH-17_marked.bam \
-I 71-F2-GH-16_marked.bam \
-I 65-F3-BU-1_marked.bam \
-I 52-F5-BU-2_marked.bam \
-I 66B-F6-BU-1_marked.bam \
-I 3-F6-GH-3_marked.bam \
-I 69-F6-BU-2_marked.bam \
-I 91-F1-GH-19_marked.bam \
-I 71-F2-GH-21_marked.bam \
-I 71-F2-GH-25_marked.bam \
-I 45-F2-GH-29_marked.bam \
-I 64_Flav_marked.bam \
-I 52-F5-BU-4_marked.bam \
-I 60-F6-BU-2_marked.bam \
-I 52-F3-GH-11_marked.bam \
-I 44-F2-BU-5_marked.bam \
-I 125A_Flav_marked.bam \
-I 68-F3-BU-7_marked.bam \
-I 28-F8-BU-1_marked.bam \
-I 65-F3-BU-4_marked.bam \
-I 60-F6-BU-4_marked.bam \
-I 65-F3-BU-3_marked.bam \
-I 45-F2-GH-17_marked.bam \
-I 101-F1-GH-6_marked.bam \
-I 64-F6-BU-1_marked.bam \
-I 29-F4-GH-13_marked.bam \
-I 126_Flav_marked.bam \
-I 44-F2-BU-6_marked.bam \
-I 45-F2-GH-22_marked.bam \
-I 69-F5-GH-25_marked.bam \
-I 60-F7-BU-4_marked.bam \
-I 125B_Flav_marked.bam \
-I 4-F7-BU-2_marked.bam \
-I 52-F3-GH-21_marked.bam \
-I 44-F2-BU-4_marked.bam \
-I 4-F5-GH-3_marked.bam \
-I 44_Appo_marked.bam \
-I 3-F8-BU-3_marked.bam \
-I 60_Flav_marked.bam \
-I 125B-F11-GH-19_marked.bam \
-I 45-F2-GH-7_marked.bam \
-I 64-F6-BU-3_marked.bam \
-I 104-F3-BU-1_marked.bam \
-I 103_Appo_marked.bam \
-I 71-F2-GH-28_marked.bam \
-I 70A-F1-GH-2_marked.bam \
-I 125B-F12-BU-1_marked.bam \
-I 44-F2-BU-3_marked.bam \
-I 28-F8-BU-3_marked.bam \
-I 102-F2-GH-16_marked.bam \
-I 66-69_NF1_marked.bam \
-I 109-F4-BU-1_marked.bam \
-I 125B-F10-GH-3_marked.bam \
-I 66B-F6-BU-3_marked.bam \
-I 70B-F2-GH-4_marked.bam \
-I 101-F1-GH-20_marked.bam \
-I 44-45_NF1_marked.bam \
-I 70A-F1-GH-5_marked.bam \
-I 125B-F11-GH-3_marked.bam \
-I 91-F2-BU-9_marked.bam \
-I 165_Control_marked.bam \
-I 44-F1-GH-17_marked.bam \
-I 66B-F6-BU-7_marked.bam \
-I 60-F6-BU-5_marked.bam \
-I 60-F6-BU-7_marked.bam \
-I 69-F7-BU-6_marked.bam \
-I 70A-F1-GH-36_marked.bam \
-I 92-F2-GH-15_marked.bam \
-I 66B-F6-BU-2_marked.bam \
-I 28-F8-BU-4_marked.bam \
-I 69-F6-BU-6_marked.bam \
-I 175_Control_marked.bam \
-I 125B-F11-GH-24_marked.bam \
-I 92-F1-GH-5_marked.bam \
-I 68-F3-BU-6_marked.bam \
-I 125B-F11-GH-20_marked.bam \
-I 104-F3-BU-7_marked.bam \
-I 45-F2-GH-2_marked.bam \
-I 188_Control_marked.bam \
-I 69-F7-BU-5_marked.bam \
-I 70A-F1-GH-7_marked.bam \
-I 3-F6-GH-4_marked.bam \
-I 109-F3-BU-3_marked.bam \
-I 44-F1-GH-30_marked.bam \
-I 104-F2-GH-8_marked.bam \
-I 92-F1-GH-3_marked.bam \
-I 125B-F11-GH-8_marked.bam \
-I 60-F7-BU-9_marked.bam \
-I 183_Control_marked.bam \
-I 1-F7-BU-10_marked.bam \
-I 52-F5-BU-1_marked.bam \
-I 108-109_NF1_marked.bam \
-I 187_Control_marked.bam \
-I 44-F1-GH-18_marked.bam \
-I 109-F4-BU-4_marked.bam \
-I 66B-F6-BU-6_marked.bam \
-I 91-92_NF1_marked.bam \
-I 125-126_NF1_marked.bam \
-I 104-F3-BU-6_marked.bam \
-I 44-F1-GH-21_marked.bam \
-I 44-F2-GH-7_marked.bam \
-I 109-F3-BU-1_marked.bam \
-I 84_Appo_marked.bam \
-I 125B-F12-BU-2_marked.bam \
-I 3-F6-GH-12_marked.bam \
-I 60-F6-BU-6_marked.bam \
-I 91-F1-GH-14_marked.bam \
-I 109-F3-BU-7_marked.bam \
-I 52-F3-GH-15_marked.bam \
-I 92-F1-GH-2_marked.bam \
-I 168_Control_marked.bam \
-I 127_Flav_marked.bam \
-I 28-29_NF1_marked.bam \
-I 29-F4-GH-1_marked.bam \
-I 70B-F2-GH-1_marked.bam \
-I 109-F3-BU-2_marked.bam \
-I 104-F3-BU-8_marked.bam \
-I 125B-F13-BU-8_marked.bam \
-I 45-F2-GH-33_marked.bam \
-I 63-65_NF1_marked.bam \
-I 162_Control_marked.bam \
-I 44-F1-GH-29_marked.bam \
-I 91-F2-BU-5_marked.bam \
-I 172_Control_marked.bam \
-I 70B-F2-GH-23_marked.bam \
-I 60_NF1_marked.bam \
-I 60-F7-BU-6_marked.bam \
-I 8_NF1_marked.bam \
-I 52-F5-BU-5_marked.bam \
-I 86_NF1_marked.bam \
-I 102_Appo_marked.bam \
-I 101-F3-BU-1_marked.bam \
-I 102-F2-GH-9_marked.bam \
-I 184_Control_marked.bam \
-I 104-F3-BU-3_marked.bam \
-I 100-104_NF1_marked.bam \
-I 60-F7-BU-8_marked.bam \
-I 101-F1-GH-23_marked.bam \
-I 109-F3-BU-4_marked.bam \
-I 70B-F3-GH-1_marked.bam \
-I 60-F5-GH-25_marked.bam \
-I 104-F2-GH-44_marked.bam \
-I 104-F3-BU-2_marked.bam \
-I 70A-F1-GH-1_marked.bam \
-I 28-F8-BU-2_marked.bam \
-I 101_Appo_marked.bam \
-I 52_NF1_marked.bam \
-I 125B-F13-BU-7_marked.bam \
-I 68-F3-BU-9_marked.bam \
-I 66B-F7-BU-1_marked.bam \
-I 45-F2-GH-1_marked.bam \
-I 65-F2-GH-27_marked.bam \
-I 2_3_141_142_NF1_marked.bam \
-I 101-F1-GH-13_marked.bam \
-I 70B-F2-GH-19_marked.bam \
-I 91-F1-GH-13_marked.bam \
-I 70A-F1-GH-4_marked.bam \
-I 181_Control_marked.bam \
-I 92-F1-GH-8_marked.bam \
-I 34_NF1_marked.bam \
-I 102-F2-GH-17_marked.bam \
-I 3-F8-GH-4_marked.bam \
-I 7_Flav_marked.bam \
-I 126-F13-GH-11_marked.bam \
-I 101-F1-GH-10_marked.bam \
-I 69-F6-BU-5_marked.bam \
-I 60-F7-BU-5_marked.bam \
-I 186_Control_marked.bam \
-I 70B-F2-GH-24_marked.bam \
-I 28-F9-BU-1_marked.bam \
-I 173_Control_marked.bam \
-I 104-F2-GH-7_marked.bam \
-I 109-F4-BU-5_marked.bam \
-I 70B-F2-GH-33_marked.bam \
-I 52-F4-GH-17_marked.bam \
-I 64-F6-BU-5_marked.bam \
-I 3-F6-GH-13_marked.bam \
-I 69-F7-BU-7_marked.bam \
-I 170_Control_marked.bam \
-I 104-F3-BU-5_marked.bam \
-I 2-F2-GH-7_marked.bam \
-I 66B-F6-BU-4_marked.bam \
-I 69-F6-BU-7_marked.bam \
-I 52-F3-GH-8_marked.bam \
-I 92-F1-GH-7_marked.bam \
-I 70B-F2-GH-20_marked.bam \
-I 185_Control_marked.bam \
-I 91-F2-BU-6_marked.bam \
-I 125B-F11-GH-16_marked.bam \
-I 91-F2-BU-7_marked.bam \
-I 52-F5-BU-6_marked.bam \
-I 125B-F13-BU-1_marked.bam \
-I 66B-F6-BU-8_marked.bam \
-I 109-F3-BU-5_marked.bam \
-I 83-84_NF1_marked.bam \
-I 102-F2-GH-20_marked.bam \
-I 100_Appo_marked.bam \
-I 70-71_NF1_marked.bam \
-I 92-F1-GH-4_marked.bam \
-I 92-F1-GH-6_marked.bam \
-I 125B-F13-BU-6_marked.bam \
-I 125B-F13-BU-5_marked.bam \
-I 34_Appo_Flav_marked.bam \
-I 52-F5-BU-7_marked.bam \
-I 70A-F1-GH-17_marked.bam \
-o realignertargetcreator_d4000_B_x_h500_m3_INDEL.intervals
```
:::
```
#make new directory for re-aligned files
mkdir indel_realigned_d8000_B_x_h500_m3
```
#### Realign reads using IndelRealigner
```
#!/bin/bash
#SBATCH --account=p31939 ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=normal ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time=12:00:00 ## Required: How long will the job need to run (remember different partitions have restrictions on this parameter)
#SBATCH --nodes=1 ## how many computers/nodes do you need (no default)
#SBATCH --ntasks-per-node=1 ## how many cpus or processors do you need on per computer/node (default value 1)
#SBATCH --mem-per-cpu=10G ## how much RAM do you need per computer/node (this affects your FairShare score so be careful to not ask for more than you need)
#SBATCH --mail-type=ALL ## BEGIN, END, FAIL or ALL
#SBATCH --mail-user=brendanconnolly2023@u.northwestern.edu
#SBATCH --error=indel_realigned_d8000_B_x_h500_m3.error # Enter a error log file name
#SBATCH --output=indel_realigned_d8000_B_x_h500_m3.out # Enter your out.log file name
eval "$(conda shell.bash hook)"
conda activate /home/bjc8165/.conda/envs/GATk
INDS=($(for i in ./*marked.bam; do echo $(basename ${i%_m*}); done))
for IND in ${INDS[@]};
do
# declare variables
INP=./${IND}_marked.bam
OUTPUT=./indel_realigned_d8000_B_x_h500_m3/${IND}_realigned.bam
echo "Realigning $IND with gatk3"
gatk3 -Xmx8G -T IndelRealigner \
-R micro_sequences.fa \
-targetIntervals realignertargetcreator_d8000_B_x_h500_m3.intervals \
-known calls_-d8000_-B_-x_-h500_-m3_INDEL.vcf \
-I $INP \
-o $OUTPUT
done
```
The following files were truncated during realignment. Not sure how to fix; IndelRealigner has options like -maxReadsInMemory and -Xmx, but I can't get it to run with -maxReadsInMemory.
```
4-F7-BU-1_*
52-F5-BU-3_*
60-F7-BU-1_*
60-F7-BU-2_*
68-F3-BU-2_*
68-F3-BU-5_*
69-F7-BU-3_*
69-F7-BU-4_*
```
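To spot truncated BAMs without waiting for a downstream crash, samtools quickcheck can be run over the realigned files (a sketch):
```
#Prints the names of any BAMs failing basic integrity checks (e.g. a missing EOF block)
samtools quickcheck -v ./indel_realigned_d8000_B_x_h500_m3/*_realigned.bam
```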
## 3.) Variant Calling - RepeatSeq
Package: https://github.com/adaptivegenome/repeatseq
Citation: Highnam, G., Franck, C., Martin, A., Stephens, C., Puthige, A., and Mittelman, D. (2012). Accurate human microsatellite genotypes from high-throughput resequencing data using informed error profiles. Nucleic Acids Res. 41:e32. doi: 10.1093/nar/gks981. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3592458/pdf/gks981.pdf
### Installing RepeatSeq
```
#Creating conda environments
conda create -n repeatseq
conda activate repeatseq
conda install git
conda install conda-forge::cmake
#Clone repeatseq in chosen directory
git clone https://github.com/adaptivegenome/repeatseq
cd repeatseq
git clone https://github.com/pezmaster31/bamtools.git
cd bamtools/
git checkout 9cfa70b
cd ..
git clone https://github.com/ekg/fastahack.git
mkdir bamtools/build
cd bamtools/build/
cmake ..
make
#build repeatseq
cd ../..
make
```
### Running RepeatSeq
First we need to make a .regions file, which is tab-separated with the first column being "lociName:coordinates" and the second being "_RepeatMotif".
Example:
```
KX377703.1:86-107 _AG
KX377702.1:208-224 _CA
KJ733933.1:149-162 _TC
KJ733931.1:150-164 _CT
KJ733930.1:160-182 _AC
KT282132.1:45-61 _CA
KT282133.1:67-83 _GT
KT282134.1:41-57 _GT
KT282135.1:68-86 _AG
KT282136.1:47-63 _CT
KT282137.1:87-103 _AG
KT282138.1:78-95 _TC
KT282139.1:136-152 _AG
KT282140.1:250-269 _TG
KT282141.1:49-84 _GA
```
#### RepeatSeq Prep
```
#Set LD_LIBRARY_PATH to bamtools folder
export LD_LIBRARY_PATH="/projects/p31939/DENU_Amplicon/micro_analysis/test_RG/bowtie_local/repeatseq/bamtools/lib"
#Copy Fasta File to folder with re-aligned BAMs
cp micro_sequences.* ./indel_realigned
#Change index file names
INDS=($(for i in ./*realigned.bai; do echo $(basename ${i%_r*}); done))
for IND in ${INDS[@]};
do
# declare variables
INP=./${IND}_realigned.bai
OUTPUT=${IND}_realigned.bam.bai
mv $INP $OUTPUT
done
```
#### Run repeatseq
This script requires reads to have **EXACT matches for the 8bp before and after the repeat sequence**:
```
#!/bin/bash
#SBATCH --account=p31939 ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=short ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time=4:00:00 ## Required: How long will the job need to run (remember different partitions have restrictions on this parameter)
#SBATCH --nodes=1 ## how many computers/nodes do you need (no default)
#SBATCH --ntasks-per-node=1 ## how many cpus or processors do you need on per computer/node (default value 1)
#SBATCH --mem-per-cpu=15G ## how much RAM do you need per computer/node (this affects your FairShare score so be careful to not ask for more than you need)
#SBATCH --mail-type=ALL ## BEGIN, END, FAIL or ALL
#SBATCH --mail-user=brendanconnolly2023@u.northwestern.edu
#SBATCH --error=repeatseq_-1flanking_8before_8after.error # Enter a error log file name
#SBATCH --output=repeatseq_-1flanking_8before_8after.out # Enter your out.log file name
export LD_LIBRARY_PATH="/projects/p31939/DENU_Amplicon/micro_analysis/test_RG/bowtie_local/repeatseq/bamtools/lib"
INDS=($(for i in ./*realigned.bam; do echo $(basename ${i%_r*}); done))
for IND in ${INDS[@]};
do
# declare variables
INP=./${IND}_realigned.bam
echo "RepeatSeq on $IND"
/projects/p31939/DENU_Amplicon/micro_analysis/test_RG/bowtie_local/repeatseq/repeatseq -L 8 -R 8 -calls -repeatseq $INP micro_sequences.fa exact.regions
done
#reset LD_LIBRARY_PATH when done
unset LD_LIBRARY_PATH
```
## 4.) Variant Filtering
The .repeatseq file contains all reads that matched a repeat motif, but we need to do some manual filtering to turn this into usable data.
### Filtering we want to do (Pimentel et. al. 2018)
1.) Eliminate everything but the 2 most common alleles per locus per sample
2.) Remove alleles with < 10 reads
3.) Genotype samples: if the minor allele is less than 20% of reads, the sample is homozygous
4.) Remove loci and samples with > 50% missing data (a sketch for the sample side follows step 4e below)
### 4a.) Create file summarizing alleles and their coverage
:::spoiler extract_allele_coverage_pt1.sh script
```
#!/bin/bash
# Check if at least one file name is provided as an argument
if [ $# -eq 0 ]; then
echo "Usage: $0 <repeatseq_file1> [repeatseq_file2 ...]"
exit 1
fi
# Function to process a single file
process_file() {
local repeatseq_file=$1
local output_file="${repeatseq_file%.repeatseq}_allele_coverage_summary.txt"
# Initialize or clear the output file
> "$output_file"
# Function to extract allele information and coverage from the header line
extract_allele_coverage() {
local header_line="$1"
local start_line="$2"
# Extract loci name (first character string after ~)
local loci_name=$(echo "$header_line" | awk '{print $1}' | sed 's/~//')
# Extract genotype (GT field)
local genotype_info=$(echo "$header_line" | grep -oP 'GT:\K[^ ]+')
# Extract allele information (A field)
local allele_info=$(echo "$header_line" | grep -oP 'A:\K[^C]+')
# Print loci name, genotype, and allele information
echo "Loci: $loci_name" >> "$output_file"
echo "Genotype: $genotype_info" >> "$output_file"
# Process each allele and its coverage
echo "$allele_info" | tr ' ' '\n' | while IFS= read -r allele; do
# Extract allele and coverage from the format "allele[coverage]"
local allele_name=$(echo "$allele" | grep -oP '^[0-9]+')
local coverage=$(echo "$allele" | grep -oP '\[\K[0-9]+(?=\])')
if [ -z "$coverage" ]; then
# Coverage value is missing, calculate it
coverage=$(awk -v start="$start_line" 'NR > start && /^~/{exit} NR > start {count++} END{print count-1}' "$repeatseq_file")
fi
if [ -n "$allele_name" ]; then
echo "Allele: $allele_name, Coverage: $coverage" >> "$output_file"
fi
done
echo "" >> "$output_file" # Add a blank line for separation
}
# Track line number and process each record in the file
local line_number=0
while IFS= read -r line; do
((line_number++))
if [[ "$line" =~ ^~ ]]; then
# Process header line
header_line="${line#~}"
extract_allele_coverage "$header_line" "$line_number"
elif [[ "$line" =~ ^T|^CCG|^GTG ]]; then
# Process read lines
continue
else
# Skip non-header and non-read lines (e.g., reference sequence)
continue
fi
done < "$repeatseq_file"
echo "Allele coverage has been extracted to $output_file"
}
# Loop through each provided file and process it
for repeatseq_file in "$@"; do
if [ ! -f "$repeatseq_file" ]; then
echo "File not found: $repeatseq_file"
continue
fi
process_file "$repeatseq_file"
done
```
:::
```
#Make script executable
chmod +x ./extract_allele_coverage_pt1.sh
#Execute it on all files in directory
./extract_allele_coverage_pt1.sh *.repeatseq
```
### 4b.) Keep only the two most common alleles per locus per sample
:::spoiler filter_2alleles_pt2.sh
```
#!/bin/bash
# Function to process the file and filter alleles
process_file() {
local input_file="$1"
local base_name=$(basename "$input_file" "_realigned.bam.L8.R8_allele_coverage_summary.txt")
local output_file="${base_name}.filtered_2alleles.txt"
# Initialize the output file
> "$output_file"
local in_loci_section=0
local allele_count=0
# Read the input file line by line
while IFS= read -r line; do
if [[ $line == Loci:* ]]; then
# If in a loci section, reset the allele count
if [[ $in_loci_section -eq 1 ]]; then
true
fi
# Start a new loci section
in_loci_section=1
allele_count=0
echo "$line" >> "$output_file"
elif [[ $line == Genotype:* ]]; then
# Write Genotype line to the output file
echo "$line" >> "$output_file"
elif [[ $line == Allele:* ]]; then
if [[ $in_loci_section -eq 1 ]]; then
# Only write the first two alleles
if [[ $allele_count -lt 2 ]]; then
echo "$line" >> "$output_file"
fi
((allele_count++))
fi
else
# Write other lines (if any) to the output file
echo "$line" >> "$output_file"
fi
done < "$input_file"
echo "Filtered file created: $output_file"
}
# Check if at least one argument is provided
if [ "$#" -lt 1 ]; then
echo "Usage: $0 pattern.txt"
exit 1
fi
# Process each file matching the pattern
for file in "$@"; do
for f in $file; do
if [ -f "$f" ]; then
process_file "$f"
else
echo "File $f not found!"
fi
done
done
```
:::
```
#Make script executable
chmod +x ./filter_2alleles_pt2.sh
#Execute it on all files in directory
./filter_2alleles_pt2.sh *_allele_coverage_summary.txt
```
### 4c.) Remove alleles with coverage less than 10
:::spoiler filter_alleles_10count_pt3.sh
```
#!/bin/bash
# Function to filter alleles with coverage of 10 or more
filter_coverage() {
local input_file="$1"
local base_name=$(basename "$input_file" ".filtered_2alleles.txt")
local output_file="${base_name}.filtered_2alleles_10count.txt"
# Initialize the output file
> "$output_file"
local in_loci_section=0
local allele_count=0
# Read the input file line by line
while IFS= read -r line; do
if [[ $line == Loci:* ]]; then
# Start a new loci section
in_loci_section=1
allele_count=0
echo "$line" >> "$output_file"
elif [[ $line == Genotype:* ]]; then
# Write Genotype line to the output file
echo "$line" >> "$output_file"
elif [[ $line == Allele:* ]]; then
if [[ $in_loci_section -eq 1 ]]; then
# Extract Coverage value
local coverage=$(echo "$line" | awk -F"Coverage: " '{print $2}' | awk '{print $1}')
# Only write alleles with coverage of 10 or more
if [ "$coverage" -ge 10 ]; then
echo "$line" >> "$output_file"
fi
fi
else
# Write other lines (if any) to the output file
echo "$line" >> "$output_file"
fi
done < "$input_file"
echo "Coverage filtered file created: $output_file"
}
# Check if at least one argument is provided
if [ "$#" -lt 1 ]; then
echo "Usage: $0 pattern.txt"
exit 1
fi
# Process each file matching the pattern
for file in "$@"; do
for f in $file; do
if [ -f "$f" ]; then
filter_coverage "$f"
else
echo "File $f not found!"
fi
done
done
```
:::
```
#Make script executable
chmod +x ./filter_alleles_10count_pt3.sh
#Execute it on all files in directory
./filter_alleles_10count_pt3.sh *.filtered_2alleles.txt
```
### 4d.) Remove alleles that have <20% of total coverage - i.e. mark as homozygous
:::spoiler filter_alleles_20percent_pt4.sh
```
#!/bin/bash
# Iterate over each *.filtered_2alleles_10count.txt file in the directory
for input_file in *.filtered_2alleles_10count.txt; do
# Define a temporary output file
output_file="${input_file%.txt}_20%.txt"
# Initialize variables
current_loci=""
total_coverage=0
coverage_threshold=0
allele_lines=()
# Process the input file
while IFS= read -r line; do
if [[ "$line" =~ ^Loci: ]]; then
# When encountering a new locus, process the previous locus if it exists
if [[ -n "$current_loci" ]]; then
# Write the locus and genotype to the output file
echo "$current_loci" >> "$output_file"
echo "$genotype_line" >> "$output_file"
# Filter and write allele lines based on the coverage threshold
for allele_line in "${allele_lines[@]}"; do
coverage=$(echo "$allele_line" | awk -F', Coverage: ' '{print $2}')
if [[ "$coverage" -ge "$coverage_threshold" ]]; then
echo "$allele_line" >> "$output_file"
fi
done
# Reset variables for the next locus
total_coverage=0
coverage_threshold=0
allele_lines=()
fi
# Update current locus
current_loci=$(echo "$line" | sed 's/Loci: //')
genotype_line=""
elif [[ "$line" =~ ^Genotype: ]]; then
# Store genotype line
genotype_line="$line"
elif [[ "$line" =~ ^Allele: ]]; then
# Process allele line to extract coverage
allele_lines+=("$line")
coverage=$(echo "$line" | awk -F', Coverage: ' '{print $2}')
total_coverage=$((total_coverage + coverage))
elif [[ -z "$line" ]]; then
# Handle empty lines (end of a locus block)
if [[ -n "$current_loci" ]]; then
# Calculate coverage threshold (20%)
coverage_threshold=$((total_coverage * 20 / 100))
# Write the locus and genotype to the output file
echo "$current_loci" >> "$output_file"
echo "$genotype_line" >> "$output_file"
# Filter and write allele lines based on the coverage threshold
for allele_line in "${allele_lines[@]}"; do
coverage=$(echo "$allele_line" | awk -F', Coverage: ' '{print $2}')
if [[ "$coverage" -ge "$coverage_threshold" ]]; then
echo "$allele_line" >> "$output_file"
fi
done
# Reset variables for the next locus
current_loci=""
total_coverage=0
coverage_threshold=0
allele_lines=()
fi
fi
done < "$input_file"
# Handle the last locus in the file if not followed by an empty line
if [[ -n "$current_loci" ]]; then
coverage_threshold=$((total_coverage * 20 / 100))
echo "$current_loci" >> "$output_file"
echo "$genotype_line" >> "$output_file"
for allele_line in "${allele_lines[@]}"; do
coverage=$(echo "$allele_line" | awk -F', Coverage: ' '{print $2}')
if [[ "$coverage" -ge "$coverage_threshold" ]]; then
echo "$allele_line" >> "$output_file"
fi
done
fi
done
```
:::
```
#Make script executable
chmod +x ./filter_alleles_20percent_pt4.sh
#Execute it on all files in directory
./filter_alleles_20percent_pt4.sh
```
Note that the 20% cut-off uses bash integer arithmetic, so it rounds down: a locus with total coverage 97 gets a threshold of 19 (97 × 20 / 100, floored), and any allele whose coverage falls below that is dropped, leaving the locus effectively homozygous.
### 4e.) Change the final genotype
:::spoiler filter_alleles_changegeno_pt5.sh
```
#!/bin/bash
# First pass over each matching file; this produces a temporary file that still contains duplicate genotype rows
for file in *.filtered_2alleles_10count_20%.txt; do
# Create a new file to store the processed data
output_file="${file%.txt}_updated.txt"
# Initialize variables
prev_line=""
current_line=""
last_line=""
# Process each line of the input file
while IFS= read -r current_line; do
# Reset the numeric allele variables for the current line
first_numeric=""
second_numeric=""
# Check if the line is a loci row
if [[ $current_line != "Allele:"* && $current_line != "Genotype:"* ]]; then
# If the previous line is a genotype row, then print it as "NA:NA"
if [[ $prev_line == "Genotype:"* ]]; then
echo "Genotype: NA:NA" >> "$output_file"
echo "Loci: $current_line" >> "$output_file"
else #if previous row was an allele row, then just print the loci
echo "Loci: $current_line" >> "$output_file"
fi
# Check if current row is an allele row
elif [[ $current_line == "Allele:"* ]]; then
# Extract the alleles using grep
first_numeric=$(echo "$current_line" | grep -oE '[0-9]+' | head -n 1)
second_numeric=$(echo "$prev_line" | grep -oE '[0-9]+' | head -n 1)
# If the previous row was also an allele row, the locus is heterozygous
if [[ $prev_line == "Allele:"* ]]; then
echo "Genotype: $second_numeric:$first_numeric" >> "$output_file"
else
echo "Genotype: $first_numeric:$first_numeric" >> "$output_file"
fi
fi
# Update the previous line to the current line for the next iteration
prev_line="$current_line"
done < "$file"
#Define last_line as the last line of the file
last_line=$(tail -n 1 "$file")
#Print last line of input file if it starts with Genotype: (i.e. missing allele)
if [[ $last_line == "Genotype:"* ]]; then
echo "Genotype: NA:NA" >> "$output_file"
fi
echo "Updated file saved as $output_file"
done
#Second pass: clean up the temporary file, keeping only the final genotype row for each locus
for file in *_updated.txt; do
# Create a new file to store the processed data
base_name=$(basename "$file" ".filtered_2alleles_10count_20%_updated.txt")
output_file2="${base_name}_finalgeno.txt"
# Initialize variables
prev_line=""
current_line=""
# Process each line of the input file
while IFS= read -r current_line; do
#Check if line is a loci line
if [[ $current_line != "Allele:"* && $current_line != "Genotype:"* ]]; then
echo "$prev_line" >> "$output_file2"
echo "$current_line" >> "$output_file2"
fi
# Update the previous line to the current line for the next iteration
prev_line="$current_line"
done < "$file"
# Print the last line of file into output
tail -n 1 "$file" >> "$output_file2"
echo "Updated file saved as $output_file2"
done
#Remove the temporary *_updated.txt files
rm *updated.txt
```
:::
```
#Make script executable
chmod +x ./filter_alleles_changegeno_pt5.sh
#Execute it; the script itself globs for *.filtered_2alleles_10count_20%.txt
./filter_alleles_changegeno_pt5.sh
```
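To make the genotype rewrite concrete, here is a hypothetical locus block (invented values) and what pt5 turns it into:
```
# Input block (output of 4d):
Loci: locus_0042
Genotype: 0/1
Allele: 210, Coverage: 55
Allele: 214, Coverage: 40

# Output in *_finalgeno.txt:
Loci: locus_0042
Genotype: 210:214
```
A locus with a single surviving allele becomes `Genotype: 210:210`, and a locus with no surviving alleles becomes `Genotype: NA:NA`.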
### 4f.) Output the final .csv
:::spoiler create_csv.sh
```
#!/bin/bash
# Define output CSV file
output_csv="combined_genotypes.csv"
# Temporary files
temp_loci_file=$(mktemp)
temp_genotypes_file=$(mktemp)
# Initialize the output CSV file
echo "" > "$output_csv"
# Collect all unique loci
for txt_file in *_finalgeno.txt; do
awk '
BEGIN { FS=": "; }
$1 == "Loci" {
loci = $2;
if (loci !~ /^[[:space:]]*$/) {
loci_list[loci] = "";
}
}
END {
for (loci in loci_list) {
print loci;
}
}
' "$txt_file" >> "$temp_loci_file"
done
# Create unique list of loci
sort -u "$temp_loci_file" > "$temp_loci_file.sorted"
loci_list=$(cat "$temp_loci_file.sorted")
# Add loci as headers to the output CSV
echo -n "Sample," > "$output_csv"
echo "$loci_list" | tr '\n' ',' >> "$output_csv"
echo "" >> "$output_csv"
# Debugging: Print the list of loci headers
echo "Loci headers: $(cat "$temp_loci_file.sorted")" >&2
# Populate the CSV with genotypes
for txt_file in *_finalgeno.txt; do
sample_name=$(basename "$txt_file" _finalgeno.txt)
# Initialize an array for genotypes
genotype_row=()
# Debugging: Print current sample being processed
echo "Processing sample: $sample_name" >&2
# Loop through each loci to find genotypes
while IFS=, read -r loci; do
genotype=$(awk -F, -v loci="$loci" '
BEGIN { FS=": "; }
$1 == "Loci" {
current_loci = $2;
}
$1 == "Genotype" && current_loci == loci {
print $2;
exit;
}
END {
# print "NA";
}
' "$txt_file")
# Debugging: Print the locus and genotype found
echo "Locus: $loci, Genotype: $genotype" >&2
genotype_row+=("$genotype")
done < "$temp_loci_file.sorted"
# Debugging: Print the genotype row for the current sample
echo "Genotype row for $sample_name: ${genotype_row[@]}" >&2
# Append the sample row to the CSV
echo -n "$sample_name," >> "$output_csv"
echo "${genotype_row[@]}" | tr ' ' ',' >> "$output_csv"
done
# Clean up temporary files
rm -f "$temp_loci_file" "$temp_loci_file.sorted"
```
:::
```
#Make script executable
chmod +x ./create_csv.sh
#Execute it on all files in directory
./create_csv.sh
```
Download the .csv file and analyze it in R.
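Before pulling the file into R, a quick sanity check on the cluster can catch misaligned columns. A minimal sketch, assuming the combined_genotypes.csv produced above:
```
#Count header fields (Sample + one per locus)
head -n 1 combined_genotypes.csv | tr ',' '\n' | sed '/^$/d' | wc -l
#Count sample rows
tail -n +2 combined_genotypes.csv | wc -l
```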
## Notes 8/18
### Things to look into:
1.) bcftools mpileup -d 8000: What should -d be set to? Should I lower it? What about other filters? Resource: https://www.biostars.org/p/9466154/
2.) bcftools mpileup: Do I want to specify a search region (i.e. before the repeat motif) or leave this blank? Maybe try it on a large subset of samples.
3.) bcftools mpileup: Should I just search for indels here? Yes! Want to filter the .VCF for just indels (see the sketch after this list).
4.) RealignerTargetCreator: Do I want to run this with all files? Gut says yes.
5.) IndelRealigner: some loci have too many reads. How to deal with that?
6.) Repeatseq .regions file: Include more microsats? Mess with the motifs.
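A minimal sketch of what items 1–3 could look like on one sample; ref.fa and sample_realigned.bam are placeholder names, and the -d value and region are exactly the things still to be decided:
```
#Pile up and call with an explicit max depth; add -r CHROM:FROM-TO to restrict the search region
bcftools mpileup -d 8000 -f ref.fa sample_realigned.bam | \
bcftools call -mv -Ov -o sample.vcf
#Keep only indels (item 3)
bcftools view -v indels sample.vcf -o sample.indels.vcf
```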
## PopGen Analysis in R
### R Tutorials
https://tomjenkins.netlify.app/tutorials/r-popgen-getting-started/
https://julie-hopper.com/2018/12/08/population-genetics-part-iii-data-analysis-data-wrangling-interpretation-and-implications/
:::spoiler IGNORE: How to merge vcfs
### Preparing VCF for R import
First we need to merge all the VCF files (Repeatseq produces one per individual) into a single VCF.
#### Installing vcftools
```
conda create -n vcftools
conda activate vcftools
conda install bioconda::vcftools
conda install bioconda::samtools
conda install bioconda::tabix
```
#### Merging VCF
Merging requires bgzipped and tabix-indexed VCF files as input (e.g. `bgzip file.vcf; tabix -p vcf file.vcf.gz`).
##### Compressing
```
conda activate bowtie2
INDS=($(for i in ./*realigned.bam.vcf; do echo $(basename ${i%_r*}); done))
for IND in ${INDS[@]};
do
# declare variables
INP=./${IND}_realigned.bam.vcf
OUT=./${IND}.vcf.gz
echo "Compressing $IND"
bgzip -c $INP > $OUT ; tabix -p vcf $OUT
done
```
##### Putting sample name in vcf
```
conda activate bowtie2
INDS=($(for i in ./*vcf.gz; do echo $(basename ${i%.v*}); done))
for IND in ${INDS[@]};
do
# declare variables
INP=./${IND}.vcf.gz
OUT=./${IND}.renamed.vcf.gz
echo "renaming sample $IND"
echo $IND > ${IND}.txt
bcftools reheader -s ${IND}.txt $INP -o $OUT
rm ${IND}.txt
done
```
##### Indexing
```
INDS=($(for i in ./*renamed.vcf.gz; do echo $(basename ${i%.r*}); done))
for IND in ${INDS[@]};
do
# declare variables
INP=./${IND}.renamed.vcf.gz
echo "indexing $IND"
tabix -p vcf $INP
done
```
#### Merging
Don't include empty VCFs in the following command. Use `ls -lhS` to spot empty VCFs by their small file size.
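A more programmatic alternative (a sketch, not what was actually run): skip VCFs with zero records and hand bcftools merge a file list instead of typing every name as below:
```
#Collect only VCFs that actually contain variant records
> merge_list.txt
for f in *.renamed.vcf.gz; do
if [ "$(bcftools view -H "$f" | wc -l)" -gt 0 ]; then
echo "$f" >> merge_list.txt
else
echo "Skipping empty $f" >&2
fi
done
#Merge everything in the list
bcftools merge -l merge_list.txt -o merged.vcf
```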
```
bcftools merge 109-F4-BU-4.renamed.vcf.gz \
66B-F6-BU-6.renamed.vcf.gz \
68-F3-BU-6.renamed.vcf.gz \
69-F7-BU-5.renamed.vcf.gz \
91-F1-GH-14.renamed.vcf.gz \
109-F3-BU-3.renamed.vcf.gz \
70A-F1-GH-7.renamed.vcf.gz \
44-F1-GH-30.renamed.vcf.gz \
44-F1-GH-18.renamed.vcf.gz \
84_Appo.renamed.vcf.gz \
104-F3-BU-7.renamed.vcf.gz \
125B-F12-BU-1.renamed.vcf.gz \
29-F4-GH-1.renamed.vcf.gz \
28-F8-BU-1.renamed.vcf.gz \
109-F3-BU-1.renamed.vcf.gz \
104-F2-GH-8.renamed.vcf.gz \
65-F3-BU-4.renamed.vcf.gz \
125-126_NF1.renamed.vcf.gz \
63-65_NF1.renamed.vcf.gz \
125B-F12-BU-2.renamed.vcf.gz \
125B_Flav.renamed.vcf.gz \
44-F2-BU-4.renamed.vcf.gz \
69-F6-BU-6.renamed.vcf.gz \
125B-F11-GH-24.renamed.vcf.gz \
91-92_NF1.renamed.vcf.gz \
92-F1-GH-2.renamed.vcf.gz \
1-F7-BU-10.renamed.vcf.gz \
45-F2-GH-22.renamed.vcf.gz \
70A-F1-GH-36.renamed.vcf.gz \
69-F5-GH-25.renamed.vcf.gz \
92-F1-GH-5.renamed.vcf.gz \
104-F2-GH-44.renamed.vcf.gz \
28-29_NF1.renamed.vcf.gz \
125B-F13-BU-3.renamed.vcf.gz \
29-F4-GH-13.renamed.vcf.gz \
44-F2-GH-7.renamed.vcf.gz \
66B-F6-BU-7.renamed.vcf.gz \
44-F1-GH-21.renamed.vcf.gz \
60-F7-BU-9.renamed.vcf.gz \
70B-F2-GH-1.renamed.vcf.gz \
60-F6-BU-6.renamed.vcf.gz \
28-F8-BU-4.renamed.vcf.gz \
60-F6-BU-5.renamed.vcf.gz \
64-F6-BU-3.renamed.vcf.gz \
104-F3-BU-6.renamed.vcf.gz \
101-F1-GH-10.renamed.vcf.gz \
52-F3-GH-15.renamed.vcf.gz \
108-109_NF1.renamed.vcf.gz \
92-F1-GH-3.renamed.vcf.gz \
109-F3-BU-2.renamed.vcf.gz \
188_Control.renamed.vcf.gz \
64-F6-BU-1.renamed.vcf.gz \
3-F6-GH-12.renamed.vcf.gz \
44-F2-BU-5.renamed.vcf.gz \
60-F5-GH-25.renamed.vcf.gz \
127_Flav.renamed.vcf.gz \
71-F2-GH-28.renamed.vcf.gz \
7_Flav.renamed.vcf.gz \
168_Control.renamed.vcf.gz \
44-F2-BU-3.renamed.vcf.gz \
109-F3-BU-4.renamed.vcf.gz \
70A-F1-GH-1.renamed.vcf.gz \
172_Control.renamed.vcf.gz \
60-F6-BU-7.renamed.vcf.gz \
69-F7-BU-6.renamed.vcf.gz \
45-F2-GH-17.renamed.vcf.gz \
109-F3-BU-7.renamed.vcf.gz \
52-F5-BU-4.renamed.vcf.gz \
3-F8-GH-4.renamed.vcf.gz \
66B-F6-BU-3.renamed.vcf.gz \
100-104_NF1.renamed.vcf.gz \
45-F2-GH-29.renamed.vcf.gz \
45-F2-GH-7.renamed.vcf.gz \
44-F1-GH-17.renamed.vcf.gz \
104-F3-BU-3.renamed.vcf.gz \
125B-F11-GH-8.renamed.vcf.gz \
125B-F13-BU-1.renamed.vcf.gz \
92-F1-GH-8.renamed.vcf.gz \
66B-F6-BU-2.renamed.vcf.gz \
187_Control.renamed.vcf.gz \
186_Control.renamed.vcf.gz \
102-F2-GH-20.renamed.vcf.gz \
52-F3-GH-8.renamed.vcf.gz \
60-F7-BU-8.renamed.vcf.gz \
68-F3-BU-9.renamed.vcf.gz \
125B-F13-BU-7.renamed.vcf.gz \
175_Control.renamed.vcf.gz \
101-F1-GH-13.renamed.vcf.gz \
125B-F11-GH-20.renamed.vcf.gz \
126-F13-GH-11.renamed.vcf.gz \
183_Control.renamed.vcf.gz \
101-F1-GH-23.renamed.vcf.gz \
70B-F2-GH-23.renamed.vcf.gz \
2_3_141_142_NF1.renamed.vcf.gz \
65-F2-GH-27.renamed.vcf.gz \
104-F3-BU-8.renamed.vcf.gz \
28-F8-BU-3.renamed.vcf.gz \
91-F2-BU-5.renamed.vcf.gz \
70B-F2-GH-33.renamed.vcf.gz \
102_Appo.renamed.vcf.gz \
3-F6-GH-13.renamed.vcf.gz \
34_Appo_Flav.renamed.vcf.gz \
102-F2-GH-17.renamed.vcf.gz \
162_Control.renamed.vcf.gz \
181_Control.renamed.vcf.gz \
45-F2-GH-1.renamed.vcf.gz \
52-F5-BU-5.renamed.vcf.gz \
3-F6-GH-4.renamed.vcf.gz \
66B-F7-BU-1.renamed.vcf.gz \
109-F3-BU-5.renamed.vcf.gz \
125B-F13-BU-8.renamed.vcf.gz \
173_Control.renamed.vcf.gz \
34_NF1.renamed.vcf.gz \
66-69_NF1.renamed.vcf.gz \
66B-F6-BU-8.renamed.vcf.gz \
44-F1-GH-29.renamed.vcf.gz \
52-F4-GH-17.renamed.vcf.gz \
126_Flav.renamed.vcf.gz \
70B-F3-GH-1.renamed.vcf.gz \
2-F2-GH-7.renamed.vcf.gz \
91-F2-BU-9.renamed.vcf.gz \
71-F2-GH-25.renamed.vcf.gz \
91-F1-GH-19.renamed.vcf.gz \
101-F3-BU-1.renamed.vcf.gz \
170_Control.renamed.vcf.gz \
4-F7-BU-2.renamed.vcf.gz \
60-F7-BU-5.renamed.vcf.gz \
65-F3-BU-3.renamed.vcf.gz \
104-F2-GH-7.renamed.vcf.gz \
60_Flav.renamed.vcf.gz \
52-F3-GH-17.renamed.vcf.gz \
103_Appo.renamed.vcf.gz \
71-F2-GH-16.renamed.vcf.gz \
101-F1-GH-6.renamed.vcf.gz \
70B-F2-GH-4.renamed.vcf.gz \
104-F3-BU-1.renamed.vcf.gz \
104-F3-BU-5.renamed.vcf.gz \
52_NF1.renamed.vcf.gz \
44-F2-BU-6.renamed.vcf.gz \
70A-F1-GH-5.renamed.vcf.gz \
125B-F11-GH-16.renamed.vcf.gz \
165_Control.renamed.vcf.gz \
66_Flav.renamed.vcf.gz \
28-F10-BU-1.renamed.vcf.gz \
3-F6-GH-3.renamed.vcf.gz \
45-F2-GH-2.renamed.vcf.gz \
52-F3-GH-21.renamed.vcf.gz \
45-F2-GH-33.renamed.vcf.gz \
60_NF1.renamed.vcf.gz \
91-F2-BU-4.renamed.vcf.gz \
8_NF1.renamed.vcf.gz \
70-71_NF1.renamed.vcf.gz \
125B-F13-BU-6.renamed.vcf.gz \
109-F4-BU-3.renamed.vcf.gz \
69-F6-BU-4.renamed.vcf.gz \
92-F2-GH-15.renamed.vcf.gz \
102-F2-GH-9.renamed.vcf.gz \
44-F1-GH-35.renamed.vcf.gz \
64-F6-BU-5.renamed.vcf.gz \
109-F4-BU-1.renamed.vcf.gz \
185_Control.renamed.vcf.gz \
52-F5-BU-7.renamed.vcf.gz \
184_Control.renamed.vcf.gz \
125B-F10-GH-3.renamed.vcf.gz \
100_Appo.renamed.vcf.gz \
125B-F13-BU-5.renamed.vcf.gz \
52-F5-BU-6.renamed.vcf.gz \
70A-F1-GH-4.renamed.vcf.gz \
101_Appo.renamed.vcf.gz \
28-F8-BU-2.renamed.vcf.gz \
91-F2-BU-6.renamed.vcf.gz \
92-F1-GH-6.renamed.vcf.gz \
102-F2-GH-16.renamed.vcf.gz \
101-F1-GH-20.renamed.vcf.gz \
70B-F2-GH-20.renamed.vcf.gz \
91-F2-BU-7.renamed.vcf.gz \
69-F6-BU-7.renamed.vcf.gz \
70B-F2-GH-24.renamed.vcf.gz \
91_Appo.renamed.vcf.gz \
65-F3-BU-1.renamed.vcf.gz \
69-F7-BU-7.renamed.vcf.gz \
109-F4-BU-5.renamed.vcf.gz \
125B-F11-GH-3.renamed.vcf.gz \
3-F8-BU-3.renamed.vcf.gz \
60-F6-BU-4.renamed.vcf.gz \
63_Flav.renamed.vcf.gz \
91-F1-GH-13.renamed.vcf.gz \
92-F1-GH-4.renamed.vcf.gz \
102-F2-GH-7.renamed.vcf.gz \
28-F9-BU-1.renamed.vcf.gz \
70B-F2-GH-19.renamed.vcf.gz \
44-45_NF1.renamed.vcf.gz \
60-F7-BU-6.renamed.vcf.gz \
52-F5-BU-1.renamed.vcf.gz \
83-84_NF1.renamed.vcf.gz \
92-F1-GH-7.renamed.vcf.gz \
71_Appo.renamed.vcf.gz \
66B-F6-BU-4.renamed.vcf.gz \
70A-F1-GH-2.renamed.vcf.gz \
91-F2-BU-3.renamed.vcf.gz \
70A_Appo.renamed.vcf.gz \
34A-F1-GH-9.renamed.vcf.gz \
69-F6-BU-5.renamed.vcf.gz \
52-F3-GH-11.renamed.vcf.gz \
60-F6-BU-3.renamed.vcf.gz \
125B-F13-BU-2.renamed.vcf.gz \
52-F5-BU-2.renamed.vcf.gz \
67_Flav.renamed.vcf.gz \
-o merged.vcf
# If you want to compress it (bgzip replaces merged.vcf with merged.vcf.gz)
bgzip merged.vcf
```
:::
### R: VCF -> GENind -> GENpop
We can now download the merged VCF file, import it into R, and use a few packages to turn it into genind/genpop objects.
##### Load Packages
```
library(adegenet)
library(poppr)
library(dplyr)
library(hierfstat)
library(reshape2)
library(ggplot2)
library(RColorBrewer)
library(scales)
library(vcfR)
```
##### Turn into Genind
When importing, you may want to try `convertNA()`, which converts missing data from "." to NA. See the vcfR manual, p. 48: https://cran.r-project.org/web/packages/vcfR/vcfR.pdf
From the vcfR documentation: for users of poppr, if you wish to use vcfR2genind(), it is strongly recommended to use it with the option return.alleles = TRUE. This is because the poppr package accommodates mixed-ploidy data by interpreting "0" alleles in genind objects as NULL alleles in both poppr::poppr.amova() and poppr::locus_table().
```
#Import VCF
d8000_merged.vcf <- read.vcfR("merged.vcf", checkFile = TRUE,
                              check_keys = TRUE,
                              verbose = TRUE)
# Turn into genind
d8000_gid <- vcfR2genind(d8000_merged.vcf, sep = "[|/]", return.alleles = TRUE)
```
##### Assigning populations
Make a tab-delimited .txt file with the sample name in one column and the population in the other. Example:
```
109-F4-BU-4 Bombus appositus
66B-F6-BU-6 Bombus flavifrons
68-F3-BU-6 Bombus flavifrons
69-F7-BU-5 Bombus flavifrons
91-F1-GH-14 Bombus appositus
109-F3-BU-3 Bombus appositus
70A-F1-GH-7 Bombus appositus
44-F1-GH-30 Bombus appositus
44-F1-GH-18 Bombus appositus
84_Appo parent
```
```
#remove ".realigned.vcf.gz" from sample names in Excel
=LEFT(I2, LEN(I2) -15)
#To match pollinator to sample number
=INDEX(PollinatorRange,MATCH(Sample,SampleRange,0))
=INDEX(K$2:K$265,MATCH(Z1,J$2:J$265,0))
```