Vcftools, hierfstat, and DartR

- Date: July 2023 - Author: Dylan Cohen A lot of what I do comes from these tutorials. I recommend going thru one of them or both. They both show how to create some useful graphs in R to determine what parameters to filter on with vcftools. **Note the github one comes from ipyrad output so there will be a few differences from STACKS** https://speciationgenomics.github.io/filtering_vcfs/ https://github.com/cheyennelmoore/Erigenia-Analyses --- # VcfTools (https://vcftools.sourceforge.net/) Vcftools is used to further clean your data. You will want to remove samples with a lot of missing data if you have not already done so. This can be done easily with vcftools. It is critical to try and eliminate as much missing data for STRUCTURE/ADMIXTURE analyses, less crtical for genetic diversity and inbreeding. ### Example then dummy code ```vcftools --remove-indv B162fq --remove-indv T4fq --remove-indv B59fq --remove-indv T8fq --remove-indv B24fq --remove-indv R15fq --remove-indv P21fq --remove-indv P26fq --vcf comb_85D.vcf --recode --out comb85D.re.vcf ``` ```vcftools --remove-indv SAMPLEID --vcf input.vcf --recode --out new.name.vcf``` Below is a oneliner that I have used to filter some of my data sets. This allows for up to 50% missing data. Refer to the vcftools guide to figure out what the rest of the parameters do. For analyses using STRUCTURE or ADMIXTURE I generally filter to about 90-95% data present (meaning allowing only 5-10% missinginess in other words --max-missing 0.95 or 0.90) ```vcftools --vcf --input.vcf --min-meanDP 5 --minDP 5 --max-meanDP 50 --maxDP 50 --mac 3 --remove-indels --max-alleles 2 --min-alleles 2 --max-missing 0.50 --recode ---out renamed.ouput.vcf``` ## Genetic diversity and Inbreeding R tutorial ibrary(hierfstat) library(gaston) library(vcfR) library(reshape2) library(pegas) library(cowplot) library(adegenet) library(dartR) setwd() ## Read in tharpii vcf file and convert it to genid vcf<-read.vcfR("./min_75_tharp.recode.vcf") tharp_genind<- vcfR2genind(vcf) #What are the order of the populations/samples in the vcf file? colnames(vcf@gt) ## Assign samples to populaitons pop(tharp_genind)<-as.factor(c("CAP", "CAP", "CAP", "CAP", "CAP", "CAP", "CAP", "CAP", "CAP", "CAP", "CPC","CPC", "CPC", "CPC", "CPC", "CPC", "CPC", "CPC", "CPC", "CPC", "TEX", "TEX", "TEX", "TEX", "TEX", "TEX", "TEX", "TEX", "TEX", "TEX", "RED", "RED", "RED", "RED", "RED", "RED", "RED", "RED", "RED", "RED", "BEN", "BEN", "BEN", "BEN", "BEN", "BEN", "BEN", "BEN", "BEN", "BEN")) #list pop names popNames(tharp_genind) #convert to hfstat format tharp_hfstat<-genind2hierfstat(tharp_genind,pop=NULL) #Converts vcf to genlight object tharp_gl <- vcfR2genlight(vcf) ###Assign samples to a population for genlight object pop(tharp_gl)<-as.factor(c("CAP", "CAP", "CAP", "CAP", "CAP", "CAP", "CAP", "CAP", "CAP", "CAP", "CPC","CPC", "CPC", "CPC", "CPC", "CPC", "CPC", "CPC", "CPC", "CPC", "TEX", "TEX", "TEX", "TEX", "TEX", "TEX", "TEX", "TEX", "TEX", "TEX", "RED", "RED", "RED", "RED", "RED", "RED", "RED", "RED", "RED", "RED", "BEN", "BEN", "BEN", "BEN", "BEN", "BEN", "BEN", "BEN", "BEN", "BEN")) popNames(tharp_gl) ###Basic Stats for All Tharpii populations basicstat <- basic.stats(tharp_hfstat, diploid=TRUE, digits=3) overall_basic_stats <- as.data.frame(basicstat$overall) overall_basic_stats * This is a global estimate across all your data in the vcf file. If its only one pop then you do not need to divide it out like below. * tharp_hfstat$pop pops<- c("CAP","CPC","TEX","RED","BEN") colnames(basicstat$Ho) <- pops * Generate population statistics for each Tharpii population these are in bold because r code does not transfer well here #### Ho <- colMeans(basicstat$Ho,na.rm=T) #### He <- colMeans(basicstat$Hs,na.rm=T) #### Fis<-colMeans(basicstat$Fis,na.rm = T) #### new_popgen_stats <- cbind(Ho,He,Fis) #### new_popgen_stats # stats for each Population ################################################################################ # DartR library(dartR) There are many extremely useful outputs from this package, a lot of them will overlap with other population genetics programs. https://github.com/green-striped-gecko/dartR Check your genlight file is ready for DartR Ben_gl2 <- gl.compliance.check(Ben_gl) # Private Alleles gl.report.pa(Ben_gl2, method = 'one2rest') # Basic Stats gl.basic.stats(Ben_gl2) gl.report.heterozygosity(Ben_gl,method = 'pop') gl.report.heterozygosity(Ben_gl2,method = 'ind') gl.test.heterozygosity(Ben_gl2, nreps = 1000)