- Date: July 2023
- Author: Dylan Cohen
A lot of what I do comes from these tutorials. I recommend going thru one of them or both. They both show how to create some useful graphs in R to determine what parameters to filter on with vcftools.
**Note the github one comes from ipyrad output so there will be a few differences from STACKS**
https://speciationgenomics.github.io/filtering_vcfs/
https://github.com/cheyennelmoore/Erigenia-Analyses
---
# VcfTools
(https://vcftools.sourceforge.net/)
Vcftools is used to further clean your data. You will want to remove samples with a lot of missing data if you have not already done so. This can be done easily with vcftools. It is critical to try and eliminate as much missing data for STRUCTURE/ADMIXTURE analyses, less crtical for genetic diversity and inbreeding.
### Example then dummy code
```vcftools --remove-indv B162fq --remove-indv T4fq --remove-indv B59fq --remove-indv T8fq --remove-indv B24fq --remove-indv R15fq --remove-indv P21fq --remove-indv P26fq --vcf comb_85D.vcf --recode --out comb85D.re.vcf ```
```vcftools --remove-indv SAMPLEID --vcf input.vcf --recode --out new.name.vcf```
Below is a oneliner that I have used to filter some of my data sets. This allows for up to 50% missing data. Refer to the vcftools guide to figure out what the rest of the parameters do. For analyses using STRUCTURE or ADMIXTURE I generally filter to about 90-95% data present (meaning allowing only 5-10% missinginess in other words --max-missing 0.95 or 0.90)
```vcftools --vcf --input.vcf --min-meanDP 5 --minDP 5 --max-meanDP 50 --maxDP 50 --mac 3 --remove-indels --max-alleles 2 --min-alleles 2 --max-missing 0.50 --recode ---out renamed.ouput.vcf```
## Genetic diversity and Inbreeding R tutorial
ibrary(hierfstat)
library(gaston)
library(vcfR)
library(reshape2)
library(pegas)
library(cowplot)
library(adegenet)
library(dartR)
setwd()
## Read in tharpii vcf file and convert it to genid
vcf<-read.vcfR("./min_75_tharp.recode.vcf")
tharp_genind<- vcfR2genind(vcf)
#What are the order of the populations/samples in the vcf file?
colnames(vcf@gt)
## Assign samples to populaitons
pop(tharp_genind)<-as.factor(c("CAP", "CAP", "CAP", "CAP", "CAP", "CAP", "CAP", "CAP", "CAP", "CAP", "CPC","CPC", "CPC", "CPC", "CPC", "CPC", "CPC", "CPC", "CPC", "CPC", "TEX", "TEX", "TEX", "TEX", "TEX", "TEX", "TEX", "TEX", "TEX", "TEX", "RED", "RED", "RED", "RED", "RED", "RED", "RED", "RED", "RED", "RED", "BEN", "BEN", "BEN", "BEN", "BEN", "BEN", "BEN", "BEN", "BEN", "BEN"))
#list pop names
popNames(tharp_genind)
#convert to hfstat format
tharp_hfstat<-genind2hierfstat(tharp_genind,pop=NULL)
#Converts vcf to genlight object
tharp_gl <- vcfR2genlight(vcf)
###Assign samples to a population for genlight object
pop(tharp_gl)<-as.factor(c("CAP", "CAP", "CAP", "CAP", "CAP", "CAP", "CAP", "CAP", "CAP", "CAP", "CPC","CPC", "CPC", "CPC", "CPC", "CPC", "CPC", "CPC", "CPC", "CPC", "TEX", "TEX", "TEX", "TEX", "TEX", "TEX", "TEX", "TEX", "TEX", "TEX", "RED", "RED", "RED", "RED", "RED", "RED", "RED", "RED", "RED", "RED", "BEN", "BEN", "BEN", "BEN", "BEN", "BEN", "BEN", "BEN", "BEN", "BEN"))
popNames(tharp_gl)
###Basic Stats for All Tharpii populations
basicstat <- basic.stats(tharp_hfstat, diploid=TRUE, digits=3)
overall_basic_stats <- as.data.frame(basicstat$overall)
overall_basic_stats
* This is a global estimate across all your data in the vcf file. If its only one pop then you do not need to divide it out like below. *
tharp_hfstat$pop
pops<- c("CAP","CPC","TEX","RED","BEN")
colnames(basicstat$Ho) <- pops
* Generate population statistics for each Tharpii population these are in bold because r code does not transfer well here
#### Ho <- colMeans(basicstat$Ho,na.rm=T)
#### He <- colMeans(basicstat$Hs,na.rm=T)
#### Fis<-colMeans(basicstat$Fis,na.rm = T)
#### new_popgen_stats <- cbind(Ho,He,Fis)
#### new_popgen_stats # stats for each Population
################################################################################
# DartR
library(dartR)
There are many extremely useful outputs from this package, a lot of them will overlap with other population genetics programs.
https://github.com/green-striped-gecko/dartR
Check your genlight file is ready for DartR
Ben_gl2 <- gl.compliance.check(Ben_gl)
# Private Alleles
gl.report.pa(Ben_gl2, method = 'one2rest')
# Basic Stats
gl.basic.stats(Ben_gl2)
gl.report.heterozygosity(Ben_gl,method = 'pop')
gl.report.heterozygosity(Ben_gl2,method = 'ind')
gl.test.heterozygosity(Ben_gl2, nreps = 1000)