###### tags: Übung course: Exercises on the role of evolution in biology - Module The Central Role of Evolution in Biosciences - University of Potsdam *Marisol Domínguez* # CONSERVATION GENETICS ## Goals of the tutorial This is a hands-on tutorial that will guide you in a step-by-step manner to perform population genetics analysis and simulate dynamics under different evolutionary forces. During this practical we will learn how to analyze NGS genetic data, and discuss the outcome of simulating populations under genetic drift and natural selection. *[Previous knowledge in population genetics and evolution by natural selection theories is assumed.]* # PART 1: virtual lab ![](https://hackmd.io/_uploads/ryRq3BMDn.png) ## Objective Examine how violating the assumptions of the Hardy-Weinberg equilibrium affects the population genetics of a virtual population of fishes. We will put in practice concepts of Hardy-Weinberg equilibrium and population genetic theory by simulating a virtual population of koi fish and following their allele and genotype frequencies during several generations. ![](https://hackmd.io/_uploads/BkgT3SMDh.png) <span style="color:blue">Which type of genetic inheritance is it? </span> Access to virtual lab: http://virtualbiologylab.org/ModelsHTML5/PopGenFishbowl/PopGenFishbowl.html ## Evolutionary Scenario: Simulating Random Genetic Drift Read the Background Information and Tutorial pages. Go to Run Experiments and observe the default parameters. <span style="color:blue">How do you expect the population to respond under Hardy-Weinberg equilibrium? *(we are not testing it because we can not set pop size = infinite)* </span> Leave all parameters at their default values, but make the population smaller. Change the Initial Size, and Carrying Capacity to 50. Write down the initial allele and genotype frequencies. Let the fish mate and die for ~400 generations (you can speed up the simulations up to 16X). Collect data regarding allele and genotype proportions by clicking on the bottom "**To Data**" ![image](https://hackmd.io/_uploads/SJmTmn_f0.png) ![image](https://hackmd.io/_uploads/rJgqUndfR.png) <span style="color:blue">What happened overall to the allele and genotype frequencies? Is there a significant change compared with the initial proportions? </span> <span style="color:blue"> Does the model respond in the way you expected? If not, why that might be? </span> Reset the “Run Experiments” parameters back to the original settings (spiral arrow). Repeat the simulation with the exact same conditions 2 more times: <span style="color:blue"> Are the results (allele and genotype proportions) always the same? </span> ## Evolutionary Scenario: Simulating Natural Selection Go back to the Design screen. Change parameters back to default values (except initial population size and carrying capacity - which we want to the maximum) and **reduce the fitness of genotype RR** to a value of your choice. Now being white or spotted confers a selective advantage (camouflage?). Let the simulation run >100 generations and observe what happens <span style="color:blue"> How is the genotype proportion for RR? Did Rr increase, decrease or stay the same? </span> # PART 2: Population Structure Analysis ## Software needed * Latest version of R (and Rstudio if you use it). To check current installed version of R type in the console: `R.version.string` To check current version of Rstudio: `Help -> About Rstudio` * SambaR R package and its dependencies To install SambaR: 1. go to it's [github website](https://github.com/mennodejong1986/SambaR) and click on the green bottom "Code" -> Downoad ZIP. 2. Unzip the folder, and follow the instructions in the [SambaR manual](https://github.com/mennodejong1986/SambaR/blob/master/sambaR_manual.docx). 3. Extract files to a local folder named SambaR in a directory of your choice (hereafter called "path"). Load SambaR in R: 1. Open a new R session 2. Load SambaR: `source("C:/path/SambaR/SAMBAR_v1.10.txt")` If it worked you shoud see in R console the message: *"SambaR loaded."* 3. Install local dependences (need at least 2Gb disk space for dependences;warnings are allowed but check for errors in the console): `getpackages()` *Please refer to the manual if you encounter any error messages.* *Hint: If you copy/paste the path, be aware that R uses forward slashes rather than backward slashes.* # Demo: polar bears To check that SambaR was succesfully installed, we will run the example data with sequencing data of **polar bears** *(Ursus maritimus)*. ![](https://hackmd.io/_uploads/BySRj-mwh.png) *Sequencing data was produced by RADseq method and published in: [Viengkone et al. (2017)](https://onlinelibrary.wiley.com/doi/10.1002/ece3.2563).* Assuming **SambaR** and **dependences** are already **loaded**, then: Check folder with infile: `list.files("C:/path/SambaR/example_data")` Set the path to where the infiles are: `setwd("C:/path/SambaR/example_data")` Load the sample file: `importdata(inputprefix="polarbear",sumstatsfile=FALSE,depthfile=FALSE)` *Check the **console's messages** to understand what is going on. Observe that a new **output folder** was generated in your working directory. Observe also the **data objects** that R generated.* ![](https://hackmd.io/_uploads/SyqHTpZv2.png) To obtain a summary of your data: ``` mygenlight ``` <span style="color:blue">How many polar bear individuals are included in the dataset? How many SNPs? Which locations have been sampled? </span> Filter your data: ``` filterdata(indmiss=0.25,snpmiss=0.1,min_mac=2,dohefilter=TRUE) ``` *Hint: Explore in the output directory the new files that were generated.* <span style="color:blue">How many SNPs were retained? *(check console)* How many individuals were retained? </span> ### Private and shared alleles: Now we will investigate genetic variation within each location. Explore for how many loci the populations share the same allele compositions, and how many private (alleles unique to a location that are not found in other populations) there are. ``` markervenn(export=TRUE,plotname="Private_and_shared_alleles_venndiagram",minmaf=0,dofilter=TRUE,popnames=mysambar$populations,popcolours=mysambar$mycolours,silent=TRUE,numbercex=2.5,labelcex=2.5,addtotal=FALSE,include_overall=FALSE,doprop=FALSE) ``` *Hint: Explore in the example_data folder the plot generated*. <span style="color:blue">Which location shows the higher number of private alleles? Is there any population with only shared alleles? </span> ### Admixture barplots: We will run LEA (Landscape and Ecological Association Studies, Frichot & François 2015) which is a program, that just like STRUCTURE (Pritchard *et al.* 2000), calculates ancestry coefficients. The algorithms used by both programs are different (STRUCTURE uses a Bayesian approach, while LEA is a frequentist method), but the estimates of ancestry coefficients are similar and faster in LEA. LEA calculates the probability that an individual originates from each of the predefined genetic clusters (k), given the observed genotype of the individual and given the observed population allele frequencies. *Consider, as an analogy, two jars. Jar A contains 10 red and 90 white balls. Jar B contains 1 red and 99 white balls. A ball is drawn from one of these jars, and the ball is red. From which bottle has the ball been drawn? there is a 91% probability the ball has been drawn from jar A, and a 9% probability the ball has been drawn from jar B.* The program provides this information in qmatrices and in barplots. `runLEA(mindemes=2,maxdemes=4) LEAstructureplot(mindemes=2,maxdemes=4,export="pdf",dolabelcol=TRUE,addindname=TRUE) ` <span style="color:blue">What represents each bar of the plot, and each color? For k=4, and considering a threshold of 50%: Which percentage of the individuals from **SH** location are assigned to the same cluster? (*Hint: zoomming in the structure like plot to see individuals ID and opening the qmatrix in excel might help*). </span> ### Discriminant analysis of principal components (DAPC): DAPC ([Jombart et al. 2010](https://bmcgenomdata.biomedcentral.com/articles/10.1186/1471-2156-11-94)) is a multivariate method that identifies and describes clusters of genetically related individuals. It first performs a principal component analysis (PCA) in which the original variables are transformed in uncorrelated components to reduce dimensionality of the data. Then, it performs a Discriminant Analysis (DA) which classifies the data into distinct genetic clusters in a way to maximize between-group genetic variance while minimizing within-group variance. ``` adegenet_dapc(maxclusters=4,export="pdf",addellipse=TRUE,ellipselevel=0.9,use_genind=FALSE,addlegend=TRUE,legendpos="topleft",legendcex=0.8,symbolsize=3,plotheatmap=FALSE) ``` This produced many scatter plots, both using and ignoring a priori defined population assignment as input. The more interesing is probably: ``` dapc_scatter(addlegend=TRUE,export=NULL,symbolsize=1,legendpos="topright",legendcex=0.5,titlesize=0.5) ``` In this plots, each dot represents an individual. <span style="color:blue">According to this results, is there any overlap between clusters? </span> ### Fixation index (FST): FST is a are a measure of genetic differentiation among populations due to genetic structure. It is calculated comparing the genetic diversity within populations to the total genetic diversity across populations. FST values can range from 0 to 1: An FST of 0 indicates no genetic differentiation (the populations share the same allele frequencies). An FST of 1 indicates that the two populations are completely genetically distinct (no shared alleles). There are different FST estimators. **Nei's FST estimator** (Nei 1973) compares measures of heterozygosity within and between populations (it measures the reduction of heterozygosity in a subpopulation compared to the total population). **Weir & Cockerham FST estimator** (Weir & Cockerham 1984) is more robust to small/unbalanced sample sizes (because it uses an ANOVA to partition the total genetic variance into within-population, between-population, and between-individual components). To calculate pairwise-FST values for all combination of locations: ``` runstamppFst(n_boots=100,export="pdf") ``` <span style="color:blue">Which pair of locations differ the most (showed the highest values FST values)? </span> # Case study: Sea turtles conservation - *at home* ![](https://hackmd.io/_uploads/ryp8b6Wwn.png) ## Background knowledge Six out of the seven species of sea turtles are considered endangered, critically endangered or vulnerable (IUCN). Hybridization between sea turtle species occurs with relatively high frequency, even between very diverged species. Hawksbill males (*Eretmochelys imbricata*) and loggerhead females (*Caretta caretta*) are known to hybridize and reproduce viable hybrid offspring. Adults from these two parental species are morphologically different but hatchlings are hard to differentiate. We will use genomic information from 24 sea turtles (includings adults and hatchlings) sampled in Brazil, and generated in the publication Arantes *et al*. 2020 to analyze the most probable ancestry of hatchlings and hypothesize their status as species-specific or hybrids. ![](https://hackmd.io/_uploads/B10-zaZDn.png) ![](https://hackmd.io/_uploads/rkTnAQ7Ph.png) In [moodle](https://moodle2.uni-potsdam.de/course/view.php?id=37152) you will find a Data Folder containing the files with the genetic information (populationsSnps.bim, populationsSnps.raw, populations3.txt). File populations3.txt contains information of each sample. ## Objective Assign a species-specific or hybrid status to hatchlings of unknown ancestry. For this, infer the coefficient of admixture of the 14 hatchlings of unknown ancestry. ## Files Files are in RAW and BIM format, which are the binary versions of PED and MAP format. Data files have been pre-processed and filter prior to this practical. Basically, a VCF (Variant Call Fotmat) file containing genetic sequence variations (metadata + position, ID and quality information for each genetic variant in each sample) was converted into PED and MAP format using vcftools (Danecek et al. 2011), and these into RAW and BIM using PLINK (Purcell et al. 2007). Download them to a folder of your choice ## Analysis You will not need to run the exact same commands as with the polar bears' demo as the objective is different. [*But hopefully the demo was useful to get familiar with SambaR and have an overview its capacities.*] Open a new R session, load SambaR nad its dependencies: ``` source("C:/path/SambaR/SAMBAR_v1.09.txt") getpackages() ``` Set the path to where the infiles are: ``` setwd("C:/path") importdata(inputprefix="populationsSnps", samplefile="populations3.txt") ``` Inspect data object generated. <span style="color:blue">How many hatchlings of unknown species/hybrid status are present in the dataset? </span> Filtering and data quality check: `filterdata(indmiss=1,snpmiss=1,min_mac=0,dohefilter=FALSE,min_spacing=0,snpdepthfilter=FALSE,do_distplot=FALSE)` *Note: Eventhough the data was previoulsy filtered, this step is mandatory as it also performs data quality control . It actually does not remove information, but instead adds columns to the snps and inds objects.* <span style="color:blue">Do some individuals show unusually high genome-wide HE levels? How many? Why could that be? </span> **Population structure analysis** Principal Component Analysis (PCA) and Principal Coordinates Analysis (PCoA) are two ordination techniques used for multivariate analysis that reduce the complexity of high-dimensional data while preserving the most important structure or variation in the data. This allows visualization of the distances between samples in a low-dimensional space (PCoA uses (dis)similarity matrices). ``` snprelate_pca(export="pdf",axis1=1,axis2=2,do_mirror=c(FALSE,FALSE),addlegend=TRUE,legendpos="topright",legendcex=1.5,symbolsize=2,dodc=FALSE,labels=TRUE) ape_pcoa(labels=TRUE,method="nei",export="pdf",axis1=1,axis2=2,do_mirror=c(FALSE,FALSE),addlegend=TRUE,legendpos="topright",legendcex=1.5,symbolsize=2,dodc=FALSE) ``` <span style="color:blue"> What does it mean that some individuals are close to each other in this plots? Are all hatchlings grouped together or with one of the species? </span> Now let's estimate the ancestry proportions of each individual to each cluster. ``` runLEA(mindemes=2,maxdemes=6) LEAstructureplot(mindemes=2,maxdemes=6,export="pdf",dolabelcol=TRUE,addindname=TRUE) ``` Or since we are mostly interested in analyzing the results for k=2 as our dataset contains individuals from two distinct species and potentially hybrids: `findstructure(onlyLEA=TRUE,Kmin=2,Kmax=2,LEAheightfactor=3,LEAwidthfactor=0.1,LEAaxiscex=1.5,LEAlabelcex=1.75,LEAyaxiscex=1.25,LEAylabel=NULL)` <span style="color:blue"> Is there any evidence of admixed individuals? Why do you think one of the "control" individuals identified morphologically as Hawksbill Turtle (*Eretmochelys imbricata*) shows signs of admixture? [*Hint: inspect inds data object*] . </span> Taking into account all analysis: <span style="color:blue"> What is the species/hybrid status of the 14 sea turtles' hatchlings of unknown ancestry? </span> ### General advice on running SambaR * If you need to re-run any function, make sure that the plots are not opened in any viewer. * The file SambaR_methods.txt can really help describe methods used (useful for your protocol). * To inspect all possible arguments of any function you can type: `args(function_name)`. You will see that for most functions you can change the output plots format to .png, if you prefer. # References * Arantes, L. S., Ferreira, L. C. L., Driller, M., Repinaldo Filho, F. P. M., Mazzoni, C. J., & Santos, F. R. (2020). Genomic evidence of recent hybridization between sea turtles at Abrolhos Archipelago and its association to low reproductive output. Scientific Reports, 10(1), 12847. * Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., ... & 1000 Genomes Project Analysis Group. (2011). The variant call format and VCFtools. Bioinformatics, 27(15), 2156-2158. * Frichot, E., & François, O. (2015). LEA: An R package for landscape and ecological association studies. Methods in Ecology and Evolution, 6(8), 925-929. * Jombart, T., Devillard, S., & Balloux, F. (2010). Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC genetics, 11(1), 1-15. * Pritchard, J. K., Stephens, M. & Donnelly, P.(2000). Inference of population structure using multilocus genotype data. Genetics 155, 945–959. * Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A., Bender, D., ... & Sham, P. C. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. The American journal of human genetics, 81(3), 559-575. * Nei, M. (1973). Analysis of gene diversity in subdivided populations. Proceedings of the National Academy of Sciences, 70(12), 3321-3323. * Viengkone, M., Derocher, A. E., Richardson, E. S., Malenfant, R. M., Miller, J. M., Obbard, M. E., ... & Davis, C. S. (2016). Assessing polar bear (Ursus maritimus) population structure in the Hudson Bay region using SNPs. Ecology and Evolution, 6(23), 8474-8484. * Weir, B. S., & Cockerham, C. C. (1984). Estimating F-statistics for the analysis of population structure. Evolution, 38(6), 1358-1370