Population genetics of Harbour Porpoises

###### tags: `Labcourse 2023: Evolutionsbiologie/Spezielle Zoologie - PART B`

##### 13-24 February 2023
##### Dr. Marisol Domínguez

# Population genetics of Harbour Porpoises

![](https://i.imgur.com/HPzIcvB.png)

## Goal of the Tutorial

- Analyze two genetic markers to investigate the diversity and population structure of Harbour porpoises (*Phocoena phocoena*)

---

## PART B

## Molecular markers: microsatellites

Microsatellites, or SSRs (simple sequence repeats), are composed of tandem repeats of a two- to six-base-pair motif flanked by conserved non-repetitive regions. The mutation rate of microsatellites is very high, so these regions exhibit a high degree of polymorphism, mainly due to variation in the number of repeats. Because they are hypervariable, non-coding and co-dominant, microsatellites are widely used in population, evolutionary and genetic studies.

### ***Programs***

![](https://i.imgur.com/BwB9cE0.png)

This tutorial assumes that alleles have already been called and that genotyping errors have been checked.

![](https://i.imgur.com/PkAxqhf.png)

### B1. Calculate allele frequencies and determine private alleles

Activate the GenAlEx macro in Excel (just double-click on the .xlam file in the folder where GenAlEx was installed).

Open the data file: “infileMicrosatellites.xlsx”

The first column should contain the sample names, the second column the region, and the remaining columns the alleles of the 10 loci.

*Note that the data set must be sorted by regions (all of the populations from the first region must come first, then populations from the second region, etc.).*

```
Add two empty rows before the allelic data (if needed).
Click on Parameters -> Pops from col2
```

![](https://i.imgur.com/nL2ekMZ.png)

The first two rows are now filled with the names and number of individuals per population.

• What is the sample size per geographic location?

```
In the GenAlEx menu: choose Frequency-Based -> Frequency.
Enter the number of loci
Check that Number of Samples and Number of Populations are correct.
Click OK
```

Examine the Allele frequency data parameters window and click OK. (Note that these are “codominant” data.)

```
In the next window choose:
Frequency by Pop
Het, Fstat & Poly by Pop
Private Alleles List
(unclick Graph All Loci)
```

![](https://i.imgur.com/tCOyR0g.png)

Examine the spreadsheets produced by GenAlEx.

`The AFP spreadsheet contains the frequencies of all the alleles at each locus in each population.`

• Which is the most polymorphic locus (the one with the highest number of alleles)?

`The HFP spreadsheet contains information on the number of alleles, observed heterozygosity and expected heterozygosity over loci for each population.`

• Which population(s) show(s) the highest mean number of alleles over loci (Na)?

```
The PAS spreadsheet contains the summary of private alleles per population.
```

• Which population(s) show(s) the highest number of private alleles?

### B2. Calculate Allelic Richness

Allelic richness (AR) is a measure of the number of alleles that is independent of sample size, so it allows the number of alleles to be compared between populations with different sample sizes.

```
Export the spreadsheet containing the allele information to Genepop format: GenAlEx -> Import-Export -> Export -> GenPop.
Name it as: infile_GenpopFormat.gen (remember to add the extension .gen)
```

*If you have problems adding .gen, you probably need to configure Windows to show file extensions so that you can change them.* [Check this!](https://support.winzip.com/hc/en-us/articles/115011457948-How-to-configure-Windows-to-show-file-extensions-and-hidden-files)

*Do not close this file in Excel. We will need it later to convert it to the formats required by other programs.*

```
Open the program FSTAT. If asked, enter any number as a seed so that the results can be reproduced (FSTAT uses a randomization method to test the data).
Click on Utilities -> File Conversion -> “Genpop -> Fstat”
Click on the .gen file we generated with GenAlEx (a .dat file is generated)
Click on File -> Open and choose the .dat file
Select only Allelic richness and then click Run at the bottom of the window.
A .out file with the results is generated.
Calculate the average AR per population, considering that each column is one population (in the order they appeared in the Excel file). The last column is the allelic richness at the locus under consideration over all populations.
```

![](https://i.imgur.com/A2igvQ4.png)

• Which populations show the highest allelic richness? Report which population the calculations were based on (the one with the smallest sample size).

### B3. Hardy-Weinberg Equilibrium and Linkage Disequilibrium

![](https://i.imgur.com/aKMkQWo.png)

**Hardy-Weinberg equilibrium** = the state in which allele frequencies in a population of sexually reproducing diploid organisms remain constant from generation to generation because no evolutionary forces are acting on them.

In order to be in Hardy-Weinberg equilibrium a population must meet five assumptions:

1. **No Natural Selection**: no allele is more beneficial than any other.
1. **No Mutations**: no new alleles arise, so the gene pool is not modified from one generation to the next.
1. **Large Population Size**: small populations are more vulnerable to *genetic drift* (random changes in allele frequencies that can lead to fixation or loss of alleles).
1. **No Migration**: no immigrants bring new alleles from other populations (no gene flow), and no individuals leave the population taking certain alleles with them.
1. **Random mating**: no *sexual selection* (no individual has a better chance of mating than any other).

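
As a rough illustration of what a Hardy-Weinberg expectation looks like at a single locus, the sketch below computes observed heterozygosity and the heterozygosity expected under random mating (1 minus the sum of squared allele frequencies). The function name `heterozygosity` and the genotypes are hypothetical and are not taken from the course data; the sample-size-corrected (unbiased) estimate that some programs also report is not computed here.

```python
from collections import Counter


def heterozygosity(genotypes):
    """Return (observed, expected) heterozygosity for one diploid locus.

    genotypes: list of (allele1, allele2) tuples, one tuple per individual.
    Expected heterozygosity is 1 - sum(p_i^2), the Hardy-Weinberg expectation.
    """
    n = len(genotypes)
    # Count every allele copy (two per diploid individual).
    allele_counts = Counter(allele for genotype in genotypes for allele in genotype)
    total_copies = 2 * n
    freqs = {allele: count / total_copies for allele, count in allele_counts.items()}

    h_obs = sum(1 for a, b in genotypes if a != b) / n
    h_exp = 1.0 - sum(p * p for p in freqs.values())
    return h_obs, h_exp


# Hypothetical genotypes (allele sizes in bp) for one microsatellite locus.
genotypes = [(120, 120), (120, 124), (124, 124), (120, 124)]
h_obs, h_exp = heterozygosity(genotypes)
print(f"Ho = {h_obs:.3f}, He = {h_exp:.3f}")
```

An observed heterozygosity well below the expected value is the kind of deviation that the exact test in Arlequin evaluates formally.
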
We study Hardy-Weinberg equilibrium because the model lets us compare a population's actual genetic structure with the genetic structure we would expect if the population were in Hardy-Weinberg equilibrium (**not evolving**).

**Linkage Disequilibrium**: the non-random association of alleles at different loci (alleles at two loci co-occur more often, or less often, than expected by chance). It is important to analyze because it is a signal of the genetic processes that are structuring the population. For example, if there is assortative mating, and individuals with allele A tend to mate with B types rather than C types, AB genotypes will be more frequent than expected under random mating.

Let's check whether the harbour porpoise populations are in Hardy-Weinberg equilibrium or have loci in linkage disequilibrium! :grin:

```
Export the spreadsheet containing the allele information to Arlequin format: GenAlEx -> Import-Export -> Export -> Arlequin.
Name it as: infile_micros_Arlequin.arp (remember to add the extension .arp)
Open Arlequin (WinArl35.exe) and open a New Project, selecting the *.arp file.
In the Settings tab:
Click on Hardy-Weinberg and then on “Perform exact test of Hardy-Weinberg equilibrium”.
Click on Pairwise linkage: “Linkage disequilibrium between all pairs of loci”
Select AMOVA -> Standard AMOVA calculations.
Also select Population comparisons -> Compute pairwise FST.
Click on START (it can take up to 30 minutes depending on the resources available; LD is slow)
A new folder *.res is generated in the same folder as the input files. The *_main.htm file contains the results.
Open this file with right click -> Open with -> Internet Explorer.
```

:hourglass_flowing_sand:

![](https://i.imgur.com/WNrGh4G.png)

Remember to **adjust** the alpha (significance level) for the Linkage Disequilibrium tests by, for example, applying a **Bonferroni correction**.
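
A minimal sketch of the Bonferroni arithmetic, assuming every pair of the 10 loci is tested once within a population (the number of tests, and therefore the threshold, changes if you also count tests across populations). The helper `bonferroni_alpha`, the locus labels and the p-values are hypothetical and are not Arlequin output.

```python
from math import comb


def bonferroni_alpha(alpha, n_tests):
    """Return the per-test significance level after a Bonferroni correction."""
    return alpha / n_tests


n_loci = 10                   # microsatellite loci in this tutorial
n_pairs = comb(n_loci, 2)     # number of locus pairs tested per population
adjusted_alpha = bonferroni_alpha(0.05, n_pairs)

# Hypothetical LD p-values for three locus pairs in one population.
p_values = {("L1", "L2"): 0.0004, ("L1", "L3"): 0.03, ("L2", "L3"): 0.2}
significant = [pair for pair, p in p_values.items() if p < adjusted_alpha]

print(f"{n_pairs} tests, adjusted alpha = {adjusted_alpha:.5f}")
print("Pairs still significant:", significant)
```

With these made-up numbers, a p-value of 0.03 would pass the uncorrected 0.05 threshold but not the corrected one.
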
This adjustment is important because testing many hypotheses on the same data increases the probability of committing type I errors (rejecting the null hypothesis when it is true).

• Do allele frequencies of the microsatellite loci deviate from HW equilibrium? In which loci and in which populations is Obs. Het. significantly lower than Exp. Het.? Consider calculating the average Obs. Het. and Exp. Het. for each population across loci to report in a table.

• Was there evidence of linkage disequilibrium between any pair of loci?

## B4. Population Structure

### B4.1 AMOVA and Pairwise comparisons

Arlequin has also produced analyses of the genetic structure of the porpoises (check the end of the .htm file). Let's take a look at the AMOVA and population pairwise FSTs! :first_quarter_moon:

• Is there any evidence for genetic structure between populations?

• What percentage of the molecular variance is due to differentiation between individuals within a population?

• What percentage is due to differentiation among populations?

• Which populations differ the most?

### B4.2 Population Structure Analysis using a Bayesian Approach

Now we will estimate the number of genetically distinct groups in our dataset.

```
Export the Excel spreadsheet containing the allele information to Structure format: GenAlEx -> Import-Export -> Export -> Structure.
Name it as: infile_Structure.txt
```

*Note that missing data (if present) is now coded as -9.*

Open STRUCTURE and create a new project

```
Click on File -> New Project
```

In the four panels of the project wizard, enter the following information:

```
Panel 1
Choose a convenient project name, a directory where you want to store the results, and select infile_Structure.txt as the data file.

Panel 2
Specify the size of the data matrix, as well as how missing data is coded in the input file.
Individuals: 170
Ploidy: 2
Number of loci: 10
Missing data: -9

Panel 3
In the next two panels, you specify the format of the input file. Here, the rows included in the input file are specified.
Row of marker names: yes
Row of recessive alleles: no
Map distances between loci: no
Phase information: no
Data file stores data for individuals in a single line: yes

Panel 4
Finally, the columns contained in the input file are specified.
Individual ID for each individual: yes
Putative population origin for each individual: yes
USEPOPINFO selection flag: no
Sampling location information: no
Phenotype information: no
Other extra columns: yes (1)
```

```
Click ‘Finish’ and then ‘Proceed’
```

In order to run STRUCTURE, you’ll first have to define a new parameter set

```
Click on the ‘New Parameter Set’ button
```

A window will open where, in the first panel, you’ll have to specify the run duration for the MCMC chain.

```
Select a burn-in of 2000 iterations followed by a further 15000 MCMC iterations
```

These runs are much shorter than we would use to get really accurate answers, but they are relatively quick.

```
In the Ancestry Model tab choose the Admixture model
```

This model allows individuals to have **mixed ancestry** (they can receive a proportion of ancestry from each of the populations)

```
Click OK and give the new parameter set a name. Suggestion: 2000-15000
Run STRUCTURE by clicking: Project -> Start a Job.
Select the parameter set you just defined (2000-15000) and test K from 2 to 6 populations.
Number of iterations = 3.
Start.
```

![](https://i.imgur.com/apmKI7W.png)

Once it finishes, you can click on *Simulation Summary* to see the values of **Ln P(D)** for each run.

The model choice criterion implemented in STRUCTURE to detect the true K is an estimate of the posterior probability of the data for a given K, Pr(X|K) (Pritchard et al. 2000). STRUCTURE reports the logarithm of this estimate for each K as 'Ln P(D)' in the output; it is often referred to as the **log likelihood**.

*If you are interested, there are other approaches you can explore to choose the best K* (*like* [Evanno's method](https://pubmed.ncbi.nlm.nih.gov/15969739/)).

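
The idea behind Evanno's method can be sketched as follows: for each K, take the second-order change in Ln P(D) between neighbouring K values and divide it by the standard deviation of Ln P(D) across replicate runs. This sketch assumes the second difference is computed from the per-K means, which is one common simplification. The function name `delta_k` and all Ln P(D) values are hypothetical and are not results from the course data.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k):
    """Return {K: (mean Ln P(D), Delta K or None)}.

    lnp_by_k maps each tested K to the Ln P(D) values of its replicate runs.
    Delta K = |L(K+1) - 2*L(K) + L(K-1)| / sd(L(K)), with L taken as the
    per-K mean. It is undefined (None) for the smallest and largest K.
    """
    means = {k: mean(values) for k, values in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        dk = None
        if k - 1 in means and k + 1 in means:
            sd = stdev(lnp_by_k[k])
            second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
            if sd > 0:
                dk = second_diff / sd
        result[k] = (means[k], dk)
    return result


# Hypothetical Ln P(D) values: three replicate runs for each K from 2 to 6.
runs = {
    2: [-5200.0, -5210.0, -5190.0],
    3: [-5000.0, -5004.0, -4996.0],
    4: [-4990.0, -5000.0, -4980.0],
    5: [-4985.0, -5005.0, -4965.0],
    6: [-4990.0, -5030.0, -4950.0],
}
table = delta_k(runs)
candidates = [k for k, (_, dk) in table.items() if dk is not None]
best_k = max(candidates, key=lambda k: table[k][1])

for k, (mean_lnp, dk) in table.items():
    print(k, round(mean_lnp, 1), dk)
print("K with the largest Delta K:", best_k)
```

Because Delta K needs both neighbours, testing K from 2 to 6 only yields values for K = 3 to 5, and only three replicates per K gives a rough estimate of the standard deviation.
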
*There are also alternative programs to choose the best K:*

http://taylor0.biology.ucla.edu/structureHarvester

http://clumpak.tau.ac.il/ (under tab: Best K).

• What is the most likely number of clusters in the data we are analyzing?

#### Bayesian Clustering Plot

```
1. Go to the Results folder from your Structure results (for example: MyDoc/Labkurse/2000-15000/Results)
2. Zip the folder: Results.zip (right click on folder name -> Send To -> compressed zip folder)
3. Upload that to the Clumpak website: http://clumpak.tau.ac.il/ in the tab “Main Pipeline and create a Structure plot”.
```

• What does each bar of the plot represent, and what does each color represent? Discuss whether you found evidence of different genetic clusters in the data set studied with the microsatellite markers.

![](https://i.imgur.com/e5xXlBg.png)

## B5. References

Evanno, G., Regnaut, S., & Goudet, J. (2005). Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Molecular Ecology, 14(8), 2611-2620.

Pritchard, J. K., Stephens, M., & Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155(2), 945-959.