###### tags: `Master's Labcourse 2024: Molecular Population Genetics & Conservation Genetics`
04.03 - 15.03, 2024
Marisol Domínguez
# Population genetics analysis

* week 1: labwork following the printed protocol.
* week 2: statistical analysis following this online tutorial.
## Goals of the Tutorial
During this part of the course, we will learn how to analyze genetic data and how it can support informed management decisions. We will first use a large dataset of previously sequenced samples to analyze the species' overall genetic diversity. Then, we will inspect and edit the sequencer output from the samples you processed in the laboratory.
Our specific objectives are:
- Investigate the diversity and population structure of the yellow cardinal (*Gubernatrix cristata*) using two types of genetic markers.
- Apply wildlife DNA forensics to assign the most probable origin to confiscated birds.
---
## PART A
Different markers (haploid, dominant, codominant) can be used to investigate population structure. However, when whole genomes are not available, a combination of markers usually provides the most information (Rowe et al. 2017).
## Molecular marker: mitochondrial DNA
### ***Programs***
Numerous programs are available for assessing genetic diversity. In this tutorial, we describe the use of some that are freely available.

### *Software's manuals/papers*
* [BioEdit](https://www.researchgate.net/publication/258565830_BioEdit_An_important_software_for_molecular_biology/)
* [DnaSP](https://www.researchgate.net/publication/320413912_DnaSP_6_DNA_Sequence_Polymorphism_Analysis_of_Large_Datasets/)
* [Arlequin](https://journals.sagepub.com/doi/full/10.1177/117693430500100003)
* [PopART](http://popart.otago.ac.nz/documentation.shtml)
---
### **Genetic diversity of mitochondrial DNA**
mitochondrial organelle:

Avian mitochondrial genome organization :

* Maternally inherited: individuals receive their mtDNA exclusively from their mother
* Haploid: each individual effectively carries a single mtDNA haplotype, although every cell contains many copies of it
* No homologous recombination: the mother's genetic material is passed on unchanged, apart from new mutations
* Control Region (CR or D-loop): non-coding, with a fast substitution rate
---
# Genetic diversity and population structure
We will analyze a large sample of sequences from wild individuals of known origin. All samples come from adults from 5 different regions across the known distribution of yellow cardinals: CO (Corrientes), LP (La Pampa), RN (Rio Negro), MD (Mendoza), U (Uruguay).
## Distribution of yellow cardinals :bird:

###### Map showing the (historical) assumed distribution of yellow cardinals (dark grey area). The points denote sampling localities. Adapted from Domínguez *et al.* 2017.
### A1. Diversity measures in natural populations
We will use the software **DnaSP** to analyze the alignment of sequences (*infileCRYelCar.fas*). Note that the sample names include information about their geographical origin.
```
File --> Open Data File (change extension to All files (*.*))
In Data --> Format indicate that we are dealing with mtDNA, haploid genome
Overview --> Polymorphism Data (Data Set: include all sequences, and all sites)
```
<span style="color:blue">• How many haplotypes are in the sequence dataset?
</span>
<span style="color:blue">• How many polymorphic sites (S) are there in total?
</span>
---
### A2. Within population variation
Populations may have been subjected to varying selective forces, leading to distinct evolutionary paths.
We will compare:
- **Hd: haplotype diversity** (probability that two randomly selected haplotypes are different), and
- **Pi: nucleotide diversity** (average number of nucleotide differences per site between two random haplotypes)

Note:
- Large Hd + small Pi → many similar mitochondrial genotypes
- Small Hd + large Pi → few but very different genotypes
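
As an illustration of how these two statistics are defined, the sketch below computes Hd and Pi for a small made-up alignment. It is a minimal example that assumes aligned sequences of equal length without gaps or missing data; the function names and the toy sequences are hypothetical, and DnaSP's own handling of gaps may give slightly different values on real data.

```python
from collections import Counter
from itertools import combinations


def haplotype_diversity(seqs):
    """Hd = n / (n - 1) * (1 - sum of squared haplotype frequencies)."""
    n = len(seqs)
    counts = Counter(seqs)  # identical sequences count as one haplotype
    return n / (n - 1) * (1 - sum((c / n) ** 2 for c in counts.values()))


def nucleotide_diversity(seqs):
    """Pi = mean number of differences per site over all sequence pairs."""
    length = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return diffs / (len(pairs) * length)


# Toy alignment (invented): 4 sequences, 3 haplotypes, 8 sites
toy = ["ACGTACGT", "ACGTACGT", "ACGTACGA", "TCGTACGA"]
hd = haplotype_diversity(toy)
pi = nucleotide_diversity(toy)
print(f"Hd = {hd:.3f}, Pi = {pi:.3f}")
```

Hd depends only on the haplotype frequencies, while Pi also depends on how many sites differ between haplotypes, which is why the two can point in different directions.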
```
Define populations in: Data --> Define Sequence Sets
```

```
Repeat the Overview --> Polymorphism Data steps performed before, but this time select each population individually under Data Set.
```
<span style="color:blue">• Which population has the greatest number of haplotypes?
</span>
<span style="color:blue">• What is the haplotype diversity per population? Which population(s) show the highest levels of Hd?
</span>
<span style="color:blue">• How does nucleotide diversity (Pi) vary across the studied populations?
</span>
```
Before closing DnaSP:
- export this file with the sequences clustered into populations as *.NEXUS (to be used to create a haplotype network in PopART).
- also export this file in Arlequin format (*.hap and *.arp). Make sure you save both files in the same directory and with the same name (different extensions).
```
---
### A3. Among populations variation
AMOVA stands for Analysis of MOlecular VAriance and is a method designed to assess population differentiation using molecular markers (Excoffier, Smouse & Quattro, 1992).
In populations with random mating (panmictic populations), we would expect most of the variance to arise from within samples. If most of the variance occurs among samples within populations or among populations, then there is evidence for some sort of population structure.
The fixation index FST measures the proportion of the total genetic variation that is due to differences among populations. Its values range from 0 to 1. A value of 0 indicates no population structuring or subdivision (panmixia), while a value of 1 indicates that the populations examined do not share any genetic diversity.
Following Wright (1978), genetic differentiation is often considered low when FST < 0.05, moderate when 0.05 < FST < 0.15, high when 0.15 < FST < 0.25, and very high when FST > 0.25.
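
To make the definition concrete, here is a minimal sketch of FST computed as (HT - HS) / HT for a single locus with two alleles and equally sized populations. This is a textbook simplification, not the estimator that Arlequin derives from the AMOVA variance components; the function names are hypothetical, and the category labels follow the Wright (1978) thresholds given in the text.

```python
def gene_diversity(p):
    """Expected diversity at a two-allele locus with allele frequency p."""
    return 2 * p * (1 - p)


def fst_two_alleles(freqs):
    """FST = (HT - HS) / HT from one allele's frequency in each population."""
    hs = sum(gene_diversity(p) for p in freqs) / len(freqs)  # mean within pops
    ht = gene_diversity(sum(freqs) / len(freqs))             # pooled population
    return (ht - hs) / ht  # undefined if every population is fixed for the same allele


def wright_category(fst):
    """Qualitative label following the thresholds quoted from Wright (1978)."""
    if fst < 0.05:
        return "low"
    if fst < 0.15:
        return "moderate"
    if fst < 0.25:
        return "high"
    return "very high"


# Invented allele frequencies in two populations
for freqs in ([0.5, 0.5], [0.6, 0.4], [0.9, 0.1], [1.0, 0.0]):
    fst = fst_two_alleles(freqs)
    print(freqs, round(fst, 3), wright_category(fst))
```

Identical frequencies give FST = 0, and populations fixed for different alleles give FST = 1, matching the two extremes described in the text.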
```
Check if all sequences were included in .arp file:
open the .arp file you just exported from DnaSP in a text editor (like Notepad) and scroll to the end of the file. Check that the last group does not start with SampleName = "Additional_Seqs", which would indicate that some samples were not included in the grouping in DnaSP.
Manually add a Structure block at the end of the file with the information needed to structure the sequences into populations (see image or copy the text below). Save as *_ed.arp [be sure to use exactly the same population names as before!]
[[Structure]]
StructureName="A group of 5 populations analyzed for DNA"
NbGroups=1
Group= {
"CO"
"LP"
"MD"
"RN"
"U"
}
```
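
The manual edit can also be scripted. The sketch below is one possible way to do it in Python, assuming the population names in the exported file are exactly CO, LP, MD, RN and U; the function names and the example file name are hypothetical.

```python
from pathlib import Path

# Structure block copied from the tutorial text.
STRUCTURE_BLOCK = """[[Structure]]
StructureName="A group of 5 populations analyzed for DNA"
NbGroups=1
Group= {
"CO"
"LP"
"MD"
"RN"
"U"
}
"""


def add_structure_block(arp_text):
    """Return the .arp content with the Structure block appended."""
    if "Additional_Seqs" in arp_text:
        # Same check as the manual one: some samples were left ungrouped.
        raise ValueError("Some samples were not assigned to a population in DnaSP")
    return arp_text.rstrip("\n") + "\n\n" + STRUCTURE_BLOCK


def write_edited_arp(arp_path):
    """Write <name>_ed.arp next to the exported file and return its path."""
    source = Path(arp_path)
    target = source.with_name(source.stem + "_ed.arp")
    target.write_text(add_structure_block(source.read_text()))
    return target


# Hypothetical usage: write_edited_arp("yellow_cardinal.arp")
```

It is still worth opening the resulting file in a text editor once to confirm that the block sits at the very end.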

```
Open Arlequin (WinArl35.exe) and Open a New Project selecting the *_ed.arp file.
In the third tab “Settings”, select AMOVA --> Standard AMOVA calculations.
Select also Population comparisons --> Compute pairwise FST.
Press on Start.
A new folder *.res is generated in the same folder as the input files.
The *_main.htm file contains the results. We can open this file by right click --> Open with --> Internet Explorer.
[Note: if it is automatically opened using Edge you can open the .htm File in IE Mode by clicking on the settings menu (three dots) in the top-right corner of the browser, and then in "Reload in Internet Explorer mode"]
```
<span style="color:blue">• Is there any evidence for genetic structure between populations?
</span>
<span style="color:blue">• Which pair of populations differ the most?
</span>
<span style="color:blue">• What percentage of the molecular variance is due to differentiation between individuals within a population?
</span>
<span style="color:blue">• What percentage is due to differentiation among populations?
</span>
### A4. Haplotype network
```
Finally, let’s draw a haplotype network to visualize the distance between the different haplotypes.
Edit (in Notepad) the NEXUS file generated in DnaSP by adding the population information for each sequence at the end of the file (copy/paste the following and complete it for all 94 individuals):
BEGIN TRAITS;
[The traits block is specific to PopART. The numbers in the matrix are number of
samples associated with each trait. The order of the columns must match the
order of TraitLabels. Separator can be comma, space, or tab.]
Dimensions NTRAITS=5;
Format labels=yes missing=? separator=Comma;
TraitLabels CO LP MD RN U;
Matrix
C007 1,0,0,0,0
C013 1,0,0,0,0
C020 1,0,0,0,0
C022 1,0,0,0,0
C023 1,0,0,0,0
...
L001 0,1,0,0,0
...
MD1 0,0,1,0,0
...
R002 0,0,0,1,0
...
U129 0,0,0,0,1
...
;
end;
--------------------------------
Note that the individuals’ names and order must match those at the beginning of the NEXUS file.
[Note: The file should finish with "; end;"]
---------------------------
Open the software PopART.
Press the NEX button to open the edited file
Click on Network --> Median Joining Network (ok)
Edit colors (optional) and haplotype names
```
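
Typing the matrix for all individuals by hand is error-prone, so it can help to generate it. The following sketch builds the TRAITS block from a list of sample names. It assumes that the population can be read from the name prefix (C = CO, L = LP, MD = MD, R = RN, U = U), which is inferred from the example rows and should be checked against the real file; the function names are hypothetical.

```python
TRAIT_LABELS = ["CO", "LP", "MD", "RN", "U"]

# Assumed mapping from sample-name prefix to population (check your names!)
PREFIX_TO_POP = {"MD": "MD", "C": "CO", "L": "LP", "R": "RN", "U": "U"}


def population_of(sample):
    """Return the population code for a sample name, longest prefix first."""
    for prefix in sorted(PREFIX_TO_POP, key=len, reverse=True):
        if sample.startswith(prefix):
            return PREFIX_TO_POP[prefix]
    raise ValueError(f"Unknown prefix in sample name: {sample}")


def traits_block(samples):
    """Build the PopART TRAITS block, one matrix row per sample."""
    lines = [
        "BEGIN TRAITS;",
        f"Dimensions NTRAITS={len(TRAIT_LABELS)};",
        "Format labels=yes missing=? separator=Comma;",
        "TraitLabels " + " ".join(TRAIT_LABELS) + ";",
        "Matrix",
    ]
    for sample in samples:
        pop = population_of(sample)
        row = ",".join("1" if label == pop else "0" for label in TRAIT_LABELS)
        lines.append(f"{sample} {row}")
    lines += [";", "end;"]
    return "\n".join(lines)


# Sample names taken from the example rows in the tutorial
block = traits_block(["C007", "L001", "MD1", "R002", "U129"])
print(block)
```

The list of names passed to the function must be in the same order as the sequences at the beginning of the NEXUS file.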
<span style="color:blue">• How many haplotypes are shared between populations?
</span>
<span style="color:blue">• How many private haplotypes (unique to that population) are there in Corrientes (CO)? How many in Mendoza (MD)?
</span>
<span style="color:blue">• Is there any population without any private haplotype?
</span>
```
In PopART, select "Statistics -> Identical Sequences".
```
<span style="color:blue">• How many abundant haplotypes are there? What percentage of the birds exhibit these abundant haplotypes?
</span>
---
### A5. References
Domínguez, M., Tiedemann, R., Reboreda, J. C., Segura, L., Tittarelli, F., & Mahler, B. (2017). Genetic structure reveals management units for the yellow cardinal (*Gubernatrix cristata*), endangered by habitat loss and illegal trapping. Conservation Genetics, 18(5), 1131-1140.
Excoffier, L., Smouse, P. E., & Quattro, J. M. (1992). Analysis of molecular variance inferred from metric distances among DNA haplotypes: application to human mitochondrial DNA restriction data. Genetics, 131(2), 479-491.
Rowe, G., Sweet, M., & Beebee, T. J. C. (2017). An introduction to molecular ecology. Oxford University Press.
Wright, S. (1978). Evolution and the genetics of populations, Vol. 4: Variability within and among natural populations. University of Chicago Press.
---
---
## PART B
## Molecular markers: microsatellites
Microsatellites or SSRs (simple sequence repeats) are tandem repeats of a two- to six-base pair motif flanked by conserved non-repetitive regions. Microsatellites exhibit a high degree of polymorphism, mainly due to variation in the number of repeats. Because they are hypervariable, non-coding and co-dominant, microsatellites are widely used in population genetics.
### ***Programs***

This tutorial assumes that alleles have already been called ([GeneMapper](https://www.thermofisher.com/order/catalog/product/de/en/4370784)).
### B1. Calculate allele frequencies and determine private alleles
Activate the GenAlEx add-in in Excel (just double-click the .xlam file in the folder where GenAlEx was installed).
Open the data file: “infileMicrosatellites_YelCar_ord.xlsx”.
The first column should contain the sample names, the second the region, and the remaining columns the alleles of the 10 microsatellite loci.
```
Add two empty rows before the data
Click on GenAlEx -> Parameters -> Pops from col2
```

The first two rows are now filled with the population names and the number of individuals per population.
```
In the Genalex menu:
choose Frequency-Based -> Frequency.
Enter the number of loci
Check that Number of Samples and Number of Populations are correct.
click OK
```
Examine the Allele frequency data parameters window and click OK. (Note that these are “codominant” data).
```
In the next window choose:
Frequency by Pop
Het, Fstat & Poly by Pop
Private Alleles List
(unclick Graph All Loci)
```

Examine the spreadsheets produced by GenAlEx.
*(It is advisable not to close Excel until the end of Part B of the tutorial, since we will need this spreadsheet repeatedly).*
`AFP spreadsheet contains the frequencies of all the alleles at each locus at each population: `
<span style="color:blue">• Which is the most polymorphic **locus** (the one with the highest number of alleles)? </span>
`HFP spreadsheet contains information on the number of alleles, observed heterozygosity and expected heterozygosity over loci for each population:`
<span style="color:blue">• Which **population** shows the highest mean number of alleles over loci (Na)? </span>
```
PAS spreadsheet contains the summary of private alleles per population:
```
<span style="color:blue">• Which population shows the highest number of private alleles?
</span>
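
To show what the frequency and private-allele tables contain, here is a small sketch that computes both for one locus. It assumes genotypes are given as (population, allele 1, allele 2) with 0 marking missing data, which is an assumption about the input coding; the function names and the allele sizes are invented for illustration.

```python
from collections import Counter, defaultdict


def allele_frequencies(genotypes, missing=0):
    """Return {population: {allele: frequency}} for one diploid locus."""
    counts = defaultdict(Counter)
    for pop, a1, a2 in genotypes:
        for allele in (a1, a2):
            if allele != missing:
                counts[pop][allele] += 1
    return {
        pop: {allele: n / sum(c.values()) for allele, n in c.items()}
        for pop, c in counts.items()
    }


def private_alleles(freqs):
    """Return {population: sorted alleles found in no other population}."""
    result = {}
    for pop, alleles in freqs.items():
        others = set()
        for other_pop, other_alleles in freqs.items():
            if other_pop != pop:
                others.update(other_alleles)
        result[pop] = sorted(set(alleles) - others)
    return result


# Invented genotypes (allele sizes in bp) for two populations
toy = [("CO", 150, 152), ("CO", 150, 150), ("LP", 152, 154), ("LP", 152, 152)]
freqs = allele_frequencies(toy)
private = private_alleles(freqs)
print(freqs)
print(private)
```

In this toy locus, allele 152 is shared, while 150 and 154 are each private to one population.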
### B2. Calculate Allelic Richness
Allelic richness (AR) is a measure of the number of alleles that is independent of sample size, which makes it possible to compare the number of alleles between populations with different sample sizes.
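
One common way to make allele counts comparable is rarefaction: the expected number of alleles in a subsample of n gene copies drawn from the N copies observed in a population. The sketch below implements that formula, AR = Σ [1 - C(N - Nᵢ, n) / C(N, n)], where Nᵢ is the count of allele i. It is offered as an illustration of the idea and is assumed, not confirmed here, to correspond to what FSTAT reports; the function name and the counts are hypothetical.

```python
from math import comb


def allelic_richness(allele_counts, n):
    """Expected number of alleles in a random subsample of n gene copies."""
    total = sum(allele_counts)
    if n > total:
        raise ValueError("Subsample size exceeds the number of gene copies")
    # For each allele: 1 minus the probability that it is absent from the subsample
    return sum(1 - comb(total - c, n) / comb(total, n) for c in allele_counts)


# Invented example: two populations, each with 10 gene copies and 2 alleles
even = allelic_richness([5, 5], n=2)   # both alleles common
rare = allelic_richness([9, 1], n=2)   # one allele rare
print(round(even, 3), round(rare, 3))
```

Both toy populations have two alleles, but the one with a rare allele has a lower rarefied richness because that allele is easily missed in a small subsample.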
```
Export the spreadsheet containing the alleles information (Sheet1) to a genpop format:
Genalex -> Import-Export -> Export -> GenPop. Name it as: infile_GenpopFormat.gen
(remember to add the extension .gen)
```
**Note:** If you have problems adding .gen, you probably need to configure Windows to show file extensions so that you can change them. [Check this!](https://support.winzip.com/hc/en-us/articles/115011457948-How-to-configure-Windows-to-show-file-extensions-and-hidden-files)
**Note:** Do not close this file in Excel. We will need it later to convert the data to the formats required by other programs.
```
Open the program FSTAT.
If requested, enter any number as a random seed so that the results can be reproduced (FSTAT uses a randomization method to test the data).
Click on Utilities -> File Conversion -> “Genpop -> Fstat”
Select the .gen file we generated with Genalex
(check that a .dat file was generated in the same folder containing the .gen file)
In the program Fstat, click on File -> Open and choose the .dat file
Click only on Allelic richness and then on Run at the bottom of the window.
A .out file with the results should have been generated in the same folder containing the .gen and .dat files.
Calculate the average AR per population, considering that each column is one population (in the order they were in the Excel file).
The last column is the allelic richness at the locus under consideration over all populations.
```

<span style="color:blue">• Which population(s) show the highest allelic richness? Please also report which population the calculations were based on (the one with the smallest sample size). </span>
### B3. Hardy-Weinberg Equilibrium and Linkage Disequilibrium

**Hardy-Weinberg equilibrium** = the state in which allele and genotype frequencies within a population remain constant from generation to generation because no evolutionary forces are acting upon them.
For a population to achieve Hardy-Weinberg equilibrium, it must satisfy five criteria:
1. **No Natural Selection**: no allele confers a survival or reproductive advantage over others.
1. **No Mutations**: no new alleles arise, so the gene pool is not modified from one generation to the next.
1. **Large Population Size**: this mitigates the effects of genetic drift, which can randomly alter allele frequencies in small populations, potentially leading to the fixation or loss of alleles.
1. **No Migration**: there are no immigrants that could introduce new alleles from other populations (no gene flow), nor do individuals leave the population (taking certain alleles with them).
1. **Random mating**: no sexual selection (every individual has an equal opportunity to mate).

We study it because the Hardy-Weinberg model lets us compare the actual genetic structure of a population with what would be expected if the population were in Hardy-Weinberg equilibrium (not evolving).
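
The comparison between observed and expected structure can be sketched with heterozygosities. The example below computes the observed heterozygosity (Ho, the fraction of heterozygous individuals) and the heterozygosity expected under Hardy-Weinberg equilibrium (He = 1 - Σ pᵢ²) for one locus. It only illustrates the quantities being compared and does not reproduce the exact test that Arlequin performs; the function name and the genotypes are invented.

```python
from collections import Counter


def heterozygosities(genotypes):
    """Return (Ho, He) for one locus given a list of (allele1, allele2)."""
    n = len(genotypes)
    ho = sum(a != b for a, b in genotypes) / n
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    he = 1 - sum((c / (2 * n)) ** 2 for c in counts.values())
    return ho, he


# Invented genotypes: same allele frequencies (0.5 / 0.5) in both samples
in_equilibrium = [(150, 150), (150, 152), (150, 152), (152, 152)]
het_deficit = [(150, 150), (150, 150), (152, 152), (152, 152)]

ho_eq, he_eq = heterozygosities(in_equilibrium)
ho_def, he_def = heterozygosities(het_deficit)
print(ho_eq, he_eq)
print(ho_def, he_def)
```

In the second sample the allele frequencies are unchanged but there are no heterozygotes at all, the kind of heterozygote deficit that a Hardy-Weinberg test is designed to detect.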
**Linkage Disequilibrium**: the non-random association of alleles at different loci (alleles from two loci co-occur more or less often than expected by chance).
It is important to analyze it because it is a signal of the genetic processes that are structuring the population. For example, if there is assortative mating, and individuals with allele A tend to mate with B types rather than C types, AB genotypes will be more frequent than expected under random mating.
Let's check if the cardinal populations are in Hardy-Weinberg equilibrium and examine their loci for linkage disequilibrium! :grin:
```
Export the spreadsheet containing the alleles information to Arlequin format:
Genalex -> Import-Export -> Export -> Arlequin.
Name it as: infile_micros_Arlequin.arp
(remember to add the extension .arp)
Open Arlequin (WinArl35.exe) and Open a New Project selecting the *.arp file.
In the Settings tab:
Click on Hardy-Weinberg and then on “Perform exact test of Hardy-Weinberg equilibrium”.
Click on Pairwise linkage: “Linkage disequilibrium between all pairs of loci”
Select AMOVA -> Standard AMOVA calculations.
Select also Population comparisons -> Compute pairwise FST.
Click on START (can take up to **30 minutes** depending on the computer resources - LD is generally slow).
A new folder *.res is generated in the same folder as the input files.
The *_main.htm file contains the results.
We can open this file by right click -> Open with -> Internet Explorer.
```
:hourglass_flowing_sand:  :hourglass_flowing_sand:
Remember to **adjust** the alpha (significance level) for the Linkage Disequilibrium tests, for instance by applying a **Bonferroni** **correction**. This adjustment is important because we are testing many hypotheses on the same data set, which increases the probability of committing type I errors (rejecting the null hypothesis when it is true).
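
As a worked example of the correction: with the 10 microsatellite loci used here, there are 45 possible pairs of loci, so 45 linkage tests are run within each population. The sketch below divides the nominal alpha by that number; the starting alpha of 0.05 and the function name are assumptions made for illustration.

```python
from math import comb


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level after a Bonferroni correction."""
    return alpha / n_tests


n_loci = 10
n_pairs = comb(n_loci, 2)                    # number of locus pairs tested
adjusted = bonferroni_alpha(0.05, n_pairs)   # threshold for each single test
print(n_pairs, round(adjusted, 5))
```

A pair of loci is then considered to be in significant linkage disequilibrium only if its p-value falls below the adjusted threshold. If the tests from all populations are treated as one family, the number of tests is correspondingly larger.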
<span style="color:blue">• Do the allele frequencies at microsatellite loci show deviations from HW equilibrium?
Identify specific loci within populations where observed heterozygosity is significantly lower than expected heterozygosity. For your report, consider calculating the average observed and expected heterozygosity for each population across loci and presenting them in a table. </span>
<span style="color:blue"> • Is there evidence of linkage disequilibrium between any pair of loci?</span>
## B4. Population Structure
### B4.1 AMOVA and Pairwise comparisons
Arlequin has also produced analyses of the genetic structure of the birds (check the end of the .htm file). Let's take a look at the AMOVA and population pairwise FSTs! :first_quarter_moon:
<span style="color:blue"> • Is there any evidence for genetic structure between populations?</span>
<span style="color:blue"> • What percentage of the molecular variance is due to differentiation between individuals within a population?</span>
<span style="color:blue"> • What percentage is due to differentiation among populations?</span>
<span style="color:blue"> • Which populations differ the most?</span>
### B4.2 Population Structure Analysis using a Bayesian Approach
Now we will estimate the number of genetically distinct groups in our dataset.
```
Export the Excel spreadsheet containing the alleles information to a Structure format:
Genalex -> Import-Export -> Export -> Structure.
Name it as: infile_Structure.txt
```
Please note that missing data has now been coded as -9.
Open **STRUCTURE** and create a **new project**.
*This program uses a Bayesian approach to probabilistically assign individuals to populations in a way that minimizes departures from Hardy-Weinberg and linkage equilibrium.*
```
Click on File -> New Project
```
In the four panels of the project wizard, enter the following information:
```
Panel 1
Choose a convenient project name,
a directory where you want to store the results, and
select infile_Structure.txt as the data file.
Panel 2
Specify the size of the data matrix, as well as how missing data is coded in the input file.
Individuals: 92
Ploidy: 2
Number of loci: 10
Missing data: -9
Panel 3
In the next two panels, you specify the format of the input file. Here, the rows included in the input file are specified.
Row of marker names: YES
Row of recessive alleles: no
Map distances between loci: no
Phase information: no
Data file stores data for individuals in a single line: YES
Panel 4
Finally, the columns contained in the input file are specified.
Individual ID for each individual: YES
Putative population origin for each individual: YES
USEPOPINFO selection flag: no
Sampling location information: no
Phenotype information: no
Other extra columns: YES (1)
```
```
click ‘Finish’ and then ‘Proceed’
```
In order to run STRUCTURE, you will first have to define a new parameter set
```
Click on the ‘New Parameter Set’ button
```
A window will open where in the first panel you’ll have to specify the run duration for the MCMC chain.
```
Select a burn-in of 2000 iterations followed by a further 15000 MCMC iterations
```
*These values are a lot shorter than would be used to get really accurate answers but will be relatively quick to run.*
```
In the Ancestry Model tab choose the Admixture model
```
This model allows individuals to have mixed ancestry (they can receive a proportion of ancestry from each of the populations).
```
Click OK and give the new parameter set a name. Suggestion: 2000-15000
Run STRUCTURE by clicking:
Project -> Start a Job.
Select the parameter set you just defined (2000-15000) and test K from 1 to 5 populations.
Number of iterations = 3.
```

Once it finishes, you can click on *Simulation Summary* to see the values of **Ln P(D)** for each run.
The model choice criterion implemented in STRUCTURE to detect the true K is an estimate of the posterior probability of the data for a given K, Pr(X|K) (Pritchard et al. 2000). The output reports the natural logarithm of this estimate, the **log likelihood**, for each K as 'Ln P(D)'.
The log-likelihood measures the goodness of fit of a model: the higher the value, the better the model fits the dataset. For example, a log-likelihood of -3000 is better than one of -3100.
*If you are interested, there are other approaches to choosing the best K, such as* [Evanno's method](https://pubmed.ncbi.nlm.nih.gov/15969739/) *(the most likely K is the one at which the second-order rate of change of ln Pr(X|K) between successive K values is largest), and programs that implement them:*
http://taylor0.biology.ucla.edu/structureHarvester
http://clumpak.tau.ac.il/ (under tab: Best K).
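
Both criteria can be sketched in a few lines. The example below picks K by the highest mean Ln P(D) and also computes a ΔK value in the style of Evanno's method, here taken as |L(K+1) - 2 L(K) + L(K-1)| on the mean Ln P(D) values divided by the standard deviation of Ln P(D) at K. This is a simplified illustration, and the dedicated programs listed above should be preferred for real analyses. The Ln P(D) values are invented and are not results from the cardinal data; the function names are hypothetical, and the code assumes consecutive K values with at least two replicate runs each.

```python
from statistics import mean, stdev


def best_k_by_lnpd(runs):
    """Return the K with the highest mean Ln P(D) across replicate runs."""
    means = {k: mean(values) for k, values in runs.items()}
    return max(means, key=means.get)


def delta_k(runs):
    """Return {K: delta K} for every K that has both neighbours K-1 and K+1."""
    means = {k: mean(values) for k, values in runs.items()}
    ks = sorted(runs)
    result = {}
    for k in ks[1:-1]:  # first and last K cannot be evaluated
        second_order = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        result[k] = second_order / stdev(runs[k])
    return result


# Invented Ln P(D) values: 3 replicate runs for each K from 1 to 5
runs = {
    1: [-3100.0, -3101.0, -3099.0],
    2: [-2950.0, -2952.0, -2948.0],
    3: [-2900.0, -2903.0, -2897.0],
    4: [-2905.0, -2915.0, -2895.0],
    5: [-2920.0, -2940.0, -2900.0],
}
k_lnpd = best_k_by_lnpd(runs)
dk = delta_k(runs)
k_evanno = max(dk, key=dk.get)
print(k_lnpd, k_evanno, dk)
```

In this invented example the two criteria disagree, which happens with real data as well. Note also that ΔK cannot be computed for the smallest K tested, so it can never support K = 1.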
Populations delimited by researchers on the basis of sampling often do not coincide with the actual behaviour of the gene pools involved. It may happen that two entities taken as distinct form a single panmictic unit, and conversely it may also happen that what is considered to be a homogeneous population is not.
<span style="color:blue"> • What is the most likely number of clusters in the data we are analyzing?</span>
#### Bayesian Clustering Plot
```
1. Go to the Results folder from your Structure results (for example: MyDoc/Labkurse/2000-15000/Results)
2. Zip the folder: Results.zip (right click on folder name -> Send To -> compressed zip folder)
3. Upload that to Clumpak website: http://clumpak.tau.ac.il/ in the tab “Main Pipeline and create a Structure plot”.
```
In the plot, the y axis shows the membership coefficient (q), the estimated proportion of an individual's genotype that originates from each of the K genetic clusters.
<span style="color:blue"> • What does each bar of the plot represent, and each color?</span>
<span style="color:blue"> • Considering a threshold of 50%: What percentage of the individuals from the CO population are assigned to the same cluster? Discuss what happened with the rest of the populations.</span>
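
The 50% threshold can be applied systematically. The sketch below assigns each individual to the cluster whose q exceeds the threshold and summarizes the percentages per population. The q values are invented, the function names are hypothetical, and clusters are numbered from 0 in the order of the columns; individuals with no q above the threshold are reported under `None`.

```python
from collections import Counter, defaultdict


def assigned_cluster(q, threshold=0.5):
    """Return the index of the cluster with q above the threshold, else None."""
    best = max(range(len(q)), key=lambda i: q[i])
    return best if q[best] > threshold else None


def assignment_summary(pops, q_rows, threshold=0.5):
    """Return {population: {cluster: percent of individuals assigned}}."""
    counts = defaultdict(Counter)
    for pop, q in zip(pops, q_rows):
        counts[pop][assigned_cluster(q, threshold)] += 1
    return {
        pop: {cluster: 100 * n / sum(c.values()) for cluster, n in c.items()}
        for pop, c in counts.items()
    }


# Invented membership coefficients for K = 2
pops = ["CO", "CO", "CO", "CO", "U", "U"]
q_rows = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.45, 0.55], [0.2, 0.8], [0.5, 0.5]]
summary = assignment_summary(pops, q_rows)
print(summary)
```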

## B4 References
Evanno, G., Regnaut, S., & Goudet, J. (2005). Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Molecular Ecology, 14(8), 2611-2620.
Pritchard, J. K., Stephens, M., & Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155, 945-959.
Rowe, G., Sweet, M., & Beebee, T. J. C. (2017). An introduction to molecular ecology. Oxford University Press.
---
---
## PART C
## FORENSIC ANALYSIS

### C1. Mitochondrial haplotypes in confiscated cardinals
We have already studied the population variation and structure of wild yellow cardinals of known origin. Let's see what information we can gain from analysing mtDNA control region sequences from the confiscated birds.
### C1.1 Editing of sequences (pre-processing)
**Edit raw sequences**
The results of the DNA sequencing are provided in two data files:
• .ab1 file contains the DNA sequence electropherogram as well as raw data
• .seq file is a simple sequence text file.
An electropherogram shows a sequence of peaks in four colors, where each color represents the base called for that peak. A textual version of the recorded sequence is also shown, which we can edit:

```
Open in BioEdit each .ab1 file of the samples we have sequenced.
Resize the windows so that you can inspect the electropherogram and the nucleotide sequence at the same time.
Examine the base calling and edit sequence: manually check the peaks in the electropherogram. Correct any base incorrectly called or undefined bases.
Erase the region at the beginning of the sequence where nucleotide determination is not possible.
Be aware of the different manual realignment modes:
```

When evaluating the .ab1 files, check the electropherogram to decide whether your data can be considered of good quality:
<span style="color:green">Good quality </span> sequencing data are characterized by well-defined peak resolution (bad resolution of the first 10-25 bases is acceptable), uniform peak spacing, and high signal-to-noise ratios.
<span style="color:red">Bad quality </span> .ab1 files usually cannot be opened with BioEdit, show a low signal-to-noise ratio, or finish prematurely.
```
After manually checking the base calling and editing the file:
save the edited sequence with a different name (e.g. Individual1_ed) and in .fasta format.
```
**Reverse complement sequences**
Because we also sequenced each sample using the reverse primer, we also received a sequence that reads 5' to 3' along the complementary strand.
We need to reverse complement these nucleotide sequences to be able to align them to the corresponding sequence in the forward strand.
The reverse complement of a DNA sequence is formed by reversing the order of the letters and interchanging C with G and A with T. After reverse complementing, the reverse sequence reads 5'-3' along the sense strand and is ready to be compared with other sequences.
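
The operation is simple enough to write out. This minimal sketch reverse complements a sequence in Python, keeping N (undetermined bases) unchanged; the function name and the example sequence are made up, and ambiguity codes other than N are left as they are.

```python
# Translation table: A<->T, C<->G, N stays N (upper and lower case)
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement every base, then reverse the order of the letters."""
    return seq.translate(COMPLEMENT)[::-1]


example = "AACGTN"
print(reverse_complement(example))
```

Applying the function twice returns the original sequence, which is a quick way to check that nothing was lost.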
```
To reverse complement a sequence you can use the following shortcut:
Shift + Ctrl + R , or click on the "Sequence" tab, select "Nucleic Acid" option, and then "Reverse Complement".
```
**Merge forward and reverse**
Given that we have sequenced with the forward and with the reverse primers, we have received two sequences per sample. The two sequences will mostly overlap, but together they will hopefully cover the entire fragment of interest (735 bp).
```
Paste both sequences (forward and reverse) associated with the same sample ID in the same window. Select both sequences (control + click) and then select the "Sequence" tab, "Pairwise alignment", "Align two sequences (allow ends to slide)".
Edit and save with the name: "SampleID_ed.fasta"
```
### BLAST
On the [NCBI website](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome), paste the sequences you obtained and BLAST them against the nucleotide collection of known sequences to confirm that they belong to our study species.
### C1.2 Multiple sequence alignment (MSA)
The aim is to align all the sequences (the edited ones from individuals sequenced during this labcourse and the ones from wild individuals previously sequenced) in a way that maximizes the number of matches (similarity).
Merge the sequences you edited from confiscated birds (N=4) with the previously sequenced samples (N=94):
```
In Bioedit, open the alignment of the 94 provided sequences (infileCR_YelCar.fas).
Paste all your edited sequences from confiscated birds in the same window (to do this, just open your confiscated_ed.fas files and copy/paste them at the end of the wild cardinals' alignment).
[If necessary: manually align them, and trim any overhanging bases in your edited sequences that exceed the length of the alignment.]
```
```
Alternative to manual alignment:
Select all sequences (edited + provided).
Click on Accessory Application --> CLUSTALW multiple alignment --> Run ClustalW
Using the buttons Shade identities and similarities and/or View Conservation by plotting identities, identify the mismatches and conserved domains.
Note: depending on the number of sequences and computer performance it can take several minutes to run.
```



```
Check that all sequences align well together.
Save this new alignment (all sequences must have same length) including wild + confiscated cardinals.
File --> Save As (choose .fasta format).[file name suggestion: infileCR_YelCarWild_and_Confis.fas]
```
### C1.3 Create a haplotype catalog
To find out whether the samples we sequenced show a haplotype already found in a particular region or a new haplotype, we will open the new alignment in the program DnaSP and create a haplotype list.
```
In DnaSP, open the new alignment containing your sequenced samples (.fas)
Select "Generate" -> "Haplotype Data File"
Include Invariable sites and save in nexus format as: HaplotypesList.nex
```
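
The logic behind the haplotype list can be sketched as follows: identical aligned sequences are collapsed into one haplotype, and each confiscated sequence is then looked up among the wild haplotypes. This is a minimal illustration that treats any difference (including gaps) as a new haplotype, which may not match DnaSP's handling of gaps and missing data; the function names, the confiscated sample names and the short sequences are invented.

```python
def haplotype_catalog(seqs):
    """Return {sequence: [sample names]} for a dict of aligned sequences."""
    catalog = {}
    for name, seq in seqs.items():
        catalog.setdefault(seq.upper(), []).append(name)
    return catalog


def match_confiscated(wild, confiscated):
    """Return {confiscated sample: wild samples sharing its haplotype}."""
    catalog = haplotype_catalog(wild)
    return {name: catalog.get(seq.upper(), []) for name, seq in confiscated.items()}


# Invented toy alignment
wild = {"C007": "ACGTAC", "C013": "ACGTAC", "L001": "ACGTAT", "U129": "TCGTAT"}
confiscated = {"Conf1": "ACGTAC", "Conf2": "ACGTAT", "Conf3": "GCGTAT"}

matches = match_confiscated(wild, confiscated)
print(matches)
```

An empty list means the confiscated bird carries a haplotype not present among the wild samples, and a list containing samples from a single region points to a haplotype private to that region.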
<span style="color:blue">• Do the samples from confiscated birds exhibit new haplotypes that have not been previously discovered? </span>
<span style="color:blue">• How many of the confiscated cardinals have a common haplotype? </span>
<span style="color:blue">• How many show haplotypes private to a region? </span>
## C2. Microsatellite allele calling
Open **GeneMapper** to be able to determine the lengths of the microsatellites.
The Getting Started Guide is available [here](https://assets.thermofisher.com/TFS-Assets/LSG/manuals/cms_042039.pdf).
<span style="color:blue"> • How many peaks of different colors do you expect to see in each sample?</span>
The orange-labelled size standard GeneScan 500 LIZ (Thermo Fisher Scientific) was added to each sequencing reaction. It contains 16 standardized fragments of different lengths for orientation purposes (35, 50, 75, 100, 139, 150, 160, 200, 250, 300, 340, 350, 400, 450, 490 and 500 bp).

This size standard consists of a mixture of fluorescently labeled DNA fragments of known lengths, that serve as references or markers that allow the determination of the sizes of the unknown DNA fragments in the samples.
```
In GeneMapper, select "Add Samples To Project" -> "Open Data".
Add the .fsa files from all the confiscated individuals, and click Add.
Select GS500 for Size Standard.
Select Microsatellite Default for Analysis Method.
[trick: select column and then CONTROL + D to fill down]
Select All [trick: CONTROL + A].
Click on the green button to Analyze samples.
Choose a ProjectName
Click on the Display Plots button
```
You can inspect one by one, or all samples together by selecting all.
You can zoom with the mouse on selected regions.
When you click on the peaks of interest, their size is shown at the bottom.
[Trick: CONTROL + G to only show selected peaks]
Take notes of the allele lengths.
We have worked with three loci in the laboratory; the sizes of the other seven markers will be provided by the instructors. With the lengths of all 10 microsatellite markers, we will be able to compare the results from the confiscated birds with the variability found in the wild for the species.
- In the Excel file "**infileMicrosatellites_YelCar_ord_confiscated_ToComplete1.xlsx**", fill in the missing allele sizes for the markers C2, F2 and H9 for the confiscated juvenile individuals.
This file will then contain the complete genotypes of the confiscated individuals as well as those of the natural populations you analyzed before (individuals of known origin from natural populations + confiscated cardinals of unknown origin).
## C3. Probability of assignment to one of the reference management units
**Assignment analysis**
These tests are based on the allele frequencies in the populations, and they report the probability that the genotype of a (confiscated) individual belongs to a population within a set of populations.
```
In Excel with the GenAlex macro activated:
Add 2 rows at the beginning, Parameters -> Pops from col2
Export -> Structure (check the number of loci and of samples)
Name it as: InFileMicrosStructure_withConfiscated.txt
Edit this file (for example in Notepad), changing the values in the 3rd column to 1
for wild birds of known origin, and to 0 for confiscated birds of unknown origin.
Open a New Project in Structure and, just as before, complete the required project information in 4 steps
(desired output directory, number of individuals, number of loci, etc.).
[Panel 3, same options as before: Row of marker names: YES; "Data file stores data for individuals in a single line": YES]
```
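
Editing the 3rd column by hand works for a few birds, but it can also be scripted. The sketch below assumes a whitespace-separated file with one header row of marker names, the sample ID in column 1 and the flag in column 3, and it identifies confiscated birds by a set of sample IDs; the IDs, the function name and the toy rows are hypothetical, so check them against your exported file.

```python
def set_popflag(lines, confiscated_ids, header_rows=1):
    """Set column 3 to 0 for confiscated birds and to 1 for wild birds.

    `lines` are the file's lines without trailing newlines,
    for example the result of text.splitlines().
    """
    edited = list(lines[:header_rows])  # the marker-name row is kept untouched
    for line in lines[header_rows:]:
        fields = line.split()
        if len(fields) < 3:  # blank or malformed line: leave it as it is
            edited.append(line)
            continue
        fields[2] = "0" if fields[0] in confiscated_ids else "1"
        edited.append("\t".join(fields))
    return edited


# Invented rows: marker names, one wild bird, one confiscated bird
toy_lines = [
    "C2\tF2\tH9",
    "C007\t1\t5\t150\t152\t201\t201\t98\t100",
    "Conf1\t6\t5\t150\t150\t-9\t-9\t98\t98",
]
edited_lines = set_popflag(toy_lines, confiscated_ids={"Conf1"})
print("\n".join(edited_lines))
```

The edited lines can then be written back to a text file and used as the data file for the new Structure project.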

Create a parameter set with a burn-in of 2000 steps followed by 15000 MCMC iterations, the admixture ancestry model, and correlated allele frequencies. In the advanced tab, apart from computing K, also select "Update allele frequencies using only individuals with POPFLAG=1".


```
Project -> Start a job.
Select the parameters just created
K from "x" to "x" Number of iterations = 3
```
Replace "x" by a number.
<span style="color:blue"> Based on our previous results, which K makes sense to use here?</span>

You can use [CLUMPAK](http://clumpak.tau.ac.il/) to average and plot the probabilities of assignment of each individual through the replicates (as we did in Part B).
<span style="color:blue"> • Considering a threshold of 50%: Are all the confiscated cardinals assigned to the same cluster?</span>
<span style="color:blue"> • Given these results, what conservation measures would you suggest for these confiscated yellow cardinals?</span>