###### tags: `Labcourse 2023: Evolutionsbiologie/Spezielle Zoologie - PART A`
##### 13-24 February 2023
##### Dr. Marisol Domínguez
# Population genetics of Harbour Porpoises

## Goal of the Tutorial
- Analyze two genetic markers to investigate the diversity and population structure of Harbour porpoises (*Phocoena phocoena*)
---
## PART A
## Molecular marker: mitochondrial DNA
### ***Programs***
Numerous software programs are available for assessing genetic diversity. In this tutorial, we use several that are freely available.

---
### **Genetic diversity of mitochondrial DNA**

* Maternally inherited
* Haploid

**Control Region (CR):**
* Non-coding
* Fast substitution rate
---
### A1. Edition and alignment of sequences (pre-processing)
**Edit raw sequences**
The results of the DNA sequencing are provided in two data files:
• .ab1 file contains the DNA sequence electropherogram as well as raw data
• .seq file is a simple sequence text file in FASTA format.
An electropherogram shows a sequence of peaks in four colors, where each color represents the base called for that peak. The called sequence is also shown as text, which we can edit:

```
Open each .ab1 file to inspect in BioEdit.
Resize the windows so that you can inspect the electropherogram and the nucleotide sequence at the same time.
Examine the base calling and edit the sequence: manually check the peaks in the electropherogram. Correct any incorrectly called or undefined bases.
Erase the primer region at the beginning of the sequence where nucleotide determination is not possible.
Be aware of the different manual realignment modes:
```

After checking the electropherogram of each .ab1 file, you can judge whether the data are of good quality.
Good quality sequencing data are characterized by:
* well-defined peak resolution (bad resolution of the first 10-25 bases is acceptable)
* uniform peak spacing
* high signal-to-noise ratios

Bad quality .ab1 files usually cannot be opened with BioEdit, show a low signal-to-noise ratio, or end prematurely.
```
After manually checking the base calling and editing the file, save the edited sequence under a different name (e.g. Individual1_ed) in .fasta format.
```
**Reverse complement sequence** <span style="color:red">(*Only if needed*)
</span>
If you sequenced with the reverse primer, the sequence you received runs 5' to 3' along the complementary DNA strand. You need to reverse complement these nucleotide sequences before you can align them to the large dataset of sequences.
The reverse complement of a DNA sequence is formed by reversing the letters, interchanging C for G and A for T (and vice versa). After doing the reverse complement the reverse sequence will be 5'-3' of the sense strand and will be ready to compare with other sequences.
```
To reverse complement a sequence you can use the following shortcut: Shift + Ctrl + R
```
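The reverse complement operation described above can also be sketched in a few lines of Python, which may help to check what the BioEdit shortcut does. This is a minimal illustration, not part of the BioEdit workflow; the function name `reverse_complement` is our own, and we assume the sequence contains only A, C, G, T and N (any other character, such as a gap, is left unchanged).

```python
# Complement table for the four bases plus N (unknown base); case is preserved.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence.

    Each base is first swapped for its complement (A<->T, C<->G),
    then the whole string is reversed.
    """
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("AAGCT"))  # prints AGCTT
```

Applying the function twice returns the original sequence, which is a quick sanity check.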
**Multiple sequence alignment (MSA)**
Open all the edited sequences in the same window in BioEdit (use copy sequence / paste sequence).
Please also edit each sequence name so that it includes the locality information.
In order to increase the sample size for downstream analysis, we will also include individuals that were previously sequenced. Overall the samples belong to 7 different regions: NOS (North Sea), SKA (Skagerrak), KAT (Kattegat), BES1 (Belt Sea 1), BES2 (Belt Sea 2), BES3 (Belt Sea 3), and IBS (Inner Baltic Sea).
#### Distribution of Harbour porpoises from the North Sea to the Baltic Sea
:whale:

###### Map of the North Sea/Baltic Sea area within which the samples were taken. NOS = North Sea, SKA = Skagerrak, KAT = Kattegat, BES = Belt Sea, IBS = Inner Baltic Sea.
To merge with previously analyzed samples:
```
Import all the sequences together (File --> Import --> Sequence alignment file) and select the file named CR_Bachelor_course.fa containing 179 sequences. Note that the name of the sample already contains information about the geographical origin.
```
The aim now is to align all the sequences (the edited ones from individuals sequenced during this labcourse and the ones from individuals previously sequenced) to maximize the number of matches (similarity).
```
Click on Accessory Application --> CLUSTALW multiple alignment --> Run ClustalW
Using the buttons Shade identities and similarities and/or View Conservation by plotting identities, identify the mismatches and conserved domains.
```



```
Delete overhangs and save the alignment (all sequences must have the same length)
File --> Save As (choose .fasta format).
```
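Before moving on, it can be useful to confirm that every sequence in the saved alignment really has the same length. The following Python sketch is one possible way to do this outside BioEdit. The function names are our own, and the parser assumes a plain FASTA file in which header lines start with `>`.

```python
from collections import Counter


def read_fasta(text: str) -> dict[str, str]:
    """Parse FASTA-formatted text into a {name: sequence} dictionary."""
    records: dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            # Sequences may be wrapped over several lines; join them.
            records[name] += line
    return records


def unequal_lengths(records: dict[str, str]) -> dict[str, int]:
    """Return {name: length} for sequences that differ from the most common length."""
    lengths = Counter(len(seq) for seq in records.values())
    expected = lengths.most_common(1)[0][0]
    return {name: len(seq) for name, seq in records.items() if len(seq) != expected}


if __name__ == "__main__":
    example = ">NOS_1\nACGT-ACGT\n>SKA_1\nACGTTACGT\n>KAT_1\nACGTTACG\n"
    print(unequal_lengths(read_fasta(example)))  # prints {'KAT_1': 8}
```

To check a real file, pass its contents to `read_fasta`, for example with `pathlib.Path(...).read_text()`. An empty result means all sequences have the same length.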
---
### A2. Diversity measures over the entire set

We will use the software DnaSP to analyze the alignment created in the previous steps.
```
File --> Open Data File (change extension to All files (*.*))
In Data --> Format indicate that we are dealing with mtDNA, haploid genome
Overview --> Polymorphism Data (Data Set: include all sequences)
```
<span style="color:blue">• How many haplotypes are in the sequences dataset?
</span>
<span style="color:blue">• How many polymorphic sites (S) are there in total?
</span>
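To make clear what these two numbers mean, here is a minimal Python sketch that counts haplotypes and polymorphic sites in a small aligned dataset. It is an illustration only and not DnaSP's implementation. As an assumption of this sketch, columns containing a gap (`-`), `N` or `?` are skipped when counting polymorphic sites, and haplotypes are simply the distinct sequence strings, so results may differ from DnaSP when the alignment contains gaps or missing data.

```python
MISSING = {"-", "N", "?"}


def count_haplotypes(seqs: list[str]) -> int:
    """Number of distinct sequences (haplotypes) in the alignment."""
    return len({s.upper() for s in seqs})


def count_polymorphic_sites(seqs: list[str]) -> int:
    """Number of alignment columns (S) with more than one base.

    Assumption of this sketch: columns with gaps or missing data are ignored.
    """
    count = 0
    for column in zip(*seqs):
        bases = {b.upper() for b in column}
        if bases & MISSING:
            continue
        if len(bases) > 1:
            count += 1
    return count


if __name__ == "__main__":
    toy = ["ACGTA", "ACGTA", "ACTTA", "GCTTA"]
    print(count_haplotypes(toy))         # prints 3
    print(count_polymorphic_sites(toy))  # prints 2
```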
---
### A3. Within populations variation
Populations could have been subject to different forces that made them evolve differently.
We will compare:
**Hd: haplotype diversity** (probability that two randomly chosen haplotypes are different), and
**Pi: nucleotide diversity** (average number of nucleotide differences per site between two randomly chosen sequences)
Note:
Large Hd + small Pi → many similar haplotypes
Small Hd + large Pi → few but very different haplotypes
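Both measures can be computed by hand for a small dataset, which helps to interpret the DnaSP output. The sketch below uses the standard estimators Hd = n/(n-1) · (1 − Σ pᵢ²), where pᵢ is the frequency of haplotype i among n sequences, and Pi as the mean number of pairwise differences divided by the alignment length. The function names are our own, and the sketch assumes an alignment without gaps or missing data, so it is not a replacement for DnaSP.

```python
from collections import Counter
from itertools import combinations


def haplotype_diversity(seqs: list[str]) -> float:
    """Hd = n/(n-1) * (1 - sum of squared haplotype frequencies)."""
    n = len(seqs)
    if n < 2:
        raise ValueError("at least two sequences are required")
    counts = Counter(seqs)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1 - sum_sq)


def nucleotide_diversity(seqs: list[str]) -> float:
    """Pi = mean pairwise differences per site (gap-free alignment assumed)."""
    if len(seqs) < 2:
        raise ValueError("at least two sequences are required")
    length = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total / len(pairs) / length


if __name__ == "__main__":
    toy = ["ACGTA", "ACGTA", "ACTTA", "GCTTA"]
    print(round(haplotype_diversity(toy), 3))   # prints 0.833
    print(round(nucleotide_diversity(toy), 3))  # prints 0.233
```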
```
Define populations in: Data --> Define Sequence Sets
```

```
Repeat the steps performed before (Overview --> Polymorphism Data), but this time choose each population in turn at Data Set.
```
<span style="color:blue">• Which is the population with the highest number of haplotypes?
</span>
<span style="color:blue">• What is the haplotype diversity per population? Which populations show the highest levels of Hd?
</span>
<span style="color:blue">• Does Pi change among the different populations studied?
</span>
```
Before closing DnaSP:
- export this file with the sequences clustered into populations as *.NEXUS (to be used to create a haplotype network in PopART).
- also export this file in Arlequin format (*.hap and *.arp). Make sure you save both files to the same directory and with the same name (different extensions).
```
---
### A4. Among populations variation
**AMOVA** stands for Analysis of MOlecular VAriance and is a method to detect population differentiation using molecular markers (Excoffier, Smouse & Quattro, 1992).
In panmictic populations, we would expect most of the variance to arise from within samples. If most of the variance instead occurs among samples within populations or among populations, this is evidence of population structure.
**FST** is a statistic commonly used to compare the genetic variation within populations with that between populations. The fixation index FST ranges from 0 to 1. A value of zero indicates no population structuring or subdivision (panmixia). A value of one implies that the populations examined do not share any genetic diversity.
According to Wright (1978), genetic differentiation can be defined as low for FST<0.05, moderate for 0.05<FST<0.15, high for 0.15<FST<0.25, and very high for FST>0.25.
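As a small aid for interpreting the Arlequin output, the sketch below shows how FST relates to the variance components of an AMOVA with a single group of populations (variance among populations divided by total variance), and how a value can be placed in Wright's categories. The function names are our own. Wright's thresholds are given with strict inequalities, so assigning the boundary values 0.05, 0.15 and 0.25 to the higher category is an assumption of this sketch.

```python
def fst_from_variance(among_pops: float, within_pops: float) -> float:
    """FST as the share of total variance found among populations.

    Applies to an AMOVA with one group: total = among + within populations.
    """
    total = among_pops + within_pops
    if total <= 0:
        raise ValueError("total variance must be positive")
    return among_pops / total


def classify_fst(fst: float) -> str:
    """Place an FST value in Wright's (1978) qualitative categories.

    Assumption: boundary values fall into the higher category.
    """
    if fst < 0.05:
        return "low"
    if fst < 0.15:
        return "moderate"
    if fst < 0.25:
        return "high"
    return "very high"


if __name__ == "__main__":
    # Hypothetical variance components, not results from this dataset.
    fst = fst_from_variance(among_pops=2.0, within_pops=8.0)
    print(fst, classify_fst(fst))  # prints 0.2 high
```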
```
Check if all sequences were included in .arp file:
Open the .arp file you just exported from DnaSP in a text editor (like Notepad) and scroll to the end of the file. Check that the last group does not start with SampleName = "Additional_Seqs". If it does, some samples were not included in the grouping in DnaSP.
Manually add at the end of the file a Structure Block with the information to structure the sequences into populations (copy the text below). Save as *_ed.arp
[[Structure]]
StructureName="A group of 7 populations analyzed for DNA"
NbGroups=1
Group= {
"NOS"
"SKA"
"KAT"
"BES1"
"BES2"
"BES3"
"IBS"
}
```
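If you prefer not to edit the file by hand, the same check and edit can be sketched in Python. This is only an illustration under stated assumptions: the function names are hypothetical, the code works on the text of the .arp file (reading and writing the file is left to you), and it assumes that the population names in the Structure block match the SampleName entries written by DnaSP.

```python
import re

POPULATIONS = ["NOS", "SKA", "KAT", "BES1", "BES2", "BES3", "IBS"]


def has_ungrouped_sequences(arp_text: str) -> bool:
    """True if DnaSP wrote a sample named "Additional_Seqs" (ungrouped sequences)."""
    return re.search(r'SampleName\s*=\s*"Additional_Seqs"', arp_text) is not None


def build_structure_block(populations: list[str]) -> str:
    """Build the Structure block shown above for one group of populations."""
    lines = [
        "[[Structure]]",
        f'StructureName="A group of {len(populations)} populations analyzed for DNA"',
        "NbGroups=1",
        "Group= {",
    ]
    lines += [f'"{pop}"' for pop in populations]
    lines.append("}")
    return "\n".join(lines) + "\n"


def add_structure_block(arp_text: str, populations: list[str]) -> str:
    """Return the .arp text with the Structure block appended at the end."""
    if has_ungrouped_sequences(arp_text):
        raise ValueError("some sequences were not assigned to a population in DnaSP")
    return arp_text.rstrip("\n") + "\n\n" + build_structure_block(populations)


if __name__ == "__main__":
    print(build_structure_block(POPULATIONS))
```

The returned text would then be saved as *_ed.arp, as in the manual procedure.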

```
Open Arlequin (WinArl35.exe) and Open a New Project selecting the *_ed.arp file.
In the third tab “Settings”, select AMOVA --> Standard AMOVA calculations.
Select also Population comparisons --> Compute pairwise FST.
Press on Start.
A new folder *.res is generated in the same folder as the input files.
The *_main.htm file contains the results. We can open this file by right click --> Open with --> Internet Explorer.
```
<span style="color:blue">• Is there any evidence for genetic structure between populations?
</span>
<span style="color:blue">• Was overall genetic differentiation low, moderate or high?
</span>
<span style="color:blue">• What percentage of the molecular variance is due to differentiation between individuals within a population?
</span>
<span style="color:blue">• What percentage is due to differentiation among populations?
</span>
<span style="color:blue">• Which populations differ the most?
</span>
```
Finally, let’s draw a haplotype network in order to visualize the distance between the different haplotypes.
Edit the NEXUS file generated in DnaSP adding the population information for each sequence at the end of the file:
BEGIN TRAITS;
[The traits block is specific to PopART. The numbers in the matrix are number of samples associated with each trait. The order of the columns must match the order of TraitLabels. Separator can be comma, space, or tab.]
Dimensions NTRAITS=7;
Format labels=yes missing=? separator=Comma;
TraitLabels NOS SKA KAT BES1 BES2 BES3 IBS;
Matrix
Individual_1 1,0,0,0,0,0,0
Individual_2 1,0,0,0,0,0,0
Individual_3 0,1,0,0,0,0,0
Individual_4 0,1,0,0,0,0,0
Individual_5 0,0,1,0,0,0,0
Individual_6 0,0,1,0,0,0,0
Individual_7 0,0,0,1,0,0,0
Individual_8 0,0,0,0,1,0,0
;
end;
--------------------------------
**Note** that the individuals’ names and their order must match those at the beginning of the NEXUS file
---------------------------
Open the software PopART.
Press the button NEX to open the edited file
Click on Network --> Median Joining Network (ok)
Edit colors and haplotypes names (optional)
```
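With 179 or more sequences, typing the traits matrix by hand is error-prone. The following Python sketch generates a traits block in the layout shown above from a mapping of sample names to populations. The function name `build_traits_block` and the example mapping are hypothetical. The sample names must be listed in the same order and with the same spelling as in the NEXUS file exported from DnaSP.

```python
TRAIT_LABELS = ["NOS", "SKA", "KAT", "BES1", "BES2", "BES3", "IBS"]


def build_traits_block(sample_to_pop: dict[str, str], labels: list[str]) -> str:
    """Build a PopART traits block with one row per sample.

    Each row has a 1 in the column of the sample's population and 0 elsewhere.
    Rows follow the insertion order of sample_to_pop.
    """
    lines = [
        "BEGIN TRAITS;",
        f"Dimensions NTRAITS={len(labels)};",
        "Format labels=yes missing=? separator=Comma;",
        "TraitLabels " + " ".join(labels) + ";",
        "Matrix",
    ]
    for sample, pop in sample_to_pop.items():
        if pop not in labels:
            raise ValueError(f"unknown population {pop!r} for sample {sample!r}")
        row = ",".join("1" if pop == label else "0" for label in labels)
        lines.append(f"{sample} {row}")
    lines += [";", "end;"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical sample names; replace with those in your NEXUS file.
    samples = {"Individual_1": "NOS", "Individual_3": "SKA", "Individual_8": "BES2"}
    print(build_traits_block(samples, TRAIT_LABELS))
```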
<span style="color:blue">• How many haplotypes are shared between populations?
</span>
<span style="color:blue">• How many private haplotypes (unique to one population) are there in Kattegat? How many in Skagerrak? And in the Inner Baltic Sea?
</span>
<span style="color:blue">• Is there any population without any private haplotype?
</span>
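The distinction between shared and private haplotypes can be made explicit with a short Python sketch, which may help when reading the network. The function name and the toy data are our own. Haplotypes are treated here as identical sequence strings, which assumes an alignment without gaps or missing data.

```python
from collections import defaultdict


def haplotype_sharing(samples: list[tuple[str, str]]) -> tuple[set[str], dict[str, set[str]]]:
    """Split haplotypes into shared and private ones.

    samples: (population, sequence) pairs.
    Returns (shared, private): shared is the set of haplotypes found in more
    than one population; private maps each population to the haplotypes
    found only there.
    """
    pops_by_hap: dict[str, set[str]] = defaultdict(set)
    for pop, seq in samples:
        pops_by_hap[seq].add(pop)

    shared = {hap for hap, pops in pops_by_hap.items() if len(pops) > 1}
    private: dict[str, set[str]] = defaultdict(set)
    for hap, pops in pops_by_hap.items():
        if len(pops) == 1:
            private[next(iter(pops))].add(hap)
    return shared, dict(private)


if __name__ == "__main__":
    # Toy data, not taken from the course dataset.
    toy = [("KAT", "ACGT"), ("SKA", "ACGT"), ("KAT", "ACGA"), ("IBS", "TCGT")]
    shared, private = haplotype_sharing(toy)
    print(shared)   # prints {'ACGT'}
    print(private)
```

A population that does not appear as a key in `private` has no private haplotype.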
```
Click on Statistics - Identical Sequences
```
<span style="color:blue">• How many abundant haplotypes are there? What percentage of the porpoises exhibit these abundant haplotypes?
</span>
### A5. Software manuals/papers
[BioEdit](https://www.researchgate.net/publication/258565830_BioEdit_An_important_software_for_molecular_biology/)
[DnaSP](https://www.researchgate.net/publication/320413912_DnaSP_6_DNA_Sequence_Polymorphism_Analysis_of_Large_Datasets/)
[Arlequin](https://journals.sagepub.com/doi/full/10.1177/117693430500100003)
[PopART](https://popart.otago.ac.nz/doc/popart.pdf/)
### A6. References
Excoffier, L., Smouse, P. E., & Quattro, J. M. (1992). Analysis of molecular variance inferred from metric distances among DNA haplotypes: application to human mitochondrial DNA restriction data. Genetics, 131(2), 479-491.
Wiemann, A., Andersen, L. W., Berggren, P., Siebert, U., Benke, H., Teilmann, J., Lockyer, C., Pawliczka, I., Skóra, K., Roos, A., Lyrholm, T., Paulus, K. B., Ketmaier, V., & Tiedemann, R. (2010). Mitochondrial Control Region and microsatellite analyses on harbour porpoise (Phocoena phocoena) unravel population differentiation in the Baltic Sea and adjacent waters. Conservation Genetics, 11, 195-211.
Wright, S. (1978). Evolution and the Genetics of Populations, Vol. 4. Variability Within and Among Natural Populations. University of Chicago Press, Chicago.