Practical 4: Language phylogeny and taking phylogeny into account

# Introduction This practical aims to get you more familiar with **linguistic phylogenetics** - linking to Lecture 12 - via hands-on coding and analysis, and also with **Bayesian phylogenetics** and how it works. In the last part, we also look at how to **quantitatively control for phylogeny** and examine the amount of **phylogenetic signal** in data, using both biological and cultural data. You should then be able apply these methods youselves to your own data for the protocol... (See note below) ![](https://hackmd.io/_uploads/SyzANbl5n.png) ## Data The data required is available here: https://drive.google.com/drive/folders/1BjTv3M6wN_pG4sIxsNq1NDIPIpjF67dp?usp=sharing ## Required software For the practical you will need: * [R](https://www.r-project.org/), and the packages nlme, ape, geiger and phangorn - these can be installed with `install.packages("PACKAGENAME")`. It makes sense to turn them all on at the beginning of the prac, so with `library(PACKAGENAME)` * [TNT](https://www.lillo.org.ar/phylogeny/tnt/) * [Mesquite](https://www.mesquiteproject.org/) * Excel or [Libre Office](https://www.libreoffice.org/) * [MrBayes](https://nbisweden.github.io/MrBayes/download.html) * [FigTree](http://tree.bio.ed.ac.uk/software/figtree/) * [SplitsTree](https://software-ab.cs.uni-tuebingen.de/download/splitstree4/welcome.html) * A text editor ## Note on the homework A word of warning with the protocol - it will be a bit different from the others! This time, we would like you to be a little more independent. First, you will code and analyze your own language matrix, using words and languages with which you might be familiar (or not, if you like!), using both parsimony and Bayesian approaches. Next, you will find your own phylogenetic and trait data from the literature - from *any* publication about anything - and do a non-phylogenetic and phylogenetic regression, and test for phylogenetic signal in the data. It doesn't matter at all if your results make sense - it's just about the method. # Part 1: Coding languages In this part of the prac, you will try out coding language data as data matrices based on cognate (word) data and phonological (sound) data. You will try this out using an example dataset, but can expand this on your own and/or for homework. You will then undertake phylogenetic analysis of this data. ## Cognate data First let's look back at that table we saw in the lecture of words in different languages: ![](https://hackmd.io/_uploads/ByAjGZ1qh.png) Code these words in Excel using binary coding, like this: ![](https://hackmd.io/_uploads/B18EXWk9h.png) Once you have finished, make a new project in Mesquite, and give it the appropriate number of taxa and characters for your data (choose *Make character matrix*, and use standard categorical data). Then copy and paste your data into Mesquite, to end up with something in this style, but with more characters: ![](https://hackmd.io/_uploads/S1DWNbk53.png) Now export this as a TNT file (File->Export->TNT), and put it in the same folder as TNT (Windows) or in you username folder (Mac; = home directory). Now open TNT and you can try running the file, with Greek (automatically actually ayway, as it is at the top) as the outgroup: ``` proc cognates.tnt Outgroup Greek hold10;mu50;bb; ``` If there is more than one tree then make a consensus and save it (Nelson consensus, * saves): ``` ne* ``` Then see the tree (just skip `ne` if only one tree anyway): ``` tplot ``` On Mac you might need to type the number and use a semi-colon, so e.g.: `tplot 0;` We are going to use code this time for TNT so that all computer types can do it, but you can also use the buttons in the GUI. Just ask me or look back to Prac 1 if you can't remember how to do this! What is the result? How many trees are there? Does it make sense? Now we can also check where the synapomorphies plot on the tree. You can either do this via the GUI (Optimize->Synapomorphies->Map common synapomorphies or List common synapomorphies), or you can get it by typing: For mapping `apo N` For a list `apo- N` Where N is the tree number. If a search returned two trees, the third (number 2!! TNT starts at zero) will be consensus, which is the one you want. If you want to see the node numbers to match up with the list, make sure they are turned on via the GUI first through Format->Show node numbers (I am not sure the code for this) or type: `naked -` Before plotting the tree. `naked =` turns them off again. If you ever want help with the commands you can type e.g. `help apo` and it shows some info! Did you get this to work? ...So what does this tell us? What words are synapomorphies of what groups? ## Phonological data We can also try coding sounds. Try coding the following table, this time as multistate characters: ![](https://hackmd.io/_uploads/SJR5BZJ9h.png) So in Excel, you should have something like this: ![](https://hackmd.io/_uploads/rJ1SIbk93.png) Now do the same stuff, copying it into Mesquite and exporting for TNT, and run it in TNT with Gothic as the outgroup: ``` proc sounds.tnt Outgroup Gothic hold10;mu50;bb; ``` If there is more than one tree then make a consensus and save it: ``` ne* ``` Then: ``` tplot ``` What does this tree look like? Is it different from what you might expect or the same? Does using phonology like this make sense? Compare to the tree in Bouckaert et al. (see reference below) if needed. Now check the synapomorphies again using the approach above `apo N` and `apo - N`. What's the result? ## Morphological data Now we're going to do the same thing but with grammatical features - morphology: ![](https://hackmd.io/_uploads/rkEGPWyc2.png) You can code this however you like, but I might try multitstate as it's a bit quicker: ![](https://hackmd.io/_uploads/HJCYvZy9h.png) ...and then try running in TNT with Greek as the outgroup again: ``` proc morphology.tnt Outgroup Greek hold10;mu50;bb; ne* tplot ``` Now check the synapomorphies once again using the approach explained under the cognate data section with `apo N` and `apo - N`. What's the result? Does it make sense? # Part 2: Bayesian analysis of language data This part of the practical is designed to get you familiar with Bayesian phylogenetics. We will use a linguistic dataset because this links to the earlier part of the prac and Lecture 12, and because this is quicker because it's smaller than DNA. However, the same approach (with slight modifications) could be used for DNA or morphology. We will use the software [MrBayes](https://nbisweden.github.io/MrBayes/) because it's very simple. There are others however, notably [BEAST](https://beast.community/), [BEAST2](https://www.beast2.org/) and [RevBayes](https://revbayes.github.io/). BEAST/BEAST2 are particularly useful for looking at divergence dates, and are what people usually use to do this in linguistic contexts too. We will have a look at a BEAST file, but not run it - do the block course *Phylogenetics in Evolution and Ecology* in September with [Dr. Faysal Bibi](https://www.museumfuernaturkunde.berlin/en/about/team/faysal.bibi) to try BEAST out! First, we will have a look at the file... *IE2011RelaxedCovarionAllSingletonsGeo.xml* ...which is a BEAST xml file, which is from the supplement of the Bouckaert et al. 2013 paper on Indo-European language expansion from Anatolia you saw in Lecture 12: > Bouckaert, R., Lemey, P., Dunn, M., Greenhill, S. J., Alekseyenko, A. V., Drummond, A. J., ... & Atkinson, Q. D. (2012). Mapping the origins and expansion of the Indo-European language family. Science, 337(6097), 957-960. ![](https://hackmd.io/_uploads/BJfkoqpFh.png) Let's open it in a text editor and have a look at it... Can you find the cognate (character) scorings, the language names, the longitude and latitude data, and enforced monophyly constraints (=commands saying certain groups have to be monophyletic, whatever the analysis yields)? ![](https://hackmd.io/_uploads/Bkjvi9TKn.png) You can look at the consensus tree generated from this data by Bouckaert et al. in [FigTree](http://tree.bio.ed.ac.uk/software/figtree/) - it's the file *1219669IndoEuropean2MCCtreesannotated.tre* Are there "fossil" languages present? Anything else to observe? I already took the liberty of extracting the cognate data from the xml file for the Germanic languages (so we can look at a subset), which includes English and German. It's in the file *Germanic_language_cognates.nex*. Open this in a text editor to have a quick look, and then also open in Mesquite to see it properly displayed if you like. We are going to analyze this data, first in TNT, then SplitsTree, and then MrBayes. ## Quick analysis in TNT I also already saved the .nex file as a .tnt file, *Germanic_language_cognates.tnt*. Quickly open that in a text editor to take a look at it. Now open TNT (make sure it's in the same directory as the TNT file) and type: ``` proc Germanic_language_cognates.tnt Outgroup Latin hold10;mu50;bb; ne tplot ``` You should have a tree of the Germanic languages, with Latin as an outgroup (the first taxon is automatically set as the outgroup if it is not specified). This might have node numbers on if you turned them on earlier. ![](https://hackmd.io/_uploads/BkHTsiaYn.png) Now let's try excluding Latin, and using Gothic as the outgroup: ``` proc Germanic_language_cognates.tnt Outgroup Gothic taxcode-0 hold10;mu50;bb; ``` If needed: `ne*` Then: ``` tplot ``` Does the topology look the same as with the Bouckaert et al. tree? If not, why not, do you think? ![](https://hackmd.io/_uploads/Sk35Zhpt3.png) For practice, we can try **mapping or listing synapomorphies** again, using `apo N` and `apo- N`. Unfortunately though, we don't have information about which word is which (Bouckaert et al. were annoyingly secretive!) so it's not really well interpretable... ## Looking at the data in SplitsTree As we saw before, language data doesn't always fit a tree that well because there is much horizontal transfer. For this reason, we can also have a look at the data in SplitsTree. Open SplitsTree, and open the file Germanic_language_cognates_simplified.nex. This is from the same matrix but saved from Mesquite by Export->Simplified nexus. What does this network tell us about the data? Which languages share the most data, and which differ most from other languages? ![](https://hackmd.io/_uploads/BJfBco6K3.png) ## Bayesian analysis in MrBayes Finally, we will undertake a Bayesian analysis with MrBayes. First, we will make a MrBayes file in Mesquite. Open the matrix Germanic_language_cognates_simplified.nex again, and do: File->Export NEXUS for MrBayes We **don't** want a **MrBayes block**, because we will set these parameters later, so you can **delete all the contents of this** and then save it as *germanicmrbayes.nex*: ![](https://hackmd.io/_uploads/SyndXnptn.png) Open MrBayes (or type `mb` in the terminal on Mac, after you have installed MrBayes using homebrew), making sure the MrBayes file is in the same folder (or on Mac, that the file is in the home directory). Now type the following: `execute germanicmrbayes.nex` Specify Mk+gamma model, which is one of the simplest `lset rates=gamma` Delete Latin and set the outgroup as Gothic (it's probably better, but up to you! Actually it has an interesting and maybe more accurate result when you keep both too, but it's a bit slower...): ``` delete Latin outgroup Gothic ``` Start the run whilst setting number of generations, sample frequency, the printing frequency, diagnostic frequency, and the fact that it should stop when the standard deviation of split frequencies reaches 0.01: `mcmc ngen=1000000 printfreq=100 samplefreq=100 diagnfreq=1000 nchains=4 stoprule=yes stopval=0.01` ![](https://hackmd.io/_uploads/rkV_s3Tth.png) This should be quite quick because the dataset is small, but be warned often these analyses take days! We can set the number of generations lower and the stopval higher to get a rougher result. Once it's finished, type the following: ``` sump sumt ``` It's important to make sure you do this before you quit MrBayes, or you basically don't have your results! **N.B. Errors:** You may get the error *Could not find parameter "RESPECTCASE"*. In this case just delete the word RESPECTCASE in the nexus file: ![](https://hackmd.io/_uploads/By9aD26Yn.png) You will also be told that with the current coding bias a lot of the characters were excluded - these are just invariable characters (because many languages are missing from this set because it's trimmed from the original data), so it's OK. Now we can look at our results. When you type sumt you will see a tree printed to the window already: ![](https://hackmd.io/_uploads/r1VV62pFn.png) This tree - the consensus tree showing posterior probabilities at nodes - is also written to the file *germanicmrbayes.nex.con.tre* in the MrBayes folder. You can open this in FigTree and have a look, and turn on the Node labels and choose the *prob* option under Display to see the posterior probabilities: ![](https://hackmd.io/_uploads/SyrGC2pYn.png) Congratulations - you have a Bayesian tree! Is the topology the same as with parsimony or different? Any thoughts why? Is it better or worse? What are the support values for particular groups? Why might some be higher than others? You can also have a look at the rest of the output from the analysis. In *Bayes output.txt* you have everything that printed to the terminal during the analysis. In *germanicmrbayes.nex.trprobs* you have all of the trees - in this case not so many - which yield the consensus tree. # Part 3: Taking phylogeny into account ## Conservation status of British birds Here you will use normal least squares and phylogenetic least squares regression with conservation-related data for British birds. This data is from: > Thomas, G. H. (2008). Phylogenetic distributions of british birds of conservation concern. Proceedings of the Royal Society B-Biological Sciences, 275(1647): 2077-2083. ![](https://hackmd.io/_uploads/BJ72UtaF2.png) We will look at two variables - range size (across the whole country) and population size. We would expect these to possibly be positively correlated with each other, because if there is a smaller range then there are likely to be fewer birds. However, this could be skewed, because e.g. on group of birds may tend to have very small population sizes and small ranges because of its biology (e.g. only lives at the top of mountains), so we want to check the effect of accounting for phylogeny. First, have a look at the files birddata.csv and birdtree.tre in a text editor. Now, we will read in the data and the tree (make sure you working directory is set to the folder where your files are - `setwd()` ): `birddata<-read.csv(file="birddata.csv",sep=",",header=TRUE)` `birdtree<-read.tree(file="birdtree.tre")` Now double check it all read in right by typing: `birddata` `birdtree` ...and for good measure... `plot(birdtree)` Now what we want to do is undertake a regression of population size against range size. We can do this with the gls() function of nlme. First load nlme and ape: `library(nlme)` `library(ape)` ...and now we are ready to undertake the regression, making an R object of the results: `gls.RP<-gls(Range_size~Pop_size,data=birddata)` Have a look at the object: `gls.RP` What do you see? Now summarize the regression result with: `summary(gls.RP)` Do you see the P value (the lower one)? Is there significant association? Also note the AIC value - it allows you to compare with a phylogenetic model fit in a moment. Now we will try the same regression, but this time taking phylogenetic correlation into account, using a Brownian-motion based correlation from the phylogeny: `pgls.RP<-gls(Range_size~Pop_size,data=birddata,correlation=corBrownian(phy=birdtree,form=~Binomial))` Look at the results: `pgls.RP` `summary(pgls.RP)` How do they differ? Why? Is it still significant? Comapre the AIC value - which one fits the data better given the number of parameters? Does the log likelihood agree with the AIC? This is the general approach to taking into account phylogeny - including it as an explicit correlation factor in the regression. In a moment we will also try this with language data... We can also test how much phylogenetic signal is in this data. To do this we'll use the phylosig() function from phytools, and Pagel's lambda. First let's make two objects with the data with a name for each data point: ``` ranges<-birddata$Range_size names(ranges)<-birddata$Binomial pops<-birddata$Pop_size names(pops)<-birddata$Binomial ``` Now let's calculate Pagel's lambda: ``` phylosig(birdtree,ranges,method="lambda") phylosig(birdtree,pops,method="lambda") ``` 1 means a high phylogenetic signal, 0 low. So is there a high or low phylogenetic signal? ## Bantu marriange customs and kinship customs In this part, you will look at categorical data from Bantu cultures in subsaharan Africa. These are whether people move in to the woman or the man's house after marriage ( matrilocal or patrilocal) and how descent (inheritance) occurs, whether via the monther's or father's line (matrilinial or patrilinial). There is also a language tree of the Bantu languages. This data is taken from Opie et al. 2014, but the tree doesn't have the original branch lengths (because I had to manually copy it from the figure!) but rather all set to one: > Opie, C., Shultz, S., Atkinson, Q. D., Currie, T., & Mace, R. (2014). Phylogenetic reconstruction of Bantu kinship challenges Main Sequence Theory of human social evolution. Proceedings of the National Academy of Sciences, 111(49), 17414-17419. ![](https://hackmd.io/_uploads/ByO4-byq2.png) First get the data into R: ``` bantu_data<-read.csv(file="bantu_data.csv",sep=",",header=TRUE) bantu_tree<-read.nexus(file="bantu_tree.nex") ``` Check them both: ``` bantu_data bantu_tree ``` Now you can regress the Residence_numeric (residence rules in numeric categories) against Descent_numeric, with and without phylogeny included, and compare the results: ``` gls.bantu <-gls(Descent_numeric~Residence_numeric,data=bantu_data) pgls.bantu <-gls(Descent_numeric~Residence_numeric,data=bantu_data,correlation=corBrownian(phy=bantu_tree,form=~Language)) ``` What is the difference? Are both significant? Has the correlation coefficient changed? What does this mean? Which one has a more negative AIC? Now you can see if there is significant phylogenetic signal in the data. This is a different method from with the bird data, because it is discrete categorical variables. Use the fitDiscrete function in geiger (make sure geiger is loaded) to calculate Pagel's lambda: First create separate objects with the data and give each number a name: ``` residence_numeric<-bantu_data$Residence_numeric descent_numeric<-bantu_data$Descent_numeric names(residence_numeric)<-bantu_data$Language names(descent_numeric)<-bantu_data$Language ``` Now fit the models: ``` fitDiscrete(bantu_tree,residence_numeric,model="ER",transform="lambda") fitDiscrete(bantu_tree,(descent_numeric+1),model="ER",transform="lambda") ``` Can you find Pagel's lambda? What does this tell us? # Additional note - copying a tree from a figure For the protocol, you will need a tree from the literature. Sometimes the tree file is not provided. One way around this is to contact the authors and ask for it. Alternatively, you can create the tree by hand in Mesquite. First create a matrix with the right number of taxa and no character matrix: ![](https://hackmd.io/_uploads/HJ3T8YZ93.png) Fill in the taxon names: ![](https://hackmd.io/_uploads/BJbzvtWch.png) Now press Taxa&Trees->New Tree Window->With Trees to Edit by Hand: ![](https://hackmd.io/_uploads/B1TmvF-c2.png) Edit the tree, using the tools to the left (hover over them to see what they do), and by clicking and dragging branches onto other branches: ![](https://hackmd.io/_uploads/rJjFwYZq2.png) Even when finished, this will not have branch lengths however, so you can set them all to 1, using Tree->Alter/Transform Branch Lengths->Set All Branche Lengths to 1.0 ![](https://hackmd.io/_uploads/ryEaPY-q3.png) Now it's ready for export. It's not ideal, because you don't have the true branch lengths, but better than nothing! Do File->Export then choose Export NEXUS File from Tree Source: ![](https://hackmd.io/_uploads/HJW4Ot-93.png) Then choose "Current tree": ![](https://hackmd.io/_uploads/HybwuYZc2.png) Keep the defaults for the rest, and save it as a .nex file, which you can then read into R with read.nexus() (from ape package). **Just get in touch if you have any trouble!** # References and resources Bouckaert, R., Lemey, P., Dunn, M., Greenhill, S. J., Alekseyenko, A. V., Drummond, A. J., ... & Atkinson, Q. D. (2012). Mapping the origins and expansion of the Indo-European language family. Science, 337(6097), 957-960. Opie, C., Shultz, S., Atkinson, Q. D., Currie, T., & Mace, R. (2014). Phylogenetic reconstruction of Bantu kinship challenges Main Sequence Theory of human social evolution. Proceedings of the National Academy of Sciences, 111(49), 17414-17419. Thomas, G. H. (2008). Phylogenetic distributions of british birds of conservation concern. Proceedings of the Royal Society B-Biological Sciences, 275(1647): 2077-2083. **Tutorial TNT for Mac and Windows:** https://pennyhiggins.com/2013/03/31/running-tnt-for-phylogenetic-analysis-on-a-mac/comment-page-1/ **Tutorial installation and use MrBayes:** https://evolution.gs.washington.edu/sisg/2014/2014_SISG_12_11.pdf