# Notes 12th May 2025

OCSEAN linguistic collection as it is now (also see the attached table):

1) A bit of overview in numbers

Included in the master files:

- 22 Indonesia (incl. 4 Abui varieties, 2 Balinese varieties), number of words ranging from 94 to 1226
- 34 Philippines (incl. 2-3? Hiligaynon varieties, 2 Ivatan varieties), number of words ranging from 239 to 1225
- 4 Philippines (but with problems with lots of words)

NOT or NOT YET included in the master file (problems listed in the table):

- 15 Philippines
- 3 Indonesia (Abui variety; Klon, file exactly as Kelon; + Loloan Malay)

There are additional languages for which collection was attempted but which were left out: not finished, are varieties, not enough data, or data not considered of good quality from the start:

- 22 from Indonesia
- 18 from Philippines

2) The DATA overview file is attached. It is part of the master file; input from George and Hanna, Kertu and Natalya is here:

- Master sheet - has lots of info about the languages that had some collection effort; there are also details of what was missing
- UnitedLanguageMasterFileUpdatesKertu - overview of the languages that Kertu and Natalya were working with: how many words, what the separators were, whether these were added to the united master file
- note that a number of languages are still missing some metadata, like location

Link to our cleaned language data files from Kertu (the ones that had 1) data, 2) consent in some form): https://drive.google.com/drive/folders/132C_f9f2YpKc6NARyCx29bO1ll1Am3xQ?usp=share_link

In the folder there are:

- files with suffix `*PostQC` - these are individual language files
- individual language files should mostly follow the naming criteria given during the summer school, `OCSEAN_LANGUAGE_3-letter-Code-Date-WORDLIST_numberOfwords`, but there were also files with slightly different naming; these have been kept as they were for consistency.
- #REFERENCE CODE in R for doing QC
- R commands for the process that has been used for cleaning and creating master files (`CreatingMasterFileWithAllLanguages_DATE.R`).
- Ocsean data analyses overview by Kertu Jan 2025 - this is the log of what was done
- `Languages_uniteted_master*` with separate dates - attempts to create the master file with all available data
  - the idea was to make files with a separate line for each synonym listed for any of the original meanings. I'm not sure it was all successful, but there are many repeated lines.
- `Wordlist.xlsx` - the full list of meanings that was collected
  - note that not all languages have the full list
- There were several inconsistencies and problems with separators; I think all of this is listed in the log file.

We also decided at some point to make a master file with only the Indonesian languages, as these files were more straightforward (and there were far fewer of them).

3) GENETIC dataset - below are the references that Max listed in the Genetic Data deliverable. I'm not sure this includes all the data that are included in the Excel table, though; Max, please update. Also, all PHILIPPINE populations that have genetic data are missing from the language matching file.

Brucato, N., André, M., Tsang, R., Saag, L., Kariwiga, J., Sesuki, K., Beni, T., Pomat, W., Muke, J., Meyer, V. and Boland, A., 2021. Papua New Guinean genomes reveal the complex settlement of north Sahul. Molecular Biology and Evolution, 38(11), pp.5107-5121.

Choin, J., Mendoza-Revilla, J., Arauna, L.R., Cuadros-Espinoza, S., Cassar, O., Larena, M., Ko, A.M.S., Harmant, C., Laurent, R., Verdu, P. and Laval, G., 2021. Genomic insights into population history and biological adaptation in Oceania. Nature, 592(7855), pp.583-589.

Jacobs, G.S., Hudjashov, G., Saag, L., Kusuma, P., Darusallam, C.C., Lawson, D.J., Mondal, M., Pagani, L., Ricaut, F.X., Stoneking, M. and Metspalu, M., 2019. Multiple deeply divergent Denisovan ancestries in Papuans. Cell, 177(4), pp.1010-1021.

Larena, M., Sanchez-Quinto, F., Sjödin, P., McKenna, J., Ebeo, C., Reyes, R., Casel, O., Huang, J.Y., Hagada, K.P., Guilay, D. and Reyes, J., 2021. Multiple migrations to the Philippines during the last 50,000 years. Proceedings of the National Academy of Sciences, 118(13), p.e2026132118.

Malaspinas, A.S., Westaway, M.C., Muller, C., Sousa, V.C., Lao, O., Alves, I., Bergström, A., Athanasiadis, G., Cheng, J.Y., Crawford, J.E. and Heupink, T.H., 2016. A genomic history of Aboriginal Australia. Nature, 538(7624), pp.207-214.

Mallick, S., Li, H., Lipson, M., Mathieson, I., Gymrek, M., Racimo, F., Zhao, M., Chennagiri, N., Nordenfelt, S., Tandon, A. and Skoglund, P., 2016. The Simons genome diversity project: 300 genomes from 142 diverse populations. Nature, 538(7624), pp.201-206.

Vernot, B., Tucci, S., Kelso, J., Schraiber, J.G., Wolf, A.B., Gittelman, R.M., Dannemann, M., Grote, S., McCoy, R.C., Norton, H. and Scheinfeldt, L.B., 2016. Excavating Neandertal and Denisovan DNA from the genomes of Melanesian individuals. Science, 352(6282), pp.235-239.

# Notes from Johannes Meeting 16th May 2025

* IPA is "optional"; the value is to make the data more useful in the future
* It might be possible to provide "phonemic" data
* "phonetic" is a lot harder
* Goal: present phylogeny of the language
* cognacy etc.
* "confident enough that we are not overtly wrong"
* Quick impression on making IPA from phonemic data:
  * clts.clld.org/parameters uses "CLTS" (Cross-Linguistic Transcription Systems)
  * Each "grapheme" (i.e. letter in the phonetic system) is mapped between different phonetic systems
  * Use [pyclts](https://pypi.org/project/pyclts/)
  * Define a "transcription system" to map to other transcription systems
* Tree inference tools don't have ways to "mask out" recent borrowing
* Claim is that "this is negligible" / handled by varying rates
* Network approaches are "not convincing"
* Borrowing detection approaches are not really there
* Use a Swadesh list or similar etc.
* Cultural packages identify borrowing etc.
* Johannes has a tool called Tasiendis (sp?) which is useful for advanced tasks, but LingPy/CLTS might be enough for now
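
The grapheme mapping idea from the CLTS notes above can be sketched in plain Python. This is only an illustration of the principle (split a word into graphemes, then look each one up in a mapping table); it does not use pyclts or LingPy. The `PROFILE` table, the function names and the example words are all hypothetical and would need to be replaced by a per-language mapping agreed with the collectors.

```python
# Hypothetical grapheme-to-IPA table for an imaginary practical orthography.
# A real table would be built per language, or taken from CLTS data.
PROFILE = {
    "ng": "ŋ",
    "ny": "ɲ",
    "c": "tʃ",
    "j": "dʒ",
    "y": "j",
    "'": "ʔ",
}


def segment(word, profile):
    """Split `word` into graphemes by greedy longest match against `profile`.

    Characters not covered by the profile are kept as single graphemes.
    """
    word = word.lower()
    longest = max(len(grapheme) for grapheme in profile)
    graphemes = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, fall back to a single character.
        for size in range(min(longest, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in profile or size == 1:
                graphemes.append(chunk)
                i += size
                break
    return graphemes


def to_ipa(word, profile):
    """Map each grapheme to its IPA value; unknown graphemes pass through."""
    return [profile.get(grapheme, grapheme) for grapheme in segment(word, profile)]


if __name__ == "__main__":
    for example in ["bunga", "nyanyi", "cuci"]:
        print(example, " ".join(to_ipa(example, PROFILE)))
```

Graphemes that pass through unchanged are worth listing separately during QC, since they are either already valid IPA or a gap in the mapping table.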