# OCSEAN Data Science Workshop Report

**Date:** June 2025

**Authors:** John Lennon L. Calorio, I Made Sena Darmasetiyawan, Putu Wahyu Widiatmika, I Komang Sumaryana Putra, Dendi Wijaya, Christopher Kinipi, Daniel Lawson

**Abstract:** This technical report documents the activities of the Bristol Computational Linguistics and Data Science workshop, held May-June 2025 in Bristol, UK. It comprises the following activities:

1. Linguistic Data Standardization
2. Linguistic Initial Analysis
3. Linkage

The inputs are wordlists from previous workshops, and the output is a standardized word and concept list suitable for comparative linguistics.

## 1. Linguistic Data Standardization

The code to perform this cleaning is located in the [OCSEAN workshop github](https://github.com/danjlawson/ocseanworkshop2025/) as [01-process_linguistic_excel_data](https://github.com/danjlawson/ocseanworkshop2025/blob/main/01-process_linguistic_excel_data.ipynb).

Inputs:

1. [OCSEAN_language_collection_overview_April_25.xlsx](https://docs.google.com/spreadsheets/d/1RUWazX1T68fyK4T38dtEAMTpe1uwkSXC/edit?usp=sharing&ouid=102924535421804298439&rtpof=true&sd=true): The overview of the entire language collection. Note that this has been updated to [OCSEAN_language_collection_overview_June_25.xlsx](https://docs.google.com/spreadsheets/d/1aqpIG47b3RfELu0uLAlaVcWaa5zUd5R5/edit?usp=sharing&ouid=102924535421804298439&rtpof=true&sd=true).
2. `CleanedFiles`, consisting of 79 `PostQC.xlsx` files containing the wordlists.

Outputs:

1. `OCSEAN_language_collection_overview_June_25.xlsx`: An update of the language collection.
2. `CleanedFiles-v1.1`, consisting of 61 files with edited wordlists called `edited-XXX`, where `XXX` is their name in the `CleanedFiles` folder, while the remaining 19 files were taken from the `CleanedFiles` folder.
3. `OCSEAN_processed_joineddata.tsv`, consisting of every form of every concept for every language provided that survived quality control.

### 1a. Sanitisation

* What did we do with the languages?

  Languages were compared on their forms, omitting the IPA, since most languages do not have IPA completely recorded. In summary, only 13 languages have both IPA and English columns. One language has only an IPA column, while 65 other languages have only the English column. In total, 14 languages have IPA and 78 languages have English (see footers 1 to 3 for these languages). The Philippines contributed the most languages to the analysis, about 68.4% of the total. About 30.4% of the languages were from Indonesia, while the remaining 1.3% were from Papua New Guinea.

![image](https://hackmd.io/_uploads/HyOuV6t4ll.png)
**Figure __:** Number of languages per country

* What was done with the 'special' characters?

    1. Glottal stops were written in several different ways across languages, so we unified them into the single character 'ʔ'. Glottal stops at the very beginning of forms were also removed prior to any further processing, except in languages such as 'Abui Bunggeta', 'Abui Kilakawada', 'Abui Mobyetang', and 'Abui Pelman', which were confirmed to employ these glottal stops.
    2. Letters with accents, such as 'à', 'ѐ', 'í', among others, were converted to regular English alphabet characters such as a, e, i, respectively. This decision was made because there was considerable variation in how sounds were encoded in different languages (i.e., there was no consistent orthographic rule) and, moreover, the accents represent consistently predictable sound change that does not carry a per-word phonemic signal.
    3. Some forms contained parentheses, e.g., 'cave (kweba)', from which the parenthesized part '(kweba)' was removed, because both forms have the same meaning (concept).
    4. Trailing spaces, new lines, tabs, carriage returns, and non-breaking and breaking spaces were removed from both forms and concepts, or replaced with "." when a word separator was needed. These changes were meant to tidy up the data, since data collection was done by various teams. In total, 1349 corrections were automated.
    5. In the output, cleaned 'forms' now contain only the letters of the alphabet plus the characters '.', 'ŋ', 'ɲ', and 'ʔ' (which are used consistently with IPA).
    6. Some words had multiple forms given. Where computationally feasible, these were separated automatically; e.g., some language files use "~", ",", "/" or ";" as separators. In other cases, multiple forms were separated with parentheses, which were corrected manually. This problem arises when one concept is represented by more than one form. The variation of forms can occur due to differences in how a language encodes meaning, which may be influenced by cognitive factors (such as perception, categorization, or metaphorical thinking) and environmental factors (such as cultural practices, ecological context, or contact with other languages). These influences lead to the use of distinct forms to express what may appear to be a single underlying concept.
    7. The compilation of these changes can be tracked in this [Google Sheets link](https://docs.google.com/spreadsheets/d/1NXtKuFkrvVsehFdRse8D1fVq7DoKlgci/edit?usp=sharing&ouid=102924535421804298439&rtpof=true&sd=true).

**Table __:** Replacements and actions applied to bad symbols

| Replacement/Action Taken | Bad symbols | Description |
|:------------------------:|:------------------------------------------------------:|:---------------------------------:|
| . | ' ', '–', '-' | spaces and dashes to a dot/period |
| (Removed) | '=', '?', '', '…', '!', '̃', '̇', '̝', '̪' | |
| ʔ | '´', "'", '’', '´', '‘' | unified glottal stops |
| a | 'à', 'â', 'ᵄ' | a variants to 'a' |
| e | 'ė', 'ѐ', 'è', 'é', 'ẽ', 'ĕ', 'ê', 'ə', 'ɛ', 'ḗ' | e variants to 'e' |
| i | 'ì', 'í', 'î', 'ᵎ' | i variants to 'i' |
| o | 'ò', 'ó', 'ό', 'ô', 'ö', 'ō', 'ŏ', 'ɔ', 'ɔ̝', 'ṑ', 'ṓ' | o variants to 'o' |
| u | 'ù', 'ú', 'û' | u variants to 'u' |
| n | 'ǹ', 'n̪', 'ň' | n variants to 'n' |
| ɲ | 'ñ' | enye to 'ɲ' |
| aa | a: | |
| ii | i: | |
| oo | oː | |
| uu | u: | |
| c | tʃ | |
| j | dʒ | |
| g | ɡ | g variant to normal g |

The figure below summarizes the number of times these 'bad symbols' appeared in the data prior to data cleaning.

![bad_symbols_frequency](https://hackmd.io/_uploads/rJtnEaKVge.png)
**Figure __:** Frequency of bad symbols in word column.

The figure below summarizes the symbols considered acceptable.

![accepted_symbols_frequency](https://hackmd.io/_uploads/rkFRVaYExl.png)
**Figure __:** Frequency of accepted non-English language symbols in word column.

### 1b. Standardization

Due to standardization of word lists, most concepts are recorded in most languages. However, looking at the distribution of how many languages share a concept, we see that there is a real split between "common" concepts and "rare" concepts. Across all languages, the natural separator is at 25 occurrences: entries that occurred fewer than 25 times can be regarded as rare concepts.

Some rare concepts may arise from accidental data entry errors, such as missing or misplaced characters, incorrect translations, or misalignments during data compilation. Another reason is that certain languages included in the dataset may have been documented with specific linguistic or cultural goals in mind. For instance, in the MHP dataset (Loloan Malay), the researchers who collected the wordlist prioritized vocabulary related to culture.
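
The split between common and rare concepts described above can be sketched as follows. This is a minimal illustration, not the workshop's actual pipeline: the function names `split_concepts` and `rare_per_language`, the `(language, concept, form)` row layout, and the toy rows are all assumptions made for the example. Only the cutoff of 25 occurrences comes from the text.

```python
from collections import Counter


def split_concepts(rows, threshold=25):
    """Split concepts into (common, rare) sets.

    rows: list of (language, concept, form) tuples (assumed layout).
    A concept is rare when it occurs fewer than `threshold` times overall.
    """
    counts = Counter(concept for _, concept, _ in rows)
    rare = {c for c, n in counts.items() if n < threshold}
    common = set(counts) - rare
    return common, rare


def rare_per_language(rows, rare):
    """Count how many rare-concept entries each language contributes."""
    return Counter(lang for lang, concept, _ in rows if concept in rare)


# Toy rows, invented for illustration (not taken from the OCSEAN data).
toy_rows = [
    ("LangA", "water", "form1"),
    ("LangB", "water", "form2"),
    ("LangC", "water", "form3"),
    ("LangA", "watr", "form4"),  # a likely typo of "water"
]

# A small threshold so the toy data shows both groups.
common, rare = split_concepts(toy_rows, threshold=3)
print(sorted(common))  # ['water']
print(sorted(rare))    # ['watr']
print(dict(rare_per_language(toy_rows, rare)))  # {'LangA': 1}
```

Keeping the threshold as a parameter makes it easy to check how sensitive the common/rare split is to the chosen cutoff.
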
Finally, rare concepts can also arise during word list elicitation: participants may give multiple concepts with clarification of usage, which results in new concepts being introduced.

These rare concepts can be found [in this link](https://docs.google.com/spreadsheets/d/1mLflELr2G25oTy9FJAFmyybEZNcinUXw/edit?usp=sharing&ouid=102924535421804298439&rtpof=true&sd=true). In total, 1,404 rare concepts were recorded. Of these, 635 were corrected because the concept suggested through computational similarity was a better version, e.g., where a word had a missing letter or an additional letter that did not make sense (possibly due to a typing error). The remaining 769 were left as they were because each of them is unique and has no corresponding meaning among the other concepts in any wordlist.

The figure below shows how many rare concepts were found in each language. Loloan Malay had the highest number of rare concepts, over 800. This tells us that this language had the most unique concepts among the languages. This language was collected separately and so did not use the OCSEAN concept list.

![Untitled](https://hackmd.io/_uploads/H1IiYcO4xx.png)
**Figure __:** Number of rare concepts per language

The figure below describes how many times the rare concepts appeared across all languages. For example, the concepts that appeared only once are summarized in the 1 bar, which shows that many concepts appeared only once. On the far right, we see that far fewer concepts appeared 25 times across all languages. This is similar to the figure above, but it shows the frequency of concepts across all languages rather than per language.

![histogram_rare_concepts](https://hackmd.io/_uploads/H18gr6YNle.png)
**Figure __:** Histogram of the number of rare concepts

### 1c. Metadata

The metadata covers 79 different languages across Indonesia, the Philippines, and Papua New Guinea that can be analyzed through cognate forms, with consideration of their genetic and geographical coordinates. However, because Ibanag lacks entries (it has only an IPA column, with no corresponding English), only 78 languages were taken into account in the analysis.

Following the computational changes to the metadata, several languages had too many unique errors (these cases were identified computationally), so it was more efficient to revise them manually. This was based on each language stored in `CleanedFiles-v1.1`.

## 2. Linguistic Initial Analysis

This section of the report is still in development.

The [latest pipeline in Github](https://github.com/jllcalorio/ocsean-data-science-workshop-2025/blob/main/process_linguistics_data.ipynb) includes the processes for determining which languages have IPA and English columns, correcting the rare concepts, separating different concepts into several rows, and standardizing characters. The latest results are stored in this [Google Drive link](https://drive.google.com/drive/folders/1CqubLYbtx8qo_eEQl0O31Kejz1QVBBAi?usp=sharing).

The figure below shows how many concepts there are in each language. The concepts considered here are only those that have corresponding forms in that language. For example, the language 'Sabu Seba' lists 1,290 concepts but has only 100 forms, so it is shown with a correspondingly small number of concepts. In total, 76,692 concepts were recorded in the cleaned dataset across 78 languages.

![image](https://hackmd.io/_uploads/r1WBSaK4lg.png)
**Figure __:** Number of rare concepts (with forms) per language

The analysis for this project was done using LingPy, based on List et al. (2018), which can automatically detect cognates from the phonemic entries and align them across multiple languages.
The algorithms used were LexStat and SCA, which compare, identify, and group these similarities. Results for the languages were then compared to the genetic mapping. The following plot provides a general overview of the findings:

![OCSEAN_Clarity_Genes_Language_page-0001](https://hackmd.io/_uploads/SyPueLt4lx.jpg)
**Figure __:** (Rename me)

**Borrowing case**

This case is not recommended for the time being, due to too many uncertainties in the analysis (it can still be mentioned under 'future consideration'). The differences between borrowing and inheritance are too subtle and depend on each individual language. The semantic fields provided on the WOLD CLLD website match the number of semantic fields in the wordlist; however, the list on that website also contains a significant number of errors (e.g., in Balinese, where the word 'sepi' clearly has the Balinese word 'suwung').

## 3. Linkage

This section of the report is still in development.

Generally, the distributions of languages and genes showed a similar pattern. Languages that showed a strong level of cognates are also reflected in their corresponding genetic analysis. Predictions from languages and genes also overlapped, which supports the validity of the findings. Results from the LingPy analysis also established the ground for language family trees (i.e., Austronesian, Indo-European, or Trans-New Guinea), regardless of the geographical coordinates where these languages are actively used. These languages are marked with scattered white brackets in the plot. (note: need examples of languages using specific words)

In terms of genes, changes of colour in the plot would suggest how populations migrated from one area to another; preferences might be due to various topographical, social, and cultural aspects. Overlapping colours in the plot can also suggest a strong level of assimilation in these aspects. (note: also need examples of, perhaps, cross culture marriages)

Furthermore, when combined with the languages, the finding above would also reflect strong language maintenance amongst geographically neighbouring languages, as well as pointing out the characteristics of a specific language in the face of influences from languages of different language trees.

---

**Footer:**

1. Languages with both IPA and English columns (n = 13): Abui Bunggeta, Ba'a, Enggano, Hiligaynon_20220813, Ibaloy/ Ibaloi, Ivatan Isabtangan, Ivatan Itbayaten / Ivatan_Ichbayatan, Mawesdai, Sabu Seba, Sinurigaonon, Tagakaulo, Tuwali, and Umbu-Ungu.
2. Language with IPA column only (n = 1): Ibanag
3. Languages with English column only (n = 65): Abui Kilakawada, Abui Mobyetang, Abui Pelman, Adang, Agta, Agusan Manobo, Agutaynen, Akeanon, Alurung, Arta, Ata, Ati, Balangao, Bali Aga, Balinese, Batak, Bicolano, Boholano, Bolinao, Bontoc, Buhid, Bulus, Chabacano Caviteño, Cuyunon, Dela, Gaddang, Hanunuo, Hattang Kaye, Hiligaynon_20240802, Ilognon (Hiligaynon), Ilokano, Inabaknon, Kalinga, Kamayo, Kankana-ey, Kapampangan, Kelon, Kinaray-a, Kolibogan, Kupang Malay, Kusa, Loloan Malay, Manea, Manobo Tasaday, Mauta, Mawes Wares, Meranao, Minamanwa, Molbog, Obo Manobo, Palawano, Pangasinan, Reta Ternate, Sabu Raijua, Sangil, Sinama Banguingui, Sinama Sitangkai, Sinama Tabawan, Tagbanua Central, Talaandig (Binukid), Tausug, Tboli, Uab Meto, Waray, and Yakan
4. Languages that have not been QC'd (n = 36): Abui Takalelang, Alorung/Alorese Alor Kecil, Amanuban, Bagobo Tagabawa, Blaan Sarangani, Blagar Pantar Timur, Butuanon, Casiguranin, Cebuano, Dumagat San Luis Aurora, Higaonon, Ilongot, Iranun, Iraya, Ivatan, Kagan, Kamang Bukapiting, Mentawai Buttui, Mentawai Muntei, Rote, Dengka, Rote, Talae, Rote, Tii, Sabu Seba Ethnobotany, Sabu, Liae, Sambal, Savu, Sinama Davao, Subanen, Sumba Wejewa, Sumba, Anakalang, Sumba, Loli, Tagbanwa Tandulanen, Termanu, Tetun, Atambua, Umajamnon, Wersing Pantai Selatan.

# To do tasks:

1. Check "OCSEAN_rareconcepts_matching_correction" versions.