Penn Annotated Corpus 15

# Penn Annotated Corpus 15

https://github.com/olcaytaner/TurkishAnnotatedCorpus-15

https://github.com/olcaytaner/TurkishAnnotatedTreeBank-15

This dataset presents the first multilayer annotated corpus for Turkish, a low-resource agglutinative language. The dataset consists of 9,600 sentences translated from the Penn Treebank corpus. The annotated layers carry syntactic and semantic information: morphological disambiguation of words, named entity annotation, shallow parse, sense annotation, and semantic role label annotation. The original paper can be accessed [here](https://ieeexplore.ieee.org/abstract/document/8374369).

## Dataset Details

The corpus currently contains 9,600 sentences, whose English originals are taken from the Penn Treebank. Annotated layers include morphological disambiguation of words, named entities, shallow parse, senses, and semantic role labels. Each word is stored as a sequence of `{tag=value}` pairs:

```
{turkish=yatırımcılar} {analysis=yatırımcı+NOUN+A3PL+PNON+NOM} {semantics=0841060}{namedEntity=NONE} {shallowParse=ÖZNE}{propbank=ARG0:0006410}
```

The "turkish" tag holds the original Turkish word; the "analysis" tag holds the correct morphological parse of that word; the "semantics" tag holds the ID of the correct sense of that word; the "namedEntity" tag holds the named entity tag of that word; the "shallowParse" tag holds the shallow parse tag of that word (here ÖZNE, "subject"); and the "propbank" tag holds the semantic role of that word together with the verb synset ID (the frame ID in the frame file) for which the role is assigned.
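The tag layout described above lends itself to a small regular-expression parser. The following is a minimal sketch, not the authors' tooling: the function names (`parse_word`, `parse_sentence`, `split_role`) are hypothetical, and the code assumes that words are separated by whitespace, that all tags of one word are written contiguously (as in the sample sentence in the Samples section), that no value contains a closing brace, and that the role and frame ID in the "propbank" tag are separated by `$`.

```python
import re
from typing import Dict, List, Tuple

# One {key=value} layer; assumes neither key nor value contains "}".
LAYER = re.compile(r"\{([^=}]+)=([^}]*)\}")


def parse_word(token: str) -> Dict[str, str]:
    """Map each annotation layer of a single word to its value."""
    return dict(LAYER.findall(token))


def parse_sentence(line: str) -> List[Dict[str, str]]:
    """Split a sentence on whitespace between words, then parse each word."""
    tokens = re.split(r"(?<=\})\s+(?=\{)", line.strip())
    return [parse_word(tok) for tok in tokens if tok]


def split_role(propbank: str) -> Tuple[str, str]:
    """Split 'ROLE$FRAME_ID' into (role, frame_id); frame_id is '' if absent."""
    role, _, frame_id = propbank.partition("$")
    return role, frame_id


# Two words copied from the sample sentence in this README.
SAMPLE = (
    "{turkish=kavgayı}{morphologicalAnalysis=kavga+NOUN+A3SG+PNON+ACC}"
    "{metaMorphemes=kavga+yH}{semantics=TUR10-0135880}{namedEntity=NONE}"
    "{propbank=NONE}{shallowParse=NESNE}{universalDependency=11$OBJ} "
    "{turkish=bulandırdı}"
    "{morphologicalAnalysis=bulan+VERB^DB+VERB+CAUS+POS+PAST+A3SG}"
    "{metaMorphemes=bulan+DHr+DH}{semantics=TUR10-0122530}{namedEntity=NONE}"
    "{propbank=PREDICATE$TUR10-0122530}{shallowParse=YÜKLEM}"
    "{universalDependency=0$ROOT}"
)

words = parse_sentence(SAMPLE)
for word in words:
    # Print the surface form, its shallow parse tag, and its semantic role.
    print(word["turkish"], word.get("shallowParse"), split_role(word["propbank"]))
```

Using `dict.get` for optional layers is deliberate: not every word carries every tag (the final punctuation mark in the sample sentence has no "shallowParse" tag, for instance).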
### Samples

Example:

```
{turkish=Devasa}{morphologicalAnalysis=devasa+ADJ}{metaMorphemes=devasa}{semantics=TUR10-0199270}{namedEntity=NONE}{propbank=ARG0$TUR10-0122530}{shallowParse=ÖZNE}{universalDependency=2$AMOD} {turkish=ölçekli}{morphologicalAnalysis=ölçek+NOUN+A3SG+PNON+NOM^DB+ADJ+WITH}{metaMorphemes=ölçek+lH}{semantics=TUR10-0600610}{namedEntity=NONE}{propbank=ARG0$TUR10-0122530}{shallowParse=ÖZNE}{universalDependency=4$AMOD} {turkish=yeni}{morphologicalAnalysis=yeni+ADJ}{metaMorphemes=yeni}{semantics=TUR10-0848700}{namedEntity=NONE}{propbank=ARG0$TUR10-0122530}{shallowParse=ÖZNE}{universalDependency=4$AMOD} {turkish=kanunda}{morphologicalAnalysis=kanun+NOUN+A3SG+PNON+LOC}{metaMorphemes=kanun+DA}{semantics=TUR10-0411070}{namedEntity=NONE}{propbank=ARG0$TUR10-0122530}{shallowParse=ÖZNE}{universalDependency=5$OBL} {turkish=kullanılan}{morphologicalAnalysis=kullan+VERB^DB+VERB+PASS+POS^DB+ADJ+PRESPART}{metaMorphemes=kullan+Hl+yAn}{semantics=TUR10-0327720}{namedEntity=NONE}{propbank=ARG0$TUR10-0122530}{shallowParse=ÖZNE}{universalDependency=9$ACL} {turkish=karmaşık}{morphologicalAnalysis=karmaşık+ADJ}{metaMorphemes=karmaşık}{semantics=TUR10-0422250}{namedEntity=NONE}{propbank=ARG0$TUR10-0122530}{shallowParse=ÖZNE}{universalDependency=9$AMOD} {turkish=ve}{morphologicalAnalysis=ve+CONJ}{metaMorphemes=ve}{semantics=TUR10-0816400}{namedEntity=NONE}{propbank=ARG0$TUR10-0122530}{shallowParse=ÖZNE}{universalDependency=8$CC} {turkish=çetrefilli}{morphologicalAnalysis=çetrefil+NOUN+A3SG+PNON+NOM^DB+ADJ+WITH}{metaMorphemes=çetrefil+lH}{semantics=TUR10-0160820}{namedEntity=NONE}{propbank=ARG0$TUR10-0122530}{shallowParse=ÖZNE}{universalDependency=6$CONJ} {turkish=dil}{morphologicalAnalysis=dil+NOUN+A3SG+PNON+NOM}{metaMorphemes=dil}{semantics=TUR10-0512360}{namedEntity=NONE}{propbank=ARG0$TUR10-0122530}{shallowParse=ÖZNE}{universalDependency=11$NSUBJ} {turkish=kavgayı}{morphologicalAnalysis=kavga+NOUN+A3SG+PNON+ACC}{metaMorphemes=kavga+yH}{semantics=TUR10-0135880}{namedEntity=NONE}{propbank=NONE}{shallowParse=NESNE}{universalDependency=11$OBJ} {turkish=bulandırdı}{morphologicalAnalysis=bulan+VERB^DB+VERB+CAUS+POS+PAST+A3SG}{metaMorphemes=bulan+DHr+DH}{semantics=TUR10-0122530}{namedEntity=NONE}{propbank=PREDICATE$TUR10-0122530}{shallowParse=YÜKLEM}{universalDependency=0$ROOT} {turkish=.}{morphologicalAnalysis=.+PUNC}{metaMorphemes=.}{semantics=TUR10-1081860}{namedEntity=NONE}{propbank=NONE}{universalDependency=11$PUNCT}
```

### Fields

| field | dtype |
|----------|------------|
| turkish | string |
| analysis | string |
| semantics | string |
| namedEntity | string |
| shallowParse | string |
| propbank | string |

### Splits

| Training | Validation | Test |
|----------|------------|-------|
| 8658 | 359 | 540 |

## Dataset Creation

### Curation Rationale

Most NLP studies focus on analytic languages such as English and many other Indo-European languages, whereas studies on agglutinative languages such as Turkish are limited. Agglutinative languages are arguably more difficult to work with, because a word may take on numerous different meanings through morphological markers such as affixes.

### Data Source

The sentences in this dataset are Turkish translations of sentences taken from the Penn Treebank.

### Annotations

The annotation process is explained as follows in the original paper:

> For the annotation, we are using an in-house NLP Toolkit, which supports all the operations mentioned in the sections above. To accomplish the annotation, we integrated corresponding editors (morphological disambiguation editor, entity annotation editor, shallow parse editor, word sense annotation editor, predicate editor, argument editor) to our toolkit in order to use the same infrastructure.
> The same annotation logic applies for all editors. Words in the middle area are clickable. Once the user clicks a word, a possible item (morphological analyses for morphological disambiguation, entity tags for entity annotation, shallow parse tags for shallow parsing, distinct senses for word sense annotation, roles for argument annotation) that can be assigned to the node pops up. After the selection is made, the selected item is printed below the node. Since we use translated sentences from Penn Treebank, all sentences have their English counterparts. These sentences can guide and help annotators to check their annotation.
>
> We worked with six annotators, all undergraduate students of Işık University. Video guidelines for all editors for annotation were prepared based on guidelines provided by linguists. These annotators were trained before starting to annotate files. The corpus was divided into six equal parts and each part was assigned to a single student.

Details about the annotation process of each annotation layer can be found in the original paper.

### Quality

The annotated dataset was evaluated with an inter-annotator agreement measure: two different groups of annotators annotated the same sentences. Due to a lack of time, only 100 of the 9,600 sentences could be re-annotated. Inter-annotator agreement scores, expected agreement scores, and Cohen’s kappa coefficients are given in the table below. The expected inter-annotator agreement is calculated by assuming that the annotators annotated completely at random.
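Cohen's kappa relates the observed agreement $p_o$ to the expected (chance) agreement $p_e$ as $\kappa = (p_o - p_e) / (1 - p_e)$. The sketch below applies this standard formula to the agreement and expected-agreement figures reported for each layer. The function name `cohens_kappa` and the `REPORTED` dictionary are illustrative and do not come from the paper or its toolkit; the recomputed values match the published kappa column only up to rounding.

```python
from typing import Dict, Tuple


def cohens_kappa(observed: float, expected: float) -> float:
    """Cohen's kappa from observed and chance (expected) agreement."""
    if expected >= 1.0:
        raise ValueError("expected agreement must be below 1")
    return (observed - expected) / (1.0 - expected)


# (agreement, expected agreement) per layer, as reported in this README.
REPORTED: Dict[str, Tuple[float, float]] = {
    "NER": (0.975, 0.167),
    "Shallow Parse": (0.79, 0.167),
    "Sense Annotation": (0.785, 0.545),
}

for layer, (p_o, p_e) in REPORTED.items():
    print(f"{layer}: kappa = {cohens_kappa(p_o, p_e):.3f}")
```

The formula also shows why sense annotation has a much lower kappa than shallow parsing despite a similar raw agreement: its expected agreement is far higher, so less of the observed agreement is credited beyond chance.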
| Layer | Agreement | Expected Agreement | Cohen’s Kappa |
| -------- | -------- | -------- | -------- |
| NER | 0.975 | 0.167 | 0.969 |
| Shallow Parse | 0.79 | 0.167 | 0.748 |
| Sense Annotation | 0.785 | 0.545 | 0.527 |

## Additional Information

### Dataset Curators

Olcay Taner Yıldız, Koray Ak, Gökhan Ercan, Ozan Topsakal, Cengiz Asmazoğlu

### Citation Information

Please cite the following paper if you find this dataset useful:

Olcay Taner Yıldız, Koray Ak, Gökhan Ercan, Ozan Topsakal, Cengiz Asmazoğlu. 2018. A multilayer annotated corpus for Turkish. 2018 2nd International Conference on Natural Language and Speech Processing (ICNLSP), Algiers, Algeria.

```
@inproceedings{yildiz_penn15,
  title={A multilayer annotated corpus for Turkish},
  author={Olcay Taner Yıldız and Koray Ak and Gökhan Ercan and Ozan Topsakal and Cengiz Asmazoğlu},
  booktitle={2018 2nd International Conference on Natural Language and Speech Processing (ICNLSP)},
  year={2018}
}
```