A Turkish-English parallel corpus created for phrase-based statistical machine translation from English into Turkish. This dataset contains only the Turkish side of the parallel corpus. Large-scale parallel text resources for this language pair are scarce, so the corpus was created by augmenting the limited training data with content words and phrases.
The parallel data consists mainly of documents in international relations and legal documents from sources such as the Turkish Ministry of Foreign Affairs, EU, etc.
The Turkish data is represented with its morphological segmentation; instead of surface morphemes, lexical morphemes are used to abstract away word-internal details and conflate statistics for seemingly different suffixes.
The Turkish side of the data consists of 56609 sentences containing 13767 unique words. The data contains 102998 morphological tags (18237 of them unique), along with 18006 unique roots and 231 unique suffixes.
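As a minimal sketch (not the curators' code) of how such counts could be derived from the segmented corpus: the file name and the one-sentence-per-line, space-separated token format below are assumptions, with tokens starting with "+" taken to be suffixes/tags and all other tokens taken to be roots.

```python
# Assumed format: one segmented sentence per line, tokens separated by spaces.
roots, suffixes = set(), set()
tag_tokens = 0
sentences = 0

with open("turkish_segmented.txt", encoding="utf-8") as f:  # hypothetical file name
    for line in f:
        sentences += 1
        for tok in line.split():
            if tok.startswith("+"):   # suffix/tag token
                suffixes.add(tok)
                tag_tokens += 1
            else:                     # root token
                roots.add(tok)

print(sentences, "sentences,", tag_tokens, "tag tokens,",
      len(roots), "unique roots,", len(suffixes), "unique suffixes")
```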
Example:

English sentences are represented with "full" words, while Turkish sentences are represented with tokens for root words and bound morphemes/tags. For example, the four tokens [sağla] [+hn] [+ma] [+sh] would be used on the Turkish side for the surface word sağlanması.

Note that when the morphemes/tags (tokens starting with a +) are concatenated, we get the "word-based" version of the corpus, since surface words are recoverable from the concatenated representation.
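To make the concatenation step concrete, here is a minimal sketch assuming the space-separated token format shown above. The helper name is hypothetical, and regenerating the surface form (here sağlanması) from the concatenated lexical morphemes would additionally require morphographemic rules that this sketch omits.

```python
def to_word_based(tokens):
    """Glue each +morpheme/tag token onto the root before it,
    yielding one token per word as described above."""
    words = []
    for tok in tokens:
        if tok.startswith("+") and words:
            words[-1] += tok      # attach morpheme/tag to the current word
        else:
            words.append(tok)     # a root token starts a new word
    return words

# The four segmented tokens from the example collapse into one word unit:
print(to_word_based(["sağla", "+hn", "+ma", "+sh"]))
# -> ['sağla+hn+ma+sh']
```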
The dataset is motivated by the lack of large-scale parallel corpora for statistical machine translation into Turkish.
The words in the Turkish corpus are segmented into lexical morphemes, whereby differences in the surface representations of morphemes due to word-internal phenomena are abstracted away to improve statistics during alignment. Once the contextually salient morphological interpretation of a word is selected, the curators discard the morphosyntactic features, leaving behind the lexical morphemes making up the word.
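The following is an illustrative sketch of that feature-discarding step. The bracketed feature annotations in the input are an assumption modeled on common Turkish morphological-analyzer conventions, not the dataset's actual intermediate format; only the +morpheme tokens reflect the dataset's representation.

```python
import re

def strip_features(analysis: str) -> list[str]:
    """Drop the [...] feature annotations, keeping the root and the
    lexical +morphemes (hypothetical input format)."""
    return [re.sub(r"\[[^\]]*\]", "", tok) for tok in analysis.split()]

print(strip_features("sağla[Verb] +hn[Pass] +ma[Inf] +sh[P3sg]"))
# -> ['sağla', '+hn', '+ma', '+sh']
```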
From these morphologically segmented corpora, the sequence of roots of open-class content words (nouns, adjectives, adverbs, and verbs) is extracted for each sentence. For Turkish, this corresponds to removing all morphemes and the roots of closed-class words. These root sequences are used to augment the training corpus and bias content-word alignments, without introducing additional noise into the alignment process.
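A sketch of this extraction step, assuming each root token carries a (hypothetical) part-of-speech label from the morphological analysis stage:

```python
OPEN_CLASSES = {"Noun", "Adj", "Adv", "Verb"}

def content_roots(sentence):
    """sentence: list of (token, pos) pairs; pos is None for +morpheme tokens.
    Returns the sequence of open-class roots, dropping all morphemes and
    closed-class roots, as described above."""
    return [tok for tok, pos in sentence
            if not tok.startswith("+") and pos in OPEN_CLASSES]

example = [("sağla", "Verb"), ("+hn", None), ("+ma", None), ("+sh", None),
           ("ve", "Conj")]          # "ve" (and) is closed-class, so dropped
print(content_roots(example))       # -> ['sağla']
```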
Published by Kemal Oflazer. Ilknur Durgar El-Kahlout worked on many of the details and experiments. Seyma Mutlu implemented the word-repair code.
Please cite the following paper if you find this dataset useful:
Oflazer, Kemal (2008). Statistical Machine Translation into a Morphologically Complex Language. In: Computational Linguistics and Intelligent Text Processing (CICLing 2008), pp. 376-387. doi:10.1007/978-3-540-78135-6_32