A Turkish-English parallel corpus created for phrase-based statistical machine translation from English into Turkish. This dataset contains only the Turkish side of the parallel corpus. Large-scale parallel text resources for this language pair are scarce, so the corpus was created by augmenting the limited training data with content words and phrases.
The parallel data consists mainly of documents in international relations and legal documents from sources such as the Turkish Ministry of Foreign Affairs, EU, etc.
The Turkish data is represented with its morphological segmentation; instead of surface morphemes, lexical morphemes are used to abstract away word-internal details and conflate statistics for seemingly different suffixes.
The Turkish side of the data consists of 56609 sentences containing 13767 unique words. The data contains 102998 morphological tags (18237 of them unique), along with 18006 unique roots and 231 unique suffixes.
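As a minimal sketch (not the curators' code) of how such counts could be derived from the segmented corpus: the file name and the one-sentence-per-line, space-separated token format below are assumptions, with tokens starting with "+" taken to be suffixes/tags and all other tokens taken to be roots.

```python
# Assumed format: one segmented sentence per line, tokens separated by spaces.
roots, suffixes = set(), set()
tag_tokens = 0
sentences = 0

with open("turkish_segmented.txt", encoding="utf-8") as f:  # hypothetical file name
    for line in f:
        sentences += 1
        for tok in line.split():
            if tok.startswith("+"):   # suffix/tag token
                suffixes.add(tok)
                tag_tokens += 1
            else:                     # root token
                roots.add(tok)

print(sentences, "sentences,", tag_tokens, "tag tokens,",
      len(roots), "unique roots,", len(suffixes), "unique suffixes")
```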
Example:

English sentences are represented with "full" words, while Turkish sentences are represented with tokens for root words and bound morphemes/tags. For example, the four tokens [sağla] [+hn] [+ma] [+sh] would be used on the Turkish side for the surface word sağlanması.

Note that when the morphemes/tags (tokens starting with a +) are concatenated, we get the "word-based" version of the corpus, since surface words are recoverable from the concatenated representation.
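To make the concatenation step concrete, here is a minimal sketch assuming the space-separated token format shown above. The helper name is hypothetical, and regenerating the surface form (here sağlanması) from the concatenated lexical morphemes would additionally require morphographemic rules that this sketch omits.

```python
def to_word_based(tokens):
    """Glue each +morpheme/tag token onto the root before it,
    yielding one token per word as described above."""
    words = []
    for tok in tokens:
        if tok.startswith("+") and words:
            words[-1] += tok      # attach morpheme/tag to the current word
        else:
            words.append(tok)     # a root token starts a new word
    return words

# The four segmented tokens from the example collapse into one word unit:
print(to_word_based(["sağla", "+hn", "+ma", "+sh"]))
# -> ['sağla+hn+ma+sh']
```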
The dataset is motivated by the lack of large-scale parallel corpora for statistical machine translation into Turkish.
The words in the Turkish corpus are segmented into lexical morphemes, whereby differences in the surface representations of morphemes due to word-internal phenomena are abstracted away to improve statistics during alignment. Once the contextually salient morphological interpretation of a word is selected, the curators discard the morphosyntactic features, leaving behind the lexical morphemes making up the word.
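The following is an illustrative sketch of that feature-discarding step. The bracketed feature annotations in the input are an assumption modeled on common Turkish morphological-analyzer conventions, not the dataset's actual intermediate format; only the +morpheme tokens reflect the dataset's representation.

```python
import re

def strip_features(analysis: str) -> list[str]:
    """Drop the [...] feature annotations, keeping the root and the
    lexical +morphemes (hypothetical input format)."""
    return [re.sub(r"\[[^\]]*\]", "", tok) for tok in analysis.split()]

print(strip_features("sağla[Verb] +hn[Pass] +ma[Inf] +sh[P3sg]"))
# -> ['sağla', '+hn', '+ma', '+sh']
```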
From these morphologically segmented corpora, the sequence of roots of open-class content words (nouns, adjectives, adverbs, and verbs) is extracted for each sentence. For Turkish, this corresponds to removing all morphemes and the roots of closed-class words. These root sequences are used to augment the training corpus and bias content-word alignments, without introducing additional noise into the alignment process.
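A sketch of this extraction step, assuming each root token carries a (hypothetical) part-of-speech label from the morphological analysis stage:

```python
OPEN_CLASSES = {"Noun", "Adj", "Adv", "Verb"}

def content_roots(sentence):
    """sentence: list of (token, pos) pairs; pos is None for +morpheme tokens.
    Returns the sequence of open-class roots, dropping all morphemes and
    closed-class roots, as described above."""
    return [tok for tok, pos in sentence
            if not tok.startswith("+") and pos in OPEN_CLASSES]

example = [("sağla", "Verb"), ("+hn", None), ("+ma", None), ("+sh", None),
           ("ve", "Conj")]          # "ve" (and) is closed-class, so dropped
print(content_roots(example))       # -> ['sağla']
```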
Published by Kemal Oflazer. Ilknur Durgar El-Kahlout worked on many of the details and experiments. Seyma Mutlu implemented the word-repair code.
Please cite the following paper if you find this dataset useful:
Oflazer, Kemal (2008). Statistical Machine Translation into a Morphologically Complex Language. In: Computational Linguistics and Intelligent Text Processing (CICLing 2008), pp. 376-387. doi:10.1007/978-3-540-78135-6_32