# Constructing A Turkish Corpus for Paraphrase Identification and Semantic Similarity

The corpus consists of pairs of sentences with semantic similarity scores based on human judgments, allowing experimentation with both paraphrase identification (PI) and semantic similarity. The paper describes the data collection and scoring methodology, and reports on the corpus and the first PI experiments. The authors' approach to PI is novel in using 'lean knowledge' methods (i.e. not using manually created knowledge bases or processing tools based on them).

## Dataset Details

The creation strategy for the Turkish Paraphrase Corpus (TuPC) combines methodologies from the existing corpora MSRPC [30] and TPC [34]. Sentences were automatically extracted and matched from daily news sites. The candidate pairs were then hand-annotated by examining their context. Unlike MSRPC, but like TPC, candidate pairs were scored according to semantic similarity.

| Label | Description |
|-------|-------------|
| 0 | UNRELATED: On different topics. |
| 1 | SOMEWHAT RELATED: Not equivalent, but on the same topic. |
| 2 | CONTEXT: Not equivalent, but share some details. |
| 3 | RELATED: Roughly equivalent; some important information differs or is missing. |
| 4 | CLOSE: Mostly equivalent; some small details differ. |
| 5 | IDENTICAL: Completely equivalent; mean the same. |

Each pair also carries a binary paraphrase judgement from four annotators, written as (paraphrase votes, non-paraphrase votes). For example, (1,3) shows that only one annotator tagged the pair as a paraphrase, while the remaining three labeled it a non-paraphrase.

### Samples

The criteria for the binary judgement, based on the number of annotators' answers:

| Number of answers | Judgement |
|-------|-------------|
| (4,0); (3,1) | Positive (1) |
| (0,4); (1,3) | Negative (0) |
| (2,2) | Debatable |

Average semantic similarity scores are also provided. These range between 1.75 and 3.00 for debatable pairs, whereas positive pairs score higher than 3.00 and negative pairs score less than 1.75.
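The judgement criteria above can be sketched in a few lines of Python. This is only an illustration of the table, not the authors' tooling: the function names `binary_judgement` and `average_similarity` are hypothetical, and the sketch assumes exactly four annotators per pair, as in the criteria table.

```python
from typing import Sequence


def binary_judgement(paraphrase_votes: int, non_paraphrase_votes: int) -> str:
    """Map a (paraphrase, non-paraphrase) vote count to a judgement label.

    Hypothetical helper: assumes exactly four annotators per pair.
    """
    if paraphrase_votes < 0 or non_paraphrase_votes < 0:
        raise ValueError("vote counts must be non-negative")
    if paraphrase_votes + non_paraphrase_votes != 4:
        raise ValueError("expected votes from exactly four annotators")
    if paraphrase_votes >= 3:   # (4,0) or (3,1)
        return "Positive"
    if paraphrase_votes <= 1:   # (1,3) or (0,4)
        return "Negative"
    return "Debatable"          # (2,2)


def average_similarity(scores: Sequence[int]) -> float:
    """Average the 0-5 semantic similarity scores given by the annotators."""
    if not scores:
        raise ValueError("at least one score is required")
    if any(s < 0 or s > 5 for s in scores):
        raise ValueError("scores must lie between 0 and 5")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    for votes in [(4, 0), (3, 1), (2, 2), (1, 3), (0, 4)]:
        print(votes, binary_judgement(*votes))
```

Keeping the vote-based judgement and the averaged similarity score as separate functions mirrors the corpus itself, which reports both values for every pair.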
Additionally, the criteria defined in Table 2 can be interpreted on a scale between 0 and 1 as follows: (4,0): 1.0; (3,1): 0.75; (2,2): 0.50; (1,3): 0.25; and (0,4): 0.0.

### Fields

The table below presents three sample pairs from the data. Each pair is shown with the scores of four different annotators and the average similarity score. For instance, the debatable pair was scored (4,2,3,0) by the four annotators, giving an average similarity of 2.25.

| Judgement | Scores | Sentence 1 | Sentence 2 |
|----------|------------|-------|-------|
| Debatable | (4,2,3,0) Average (2.25) | İşadamı Ethem Sancak, Aydın Doğan ve Ertuğrul Özkök ile ilgili "Bazı şeyleri açıklarsam Türkiye'de barınamazlar" dedi. | 24 TV'de konuşan İşadamı Ethem Sancak ''Doğan Grubu, Aydın Doğan ve Ertuğrul Özkök'le ilgili çarpıcı açıklamalar yaptı. |
| Positive | (3,4,5,5) Average (4.25) | Çekilişte 10 rakamı isabet eden 15 kupondan 13'ünün Muğla'nın Yatağan ilçesindeki bayilere yatırıldığı ortaya çıktı | 10 rakamı isabet eden 15 kupondan 13'ü Muğla'nın Yatağan ilçesindeki bayi ya da bayilerden yatırıldı. |
| Negative | (1,3,0,0) Average (1.00) | Toplam konut satışları içerisinde ipotekli satış payının en yüksek olduğu il ise yüzde 52,9 ile Kars oldu. | Toplam konut satışları içinde ilk satışın payı yüzde 45,4 oldu. |

**TuPC data statistics:**

| Judgement | Agreement | Number of pairs | Total |
|----------|------------|-------|-------|
| Positive | (4,0), (3,1) | 376, 187 | 563 |
| Debatable | (2,2) | 154 | 154 |
| Negative | (1,3), (0,4) | 151, 134 | 285 |

### Splits

| Training | Test |
|----------|------------|
| 60% (500 pairs) | 40% (348 pairs) |

## Dataset Creation

### Curation Rationale

The corpus comprises pairs of sentences with semantic similarity scores based on human judgments, permitting experimentation with both PI and semantic similarity. The authors believe this is the first such corpus for Turkish.
### Data Source

A simple web crawler was implemented to extract the plain text and cluster it according to a pre-selected subtopic list. A list of URLs was gathered from a website that links to most daily Turkish newspapers. Then another list of URLs was extracted for each newspaper. Commonly used heading tags were determined by looking at the categories on each website. For example, the latest news may appear under a "last minute" heading on one site and under a differently named heading on another. A subset of popular headings was compiled to pull news on each topic. Some examples:

```
{ [haber, gundem, guncel, sondakika, haberler, turkiye, haberhatti,...,] }
```

### Quality

| Scoring | Fleiss' Kappa |
|----------|------------|
| Degree of Semantic Equivalency (0-5) | 0.17 |
| Binary Judgment (1,0,Debatable) | 0.42 |

## Additional Information

### Dataset Curators

Published by Asli Eyecioglu and Bill Keller.

### Citation Information

Please cite the following paper if you found this dataset useful:

Eyecioglu, Asli & Keller, Bill. (2016). Constructing A Turkish Corpus for Paraphrase Identification and Semantic Similarity. Lecture Notes in Computer Science. Computational Linguistics and Intelligent Text Processing. Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics. 562-574.

```
@article{eyecioglu2016turkish,
  author = {Eyecioglu, Asli and Keller, Bill},
  year = {2016},
  month = {01},
  pages = {562-574},
  title = {Constructing A Turkish Corpus for Paraphrase Identification and Semantic Similarity},
  volume = {Computational Linguistics and Intelligent Text Processing. Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics.},
  journal = {Lecture Notes in Computer Science}
}
```