SplitExp

SplitExp is a rule-based method for detecting sentence boundaries in Turkish news texts. It covers direct quotations, which earlier work on the problem did not address, and resolves end-of-sentence punctuation ambiguities with a single regular expression.

GitHub repository of the dataset: https://github.com/ideateknoloji/SplitExp

Dataset Details

This dataset was built from quotations that frequently occur in Turkish news texts. Several cases were evaluated, and a matcher was created from the samples that fit each case.
Cases of end-of-sentence markers:

case            percentage
quotation       21.2%
numbers         8.7%
abbreviations   2.2%
extensions      0.4%
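
To illustrate why cases such as abbreviations and numbers are ambiguous, the sketch below compares a naive splitting rule with a slightly stricter one. It is not the regular expression used by SplitExp: the patterns, the abbreviation list, the sample sentence, and the function names are all assumptions made for this example, and quotations are not handled at all.

```python
import re

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
NAIVE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

# Illustrative abbreviation list (an assumption, not taken from the dataset).
ABBREVIATIONS = ("Dr", "Prof")

# One fixed-width negative lookbehind per abbreviation, e.g. (?<!Dr\.)
_not_after_abbrev = "".join(rf"(?<!{re.escape(a)}\.)" for a in ABBREVIATIONS)

# Stricter rule: additionally require that the next word starts with an
# uppercase letter, so ordinals such as "3. kata" are not split.
REFINED_BOUNDARY = re.compile(
    r"(?<=[.!?])" + _not_after_abbrev + r"\s+(?=[A-ZÇĞİÖŞÜ])"
)


def naive_split(text):
    """Split on every punctuation mark that is followed by whitespace."""
    return NAIVE_BOUNDARY.split(text)


def refined_split(text):
    """Split only where the stricter (still simplistic) rule matches."""
    return REFINED_BOUNDARY.split(text)


# Hypothetical sample: an abbreviation, an ordinal number, two sentences.
SAMPLE = "Dr. Ayşe 3. kata çıktı. Toplantı başladı."

print(naive_split(SAMPLE))
print(refined_split(SAMPLE))
```

The naive rule breaks the sample after "Dr." and after "3.", while the stricter rule keeps the two sentences whole. The stricter rule still fails when an ordinal is followed by a capitalized word, which is one reason the ambiguous cases need dedicated handling.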

Samples

A sample data instance from the dataset is shown below.

Example:

{"_id":"5bcdd1ac31878cb578d6a13f","text":"Merkez Bankası, Ziraat, Halkbank, Vakıfbank ve Kalkınma Bankası Hazine ve Maliye Bakanı Berat Albayrak’a bağlandı.","indexes":["113"],"types":["0"]}

Fields

The fields of each instance are listed below.

field     description
_id       id of the instance
text      text of the instance
indexes   ?
types     ?

Splits

Experiments were conducted on 9343 end-of-sentence markers obtained from 685 unambiguous documents by means of a marking tool developed for testing.

Dataset Creation

Curation Rationale

This dataset was created to support the development of a sentence boundary detection method for news texts that also covers direct quotations, which had not been addressed before.

Data Source

The source of this dataset is news articles from different newspapers in Turkey.

Annotations

In developing the sentence boundary method, direct quotation sentences in newspaper articles, which had not been addressed before, were included and multiple cases were taken into account. The method covers many conditions, from sentences ending with numbers to sentences containing quotations within quotations.

Additional Information

Dataset Curators

Published by Can Ozbey, Ozge Dincsoy.

Version

This dataset is taken from commit cd6e457 of the repository.

Citation Information

Please cite the following paper if you find this dataset useful:

Özbey, C., and Dinçsoy, Ö. (2019). Sentence Boundary Detection in Turkish News with Regular Expressions. In 2019 IEEE 27th Signal Processing and Communications Applications Conference (SIU).

@inproceedings{inproceedings,
    title={Sentence Boundary Detection in Turkish News with Regular Expressions},
    author={Can Ozbey and Ozge Dincsoy},
    booktitle={2019 IEEE 27th Signal Processing and Communications Applications Conference (SIU)},
    year={2019},
    month={aug},
    isbn={978-1-7281-1904-5},
    doi={10.1109/SIU.2019.8806556}
}