# SplitExp

A rule-based method for detecting sentence boundaries has been developed for Turkish news texts. The method covers direct quotations, which earlier work on this problem had not addressed, and resolves end-of-sentence punctuation ambiguities with a single regular expression.

GitHub repository of the dataset: https://github.com/ideateknoloji/SplitExp

## Dataset Details

This dataset was created from quotations that frequently occur in Turkish news texts. Several cases were evaluated, and a matcher was built from the samples that fit each case.

Cases of end-of-sentence markers:

| case | percentage |
|--------|-----------|
| quotation | 21.2% |
| numbers | 8.7% |
| abbreviations | 2.2% |
| extensions | 0.4% |

### Samples

A sample data instance from the dataset:

```
{"_id":"5bcdd1ac31878cb578d6a13f","text":"Merkez Bankası, Ziraat, Halkbank, Vakıfbank ve Kalkınma Bankası Hazine ve Maliye Bakanı Berat Albayrak’a bağlandı.","indexes":["113"],"types":["0"]}
```

### Fields

Each instance has the following fields:

| field | description |
|----------|------------|
| _id | id of the instance |
| text | text of the instance |
| indexes | not documented; in the sample above, `"113"` matches the zero-based character offset of the sentence-final period |
| types | not documented |

### Splits

Experiments were conducted on 9343 end-of-sentence markers obtained from 685 unambiguous documents by means of a marking tool developed for testing.

## Dataset Creation

### Curation Rationale

This dataset was created to support the development of a sentence boundary detection method for news texts that handles direct quotations, which had not been addressed before.

### Data Source

The source of this dataset is news articles from different newspapers in Turkey.

### Annotations

In developing the sentence boundary method, multiple cases were taken into account, including direct quotation sentences in newspaper articles, which had not been discussed before.
The method accounts for many conditions, from sentences ending with numbers to sentences with quotations nested within quotations.

## Additional Information

### Dataset Curators

Published by Can Ozbey and Ozge Dincsoy.

### Version

This dataset is taken from commit cd6e457 of the repository.

### Citation Information

Please cite the following paper if you find this dataset useful:

Özbey, C., and Dinçsoy, Ö. (2019). Sentence Boundary Detection in Turkish News with Regular Expressions. In 2019 IEEE 27th Signal Processing and Communications Applications Conference (SIU).

```
@inproceedings{inproceedings,
  title={Sentence Boundary Detection in Turkish News with Regular Expressions},
  author={Ozbey, Can and Dincsoy, Ozge},
  booktitle={2019 IEEE 27th Signal Processing and Communications Applications Conference (SIU)},
  year={2019},
  month={aug},
  isbn={978-1-7281-1904-5},
  doi={10.1109/SIU.2019.8806556}
}
```
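
### Usage Sketch

The record format shown under Samples can be read with a few lines of Python. The sketch below is illustrative only and is not part of the published method. It assumes that each entry of `indexes` is a zero-based character offset into `text`, stored as a string, and that `types` is a parallel list. Both assumptions are consistent with the sample record but are not stated by the dataset authors. The helper names `boundary_markers` and `split_at_markers` are hypothetical.

```python
import json

# The sample record from the Samples section, copied verbatim.
record_json = '{"_id":"5bcdd1ac31878cb578d6a13f","text":"Merkez Bankası, Ziraat, Halkbank, Vakıfbank ve Kalkınma Bankası Hazine ve Maliye Bakanı Berat Albayrak’a bağlandı.","indexes":["113"],"types":["0"]}'


def boundary_markers(record):
    """Return (offset, character, type) for each entry in `indexes`.

    Assumption: offsets are zero-based character positions stored as
    strings, and `types` lines up one-to-one with `indexes`.
    """
    text = record["text"]
    return [
        (int(idx), text[int(idx)], typ)
        for idx, typ in zip(record["indexes"], record["types"])
    ]


def split_at_markers(record):
    """Cut `text` after each marked offset (same assumptions as above)."""
    text = record["text"]
    sentences, start = [], 0
    for offset in sorted(int(idx) for idx in record["indexes"]):
        sentences.append(text[start:offset + 1].strip())
        start = offset + 1
    tail = text[start:].strip()
    if tail:  # keep any text after the last marker
        sentences.append(tail)
    return sentences


record = json.loads(record_json)
markers = boundary_markers(record)
sentences = split_at_markers(record)
print(markers)         # [(113, '.', '0')]
print(len(sentences))  # 1
```

For the sample record, the single marker points at the final period, so the text is returned as one sentence. The meaning of the `types` values is not documented here, so the sketch passes them through unchanged.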