
# Klexikon

## Summary

* [Introduction](#introduction)
* [Dataset Structure](#dataset_structure)
* [Reference](#reference)
* [License](#license)
* [Citation](#citation)

## Introduction

The Klexikon dataset is a German resource of document-aligned texts between German Wikipedia and the children's lexicon "Klexikon". The dataset was created for the purpose of joint text simplification and summarization, and contains almost 2900 aligned article pairs. Notably, the children's articles use simpler language than the original Wikipedia articles; this is in addition to a clear length discrepancy between the source (Wikipedia) and target (Klexikon) domains.

## Dataset Structure

### Data Instances

Each data point contains the Wikipedia text (`wiki_sentences`) and the Klexikon text (`klexikon_sentences`). Sentences are separated by newlines for both texts, and section headings are indicated by leading `==` (or `===` for subheadings, `====` for sub-subheadings, etc.). Further, each data point includes the `wiki_url` and `klexikon_url`, pointing to the respective source texts. Note that the original articles were extracted in April 2021, so re-crawling the texts yourself will likely change some content. Lastly, each data point includes a unique identifier `u_id` as well as the page title `title` of the Klexikon page.
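The heading convention described above can be handled with a small helper. The following is a minimal sketch, not part of the dataset tooling: the function name `split_into_sections` and the regular expression are assumptions made for illustration, and the heading line in the usage example is made up. The pattern accepts headings with or without trailing `=` characters, since only the leading markers are specified above.

```python
import re

# Assumption: a heading line starts with two or more "=" characters;
# trailing "=" characters are optional and stripped from the title.
HEADING_RE = re.compile(r"^(={2,})\s*(.*?)\s*=*$")


def split_into_sections(lines):
    """Group sentences under the most recent heading.

    Returns a list of (level, heading, sentences) tuples, where level 1
    corresponds to "==", level 2 to "===", and so on. Sentences that
    appear before the first heading are stored under level 0 with an
    empty heading.
    """
    sections = [(0, "", [])]
    for line in lines:
        match = HEADING_RE.match(line)
        if match:
            level = len(match.group(1)) - 1
            sections.append((level, match.group(2), []))
        else:
            sections[-1][2].append(line)
    return sections


# Toy input: the heading line is hypothetical and only illustrates the format.
example_lines = [
    "ABBA war eine Musikgruppe aus Schweden.",
    "== Beispiel ==",
    "Ihre Musikrichtung war die Popmusik.",
]
sections = split_into_sections(example_lines)
```

Here `sections` holds one entry for the lead text and one for the hypothetical `Beispiel` section, each with its own list of sentences.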

Sample (abridged texts for clarity):

```
{
  "u_id": 0,
  "title": "ABBA",
  "wiki_url": "https://de.wikipedia.org/wiki/ABBA",
  "klexikon_url": "https://klexikon.zum.de/wiki/ABBA",
  "wiki_sentences": [
    "ABBA ist eine schwedische Popgruppe, die aus den damaligen Paaren Agnetha Fältskog und Björn Ulvaeus sowie Benny Andersson und Anni-Frid Lyngstad besteht und sich 1972 in Stockholm formierte.",
    "Sie gehört mit rund 400 Millionen verkauften Tonträgern zu den erfolgreichsten Bands der Musikgeschichte.",
    "Bis in die 1970er Jahre hatte es keine andere Band aus Schweden oder Skandinavien gegeben, der vergleichbare Erfolge gelungen waren.",
    "Trotz amerikanischer und britischer Dominanz im Musikgeschäft gelang der Band ein internationaler Durchbruch.",
    "Sie hat die Geschichte der Popmusik mitgeprägt.",
    "Zu ihren bekanntesten Songs zählen Mamma Mia, Dancing Queen und The Winner Takes It All.",
    "1982 beendeten die Gruppenmitglieder aufgrund privater Differenzen ihre musikalische Zusammenarbeit.",
    "Seit 2016 arbeiten die vier Musiker wieder zusammen an neuer Musik, die 2021 erscheinen soll."
  ],
  "klexikon_sentences": [
    "ABBA war eine Musikgruppe aus Schweden.",
    "Ihre Musikrichtung war die Popmusik.",
    "Der Name entstand aus den Anfangsbuchstaben der Vornamen der Mitglieder, Agnetha, Björn, Benny und Anni-Frid.",
    "Benny Andersson und Björn Ulvaeus, die beiden Männer, schrieben die Lieder und spielten Klavier und Gitarre.",
    "Anni-Frid Lyngstad und Agnetha Fältskog sangen."
  ]
}
```

### Data Fields

* `u_id` (int): A unique identifier for each document pair in the dataset. 0-2349 are reserved for training data, 2350-2623 for testing, and 2624-2897 for validation.
* `title` (str): Title of the Klexikon page for this sample.
* `wiki_url` (str): URL of the associated Wikipedia article. Note that this mapping is non-trivial: some pages are disambiguated, in which case the Wikipedia title is not exactly the same as the Klexikon one.
* `klexikon_url` (str): URL of the Klexikon article.
* `wiki_sentences` (List[str]): List of sentences of the Wikipedia article. The document is provided pre-split, using spaCy's sentence splitting (model: `de_core_news_md`). Additionally, please note that page contents outside of `<p>` tags are not included, which excludes lists, captions and images.
* `klexikon_sentences` (List[str]): List of sentences of the Klexikon article. The same processing is applied as for the Wikipedia texts.

### Data Splits

The dataset comes with a stratified split, based on the length of the respective Wikipedia article/Klexikon article pair (measured in number of sentences). Each pair is placed in a coordinate system in which the x-axis represents the length of the Wikipedia article and the y-axis the length of the Klexikon article. The coordinate system is segmented into rectangles of shape (100, 10), and an 80/10/10 split for training/validation/test is randomly sampled from each rectangle to ensure stratification. For rectangles with fewer than 10 entries, all samples are put into training. The final splits have the following sizes:

* 2350 samples for training
* 274 samples for validation
* 274 samples for testing

## Reference

We would like to acknowledge Aumiller, Dennis et al. for creating and maintaining the Klexikon dataset as a valuable resource for the natural language processing and machine learning research community. For more information about the Klexikon dataset and its creators, please visit [the Klexikon GitHub repository](https://github.com/dennlinger/klexikon).

## License

The dataset has been released under the Creative Commons Attribution-ShareAlike 4.0 International License.

## Citation

```
@inproceedings{aumiller-gertz-2022-klexikon,
    title = "Klexikon: A {G}erman Dataset for Joint Summarization and Simplification",
    author = "Aumiller, Dennis and Gertz, Michael",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.288",
    pages = "2693--2701"
}
```
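
As a closing illustration, the stratified split described under Data Splits can be sketched roughly as follows. This is not the authors' code: the function name `stratified_split`, the rounding of the per-rectangle sample counts, the alignment of rectangles at the origin, and the fixed seed are all assumptions made for this sketch, so it will not reproduce the published split.

```python
import random
from collections import defaultdict


def stratified_split(lengths, cell=(100, 10), val_frac=0.1, test_frac=0.1,
                     min_cell_size=10, seed=0):
    """Split article pairs by length rectangle.

    lengths: list of (n_wiki_sentences, n_klexikon_sentences), one per pair.
    Returns a dict mapping split name to a list of indices into `lengths`.
    """
    rng = random.Random(seed)

    # Assign every pair to its rectangle in the (wiki, klexikon) length grid.
    cells = defaultdict(list)
    for idx, (n_wiki, n_klex) in enumerate(lengths):
        cells[(n_wiki // cell[0], n_klex // cell[1])].append(idx)

    splits = {"train": [], "validation": [], "test": []}
    for key in sorted(cells):
        members = cells[key]
        # Sparse rectangles go entirely into training.
        if len(members) < min_cell_size:
            splits["train"].extend(members)
            continue
        rng.shuffle(members)
        # Assumption: per-rectangle counts are rounded to the nearest integer.
        n_val = round(len(members) * val_frac)
        n_test = round(len(members) * test_frac)
        splits["validation"].extend(members[:n_val])
        splits["test"].extend(members[n_val:n_val + n_test])
        splits["train"].extend(members[n_val + n_test:])
    return splits
```

Sampling inside each rectangle keeps the joint length distribution of the three splits close to that of the full dataset, which a single global shuffle would not guarantee for rare, very long articles.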