Christian Thomas

@cthomas

Joined on Dec 5, 2018

  • The Deutsches Textarchiv ("German Text Archive", DTA), hosted by the CLARIN service center at the Berlin-Brandenburg Academy of Sciences and Humanities, is the largest single corpus of historical New High German covering the 16^th^ through the early 20^th^ century, comprising more than 350 million tokens in 1.34 million digitized pages. Focussing mostly on (digitized) printed material, the DTA also includes a growing number of hand-written documents. Specialty subcorpora include historical newspapers and other periodicals. The DTA as a whole covers a rich variety of text in the genres belles-lettres, use-literature, and academic writing.

Fig. 1: Deutsches Textarchiv / German Text Archive landing page, http://www.deutschestextarchiv.de/.

The DTA is composed of the so-called DTA-Kernkorpus (DTAK, "DTA Core Corpus") with ca. 1500 first editions from the 16^th^ through the 19^th^ century. In this time frame, the Core Corpus is balanced with respect to text genres and token counts. Additionally, the DTA-Erweiterungen (DTAE, "DTA Extensions") module contains specialty corpora and individual texts which have been curated in the context of CLARIN-D and other projects.

The full-text sources provided by digitization projects and other discipline-specific initiatives have been (manually or semi-automatically) converted to a TEI-compatible XML format conforming to the DTA-Basisformat (DTABf, "DTA Base Format") guidelines, including extensive metadata on the original sources and data preparation. OCR texts in the DTA Core Corpus – as well as numerous additional text resources – have been manually corrected. A continuous quality assurance process is made possible by the collaborative web-based platform DTAQ, with ca. 2000 currently registered users.

All DTA corpora are prepared for user consumption by automated computational linguistic analysis methods, including not only PoS-tagging and lemmatization, but also – among others – an orthographic normalization of historical spelling variants, allowing users to formulate queries in modern orthography. Each individual document – and the corpus as a whole – is available for download in a variety of XML formats (TEI P5 with or without TEI:att.linguistic attributes, TCF, and HTML) and as plain text. Metadata are available as a TEI-header, CMDI, or Dublin Core, and an API is provided for automated harvesting. Additional tools are provided for statistical analysis of the corpora, including time series plots and diachronic collocation analysis with the help of the software tool DiaCollo.

Fig. 2: The Deutsches Textarchiv / German Text Archive: an integrated research platform; Illustration from: Geyken et al. 2018, p. 221
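
To illustrate how the linguistic annotation supports queries in modern orthography, the following is a minimal sketch that reads tokens from a TEI P5 fragment carrying att.linguistic attributes (`lemma`, `pos`, `norm`). The sample fragment, the helper names `read_tokens` and `find_by_modern_form`, and the exact attribute layout are assumptions made for illustration; actual DTA downloads may be structured differently.

```python
import xml.etree.ElementTree as ET

# TEI P5 namespace; <w> elements live in it, their attributes do not.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample fragment, NOT taken from an actual DTA document.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <w lemma="Teil" pos="NN" norm="Teil">Theil</w>
    <w lemma="sein" pos="VAINF" norm="sein">seyn</w>
  </p></body></text>
</TEI>"""


def read_tokens(tei_xml):
    """Return one dict per <w> element with surface form and annotations."""
    root = ET.fromstring(tei_xml)
    tokens = []
    for w in root.iterfind(".//tei:w", TEI_NS):
        surface = "".join(w.itertext()).strip()
        tokens.append({
            "surface": surface,
            # Fall back to the surface form if no normalization is given.
            "norm": w.get("norm", surface),
            "lemma": w.get("lemma"),
            "pos": w.get("pos"),
        })
    return tokens


def find_by_modern_form(tokens, modern):
    """Return historical surface forms whose normalized form matches `modern`."""
    wanted = modern.lower()
    return [t["surface"] for t in tokens if t["norm"].lower() == wanted]


if __name__ == "__main__":
    toks = read_tokens(SAMPLE)
    print(find_by_modern_form(toks, "Teil"))
```

Matching on the normalized form rather than the surface form is what lets a modern spelling retrieve a historical variant; the same lookup could be done on `lemma` to collapse inflected forms as well.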
  • DTA Description (en) The Deutsches Textarchiv ("German Text Archive", DTA) is the largest single corpus of historical New High German covering the 16^th^ through the early 20^th^ century, comprising more than 350 million tokens in 1.34 million digitized pages. Focussing mostly on (digitized) printed material, the DTA also includes a growing number of hand-written documents. Specialty subcorpora include historical newspapers and other periodicals. The DTA as a whole covers a rich variety of text in the genres belles-lettres, use-literature, and academic writing. Fig. 1: Deutsches Textarchiv / German Text Archive landing page, http://www.deutschestextarchiv.de/. The DTA is composed of the so-called DTA-Kernkorpus (DTAK, "DTA Core Corpus") with ca. 1500 first editions from the 16^th^ through the 19^th^ century. In this time frame, the Core Corpus is balanced with respect to text genres and token counts. Additionally, the DTA-Erweiterungen (DTAE, "DTA Extensions") module contains specialty corpora and individual texts which have been curated in the context of CLARIN-D and other projects. The full-text sources provided by digitization projects and other discipline-specific initiatives have been (manually or semi-automatically) converted to a TEI-compatible XML format conforming to the DTA-Basisformat (DTABf, "DTA Base Format") guidelines, including extensive metadata on the original sources and data preparation. OCR texts in the DTA Core Corpus – as well as numerous additional text resources – have been manually corrected. A continuous quality assurance process is made possible by the collaborative web-based platform DTAQ, with ca. 2000 currently registered users. All DTA corpora are prepared for user consumption by automated computational linguistic analysis methods, including not only PoS-tagging and lemmatization, but also – among others – an orthographic normalization of historical spelling variants, allowing users to formulate queries in modern orthography. 
Each individual document – and the corpus as a whole – is available for download in a variety of XML formats (TEI P5 with or without TEI:att.linguistic attributes, TCF, and HTML) and as plain text. Metadata are available as a TEI-header, CMDI, or Dublin Core, and an API is provided for automated harvesting. Additional tools are provided for statistical analysis of the corpora, including time series plots and diachronic collocation analysis with the help of the software tool DiaCollo. Fig. 2: The Deutsches Textarchiv / German Text Archive: an integrated research platform; Illustration from: Geyken et al. 2018, p. 221
  • Info/blog post for CLARIN ERIC: A new print publication is based on the electronic edition in the German Text Archive, using tools and supplementary data from within the CLARIN infrastructure. Book: Alexander von Humboldt, Henriette Kohlrausch: Die Kosmos-Vorlesung an der Berliner Sing-Akademie. Edited by Christian Kassung and Christian Thomas. Berlin: Insel Verlag, 2019. (insel taschenbuch 4719, ISBN 978-3-458-36419-1) Publisher's landing page: https://www.suhrkamp.de/buecher/die_kosmos-vortraege-alexander_von_humboldt_36419.html. Cover insel taschenbuch 4719, © Insel Verlag Berlin. Background: Alexander von Humboldt's legendary 'Kosmos-Lectures'