Deutsches Textarchiv / German Text Archive: A short introduction
===

The [*Deutsches Textarchiv* ("German Text Archive", DTA)](http://www.deutschestextarchiv.de/), hosted by the [CLARIN service center](https://clarin.bbaw.de/en/) at the [Berlin-Brandenburg Academy of Sciences and Humanities](http://www.bbaw.de/), is the largest single corpus of historical New High German, covering the 16^th^ through the early 20^th^ century and comprising more than 350 million tokens in 1.34 million digitized pages. While it focuses mostly on (digitized) printed material, the DTA also includes a growing number of handwritten documents. Specialty subcorpora include historical newspapers and other periodicals. The DTA as a whole covers a rich variety of texts in the genres belles-lettres, functional literature (*Gebrauchsliteratur*), and academic writing.

![DTA landing page](https://i.imgur.com/iU6JraW.png)
Fig. 1: *Deutsches Textarchiv* / German Text Archive landing page, [http://www.deutschestextarchiv.de/](http://www.deutschestextarchiv.de/).

The core of the DTA is the so-called [*DTA-Kernkorpus* (DTAK, "DTA Core Corpus")](http://www.deutschestextarchiv.de/doku/ueberblick#dta-kernkorpus), with ca.&nbsp;1500 first editions from the 16^th^ through the 19^th^ century. Within this time frame, the Core Corpus is balanced with respect to text genres and token counts. In addition, the [*DTA-Erweiterungen* (DTAE, "DTA Extensions")](http://www.deutschestextarchiv.de/dtae) module contains specialty corpora and individual texts that have been curated in the context of [CLARIN-D](https://www.clarin-d.net/en/) and other projects. The full-text sources provided by digitization projects and other discipline-specific initiatives have been converted (manually or semi-automatically) to a [TEI](https://tei-c.org)-compatible XML format conforming to the [*DTA-Basisformat* (DTABf, "DTA Base Format")](http://www.deutschestextarchiv.de/doku/basisformat/) guidelines, including extensive metadata on the original sources and data preparation.
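
Because the texts are encoded as TEI-compatible XML, they can be processed with ordinary XML tooling. The following is a minimal sketch in Python, not an official DTA workflow: the sample document `SAMPLE_TEI` and the helper `summarize_tei` are hypothetical and only illustrate the general TEI layout (a `teiHeader` with metadata followed by a `text` element). Real DTABf files carry far richer metadata and markup than this shortened fragment.

```python
import xml.etree.ElementTree as ET

# The TEI P5 namespace; all TEI elements live in it.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, heavily shortened TEI document for illustration only.
# It is NOT taken from the DTA.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title type="main">Beispieltitel</title>
      </titleStmt>
      <publicationStmt><p>Illustrative sample only.</p></publicationStmt>
      <sourceDesc><p>No real source.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div>
        <p>Erster Theil.</p>
        <p>Zweiter Theil.</p>
      </div>
    </body>
  </text>
</TEI>"""


def summarize_tei(xml_text):
    """Return (title, paragraphs) for a TEI document given as a string."""
    root = ET.fromstring(xml_text)

    # Metadata: first title inside the header's title statement.
    title = root.findtext(
        "tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title",
        default="",
        namespaces=TEI_NS,
    )

    # Content: paragraphs below <text> only, so that <p> elements in the
    # header (e.g. in publicationStmt) are not counted as body text.
    text_element = root.find("tei:text", TEI_NS)
    paragraphs = []
    if text_element is not None:
        for p in text_element.iterfind(".//tei:p", TEI_NS):
            paragraphs.append("".join(p.itertext()).strip())

    return title, paragraphs


title, paragraphs = summarize_tei(SAMPLE_TEI)
print(title)
print(paragraphs)
```
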
OCR texts in the DTA Core Corpus, as well as numerous additional text resources, have been manually corrected. A continuous quality assurance process is made possible by the collaborative web-based platform [DTAQ](http://www.deutschestextarchiv.de/dtaq/about), which currently has ca.&nbsp;2000 registered users.

All DTA corpora are prepared for use by automated computational-linguistic analysis, including PoS tagging, lemmatization, and, among other steps, an orthographic normalization of historical spelling variants that allows users to formulate queries in modern orthography. Each individual document, and the corpus as a whole, is available for download in a variety of formats: TEI P5 XML (with or without [TEI:att.linguistic](https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.linguistic.html) attributes), [TCF](https://weblicht.sfs.uni-tuebingen.de/weblichtwiki/index.php/The_TCF_Format), HTML, and plain text. Metadata are available as a TEI header, in [CMDI](https://www.clarin.eu/content/component-metadata), or in Dublin Core, and an [OAI-PMH API](https://clarin.bbaw.de/oai-dta/?verb=Identify) is provided for automated harvesting. Additional tools are provided for statistical analysis of the corpora, including [time series plots](http://www.deutschestextarchiv.de/search/plot/) and diachronic collocation analysis with the software tool [DiaCollo](https://clarin-d.de/en/diacollo-en).

![](http://kaskade.dwds.de/~haaf/img/ForschungsplattformDTA.png)
Fig. 2: The *Deutsches Textarchiv* / German Text Archive: an integrated research platform. Illustration from [Geyken et al. 2018](https://doi.org/10.1515/9783110538663-011), p. 221.

The DTA is fully integrated into the [CLARIN](https://www.clarin.eu/) infrastructure (e.g.
via [VLO](https://www.clarin.eu/vlo), [FCS](https://www.clarin.eu/content/content-search), [LRS](https://www.clarin.eu/content/language-resource-switchboard), and [WebLicht](https://weblicht.sfs.uni-tuebingen.de/weblichtwiki/index.php/Main_Page)). Long-term availability, persistent addressability, and versioning of the data are provided by the [CLARIN Repository](https://clarin.bbaw.de/en/) of the *Zentrum Sprache* at the BBAW. Furthermore, the DTA serves as the basis for consultation and instruction in the context of CLARIN-D with respect to the associated tools, workflows, and procedures.

---

## Bibliography

<!-- Selected for CLARIN ERIC, hence English only and with a strong CLARIN focus. -->

* Boenig, Matthias, and Susanne Haaf (2019): "Aggregating resources in CLARIN: FAIR corpora of historical newspapers in the German Text Archive." In: *Proceedings of CLARIN Annual Conference 2019*, Kiril Simov and Maria Eskevich (eds.), Leipzig: CLARIN, 124–128. PDF available at: https://office.clarin.eu/v/CE-2019-1512_CLARIN2019_ConferenceProceedings.pdf.
* Fischer, Frank, Susanne Haaf, and Marius Hug (2019): "The best of three worlds: Mutual enhancement of corpora of dramatic texts (GerDraCor, German Text Archive, TextGrid Repository)." In: *Proceedings of CLARIN Annual Conference 2019*, Kiril Simov and Maria Eskevich (eds.), Leipzig: CLARIN, 97–103. PDF available at: https://office.clarin.eu/v/CE-2019-1512_CLARIN2019_ConferenceProceedings.pdf.
* Jurish, Bryan, and Maret Nieländer (2019): "Using DiaCollo for historical research." In: *Proceedings of CLARIN Annual Conference 2019*, Kiril Simov and Maria Eskevich (eds.), Leipzig: CLARIN, 40–43. PDF available at: https://office.clarin.eu/v/CE-2019-1512_CLARIN2019_ConferenceProceedings.pdf.
* Geyken, Alexander, Matthias Boenig, Susanne Haaf, Bryan Jurish, Christian Thomas, and Frank Wiegand (2018): "Das Deutsche Textarchiv als Forschungsplattform für historische Daten in CLARIN."
  In: Henning Lobin, Roman Schneider, and Andreas Witt (eds.): *Digitale Infrastrukturen für die germanistische Forschung* (= *Germanistische Sprachwissenschaft um 2020*, vol. 6). Berlin/Boston 2018, 219–248. DOI: [https://doi.org/10.1515/9783110538663-011](https://doi.org/10.1515/9783110538663-011).
* Bański, Piotr, Susanne Haaf, and Martin Mueller (2018): "Lightweight Grammatical Annotation in the TEI: New Perspectives." In: *Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)*, May 7–12, 2018, Miyazaki, Japan, 1795–1802. PDF available at: http://www.lrec-conf.org/proceedings/lrec2018/pdf/422.pdf.
* Haaf, Susanne, and Christian Thomas (2017): "Enabling the Encoding of Manuscripts within the DTABf. Extension and Modularization of the Format." In: *Journal of the Text Encoding Initiative* (jTEI), 10: 2015 Conference Issue. DOI: [https://doi.org/10.4000/jtei.1650](https://doi.org/10.4000/jtei.1650).
* Geyken, Alexander, and Thomas Gloning (2015): "A living text archive of 15th–19th-century German. Corpus strategies, technology, organization." In: Jost Gippert and Ralf Gehrke (eds.): *Historical Corpora. Challenges and Perspectives*. Tübingen 2015, 165–180. PDF available at: http://www.deutschestextarchiv.de/files/Geyken-Gloning-2015_A-living-text-archive_CLIP-5_2018-07-05.pdf.
* Jurish, Bryan (2015): "DiaCollo: On the trail of diachronic collocations." In: Koenraad De Smedt (ed.): *Proceedings of the CLARIN Annual Conference 2015*, Wrocław, Poland, October 14–17, 2015, 28–31. PDF available at: http://www.deutschestextarchiv.de/files/jurish2015diacollo-clarin.pdf.
* Thomas, Christian, and Frank Wiegand (2015): "Making great work even better. Appraisal and digital curation of widely dispersed electronic textual resources (c. 15th–19th centuries) in CLARIN-D." In: Jost Gippert and Ralf Gehrke (eds.): *Historical Corpora. Challenges and Perspectives*. Tübingen 2015, 181–196.
PDF available at: http://www.deutschestextarchiv.de/files/Thomas-Wiegand-2015_Making-Great-Work-Even-Better_CLIP-5_2018-07-05.pdf.