# Seeing text as data

Depending on our disciplinary and methodological perspective, and on whether we work with primary sources or with processed and visualized text-based data, we miss part of the picture. We often see "ready-for-consumption" data while ignoring the software, methods, and perspectives that produced it, or we focus on the texts and primary sources themselves without thinking too much about the process through which they are made digitally visible and accessible, transformed - and potentially distorted along the way - into machine-readable formats.

## Example: the Viral Texts project

![Image source: Ryan Cordell and David Smith, Viral Texts: Mapping Networks of Reprinting in 19th-Century Newspapers and Magazines (2017), http://viraltexts.org.](https://i.imgur.com/JoVrtUI.png)

Ryan Cordell and David Smith, Viral Texts: Mapping Networks of Reprinting in 19th-Century Newspapers and Magazines (2017), http://viraltexts.org

![Image source: Ryan Cordell and David Smith, Viral Texts: Mapping Networks of Reprinting in 19th-Century Newspapers and Magazines (2017), http://viraltexts.org.](https://i.imgur.com/0Db43Gz.png)

Ryan Cordell and David Smith, Viral Texts: Mapping Networks of Reprinting in 19th-Century Newspapers and Magazines (2017), http://viraltexts.org

![Image source: The Library of Congress, Chronicling America](https://i.imgur.com/InVHFTs.png)

## Example: the Quantifying Kissinger project

{%vimeo 100791450 %}

![](https://i.imgur.com/0XjvTgS.png)

https://vimeo.com/100791450

Credit: Micki Kaufman, [Quantifying Kissinger project](https://blog.quantifyingkissinger.com/)

## Example: Visualizing Christianity and Islam in the Mediterranean

![](https://i.imgur.com/VsyKXZE.png)

![](https://i.imgur.com/dRaYR2Q.png)

## Some problems with text (as) data

- we focus on end results or primary sources. We often tend to ignore the process that (computationally) produces certain knowledge sources, resources, and results
- texts and data are messy.
They are often problematic, fragmented, and inconsistent right from the start. They require significant personal labor and time investment in cleaning, accessibility, curation, and maintenance.

> "This is not always true for large-scale text analysis projects: HathiTrust, Project Gutenberg, and the Internet Archive have a plethora of works in plain-text format, but the quality of the optical character recognition (OCR) can be unreliable. No individual scholar can read and proofread each text, so the texts we use will have errors, from small typos to missing chapters, which may cause problems in the aggregate. Ideally, to address this issue, scholars could create a large, collaboratively edited collection of plain-text versions of literary works that would be open access. The Eighteenth-Century Collections Online Text Creation Partnership, the Early Modern OCR Project, and the Corpus of Historical American English have helpfully created repositories of texts, through both manual entry and automated OCR correction, but they still represent a comparatively small portion of all texts online."

Joanna Swafford, "Messy Data and Faulty Tools," in *Debates in the Digital Humanities 2016*, accessed October 13, 2020, https://dhdebates.gc.cuny.edu/read/untitled/section/7e0afe14-e266-4359-aa4a-5dff02735e8b#ch49.

- text datasets need to be interrogated. Instead of using computational analysis to reinforce patterns and preconceived ideas, we can use analysis and visualization to reveal omissions, exclusions, and assumptions in our data
- text analysis tools need to be approached critically. Some tools come with their own assumptions and conditions. Tools also have constraints, and they often require considerable time for testing, training, troubleshooting, and fine-tuning

## How to see texts as data?
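Seeing texts as data usually starts with cleaning. The "messy data" problem above becomes concrete as soon as one works with OCR output: hyphenated line breaks, irregular whitespace, and recurring character misreadings. The following is a minimal sketch of that kind of cleanup; the `OCR_FIXES` table and the sample sentence are invented for illustration, and a real project would derive corrections from proofread samples of its own corpus.

```python
import re

# Hypothetical substitution table of frequent OCR misreadings;
# a real project would build one from its own proofread samples.
OCR_FIXES = {"tbe": "the", "aud": "and", "witb": "with"}

def clean_ocr_text(raw: str) -> str:
    """Normalize whitespace and apply word-level OCR corrections."""
    # Join words hyphenated across line breaks (e.g. "news-\npaper").
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    # Collapse runs of whitespace left over from the scanned layout.
    text = re.sub(r"\s+", " ", text).strip()
    # Replace known misreadings, preserving initial capitalization.
    # (Words with attached punctuation are left untouched here.)
    words = []
    for word in text.split(" "):
        key = word.lower()
        if key in OCR_FIXES:
            fix = OCR_FIXES[key]
            word = fix.capitalize() if word[0].isupper() else fix
        words.append(word)
    return " ".join(words)

print(clean_ocr_text("Tbe news-\npaper   arrived aud was read witb care."))
# "The newspaper arrived and was read with care."
```

Even a sketch like this embeds editorial decisions (which misreadings to "fix," what counts as noise), which is exactly why the cleaning step deserves the same critical scrutiny as the analysis itself.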
### Distant reading

A term coined by Franco Moretti, referring to the use of computational methods to analyze literary texts.

> "I have chosen distant reading because the phrase underlines the macroscopic scale of recent literary-historical experiments, without narrowly specifying theoretical presuppositions, methods, or objects of analysis."

Ted Underwood, 2017.

### Text mining

The automatic extraction of information from different textual resources with the purpose of revealing patterns, dimensions, and relations through computational analysis.

### Topic modeling

Topic modeling is a method of analyzing corpora by identifying semantic classes of words ("topics") and using them as the basis for analysis. It uses algorithms to break down collections of documents into a range of topics that can be used to explore and analyze the text.