
Seeing text as data


Example: the Viral Texts project

Image source: Ryan Cordell and David Smith, Viral Texts: Mapping Networks of Reprinting in 19th-Century Newspapers and Magazines (2017), http://viraltexts.org.


Image source: The Library of Congress, Chronicling America


Example: Quantifying Kissinger project


Video: https://vimeo.com/100791450
Credit: Micki Kaufman, Quantifying Kissinger project


Example: Visualizing Christianity and Islam in the Mediterranean



Some problems with text (as) data

  • we focus on end results or primary sources. We tend to ignore the (computational) processes that produce knowledge sources, resources, and results.

  • texts and data are messy. They are often problematic, fragmented, and inconsistent right from the start. They require significant personal labor and time for cleaning, accessibility, curation, and maintenance.

"No individual scholar can read and proofread each text, so the texts we use will have errors, from small typos to missing chapters, which may cause problems in the aggregate. Ideally, to address this issue, scholars could create a large, collaboratively edited collection of plain-text versions of literary works that would be open access."

Swafford, Joanna. "Messy Data and Faulty Tools." In Debates in the Digital Humanities 2016, chapter 49. Accessed October 13, 2020. https://dhdebates.gc.cuny.edu/read/untitled/section/7e0afe14-e266-4359-aa4a-5dff02735e8b#ch49.


  • text datasets need to be interrogated. Instead of using computational analysis to reinforce patterns and preconceived ideas, we can use analysis and visualization to reveal omissions, exclusions, and assumptions in our data.

  • text analysis tools need to be approached critically. Tools come with their own assumptions, conditions, and constraints, and they often require considerable time for testing, training, troubleshooting, and fine-tuning.

How to see texts as data?


Distant reading

A term (coined by Franco Moretti) referring to the use of computational methods to analyze large collections of literary texts.

"I have chosen distant reading because the phrase underlines the macroscopic scale of recent literary-historical experiments, without narrowly specifying theoretical presuppositions, methods, or objects of analysis."

Ted Underwood, 2017.


Text mining

The automatic extraction of information from different textual resources with the purpose of revealing patterns, dimensions and relations through computational analysis.
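As a rough illustration of what "extracting information" can mean at its simplest, the sketch below counts how often terms occur in a tiny corpus and in how many documents each term appears. The two sample sentences, the stopword list, and the function names are all invented for this example; a real project would work with a much larger corpus and a more careful tokenizer.

```python
import re
from collections import Counter

# Hypothetical mini-corpus; any list of plain-text strings would work.
documents = [
    "The treaty was signed in Paris. The treaty ended the war.",
    "Newspapers in Paris reprinted the story about the war.",
]

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "in", "was", "about", "a", "of", "and"}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequencies(docs):
    """Count how often each non-stopword term occurs across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(t for t in tokenize(doc) if t not in STOPWORDS)
    return counts


def document_frequencies(docs):
    """Count in how many documents each non-stopword term appears."""
    counts = Counter()
    for doc in docs:
        # A set keeps each term at most once per document.
        counts.update({t for t in tokenize(doc) if t not in STOPWORDS})
    return counts


tf = term_frequencies(documents)
df = document_frequencies(documents)

# Terms that recur across documents hint at shared subject matter.
shared_terms = sorted(term for term, n in df.items() if n == len(documents))
print(tf.most_common(3))
print(shared_terms)
```

Even counts this simple depend on choices (what counts as a word, which words are discarded), which is one reason the tools need to be approached critically.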


Topic modeling

Topic modeling is a method of analyzing corpora by identifying semantic classes of words ("topics") and using them as the basis for analysis. It uses algorithms to break down collections of documents into a range of topics that can be used to explore and analyze the text.
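To make the idea more concrete, here is a minimal sketch of one common topic modeling algorithm, latent Dirichlet allocation (LDA), fitted with collapsed Gibbs sampling in plain Python. The source does not name a specific algorithm, so the choice of LDA is an assumption, and the toy corpus, the function names (`lda_gibbs`, `top_words`), and the parameter values are hypothetical. In practice one would normally use an established library rather than this hand-written version.

```python
import random
from collections import defaultdict


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit a small LDA model with collapsed Gibbs sampling.

    docs is a list of token lists. Returns (topic_word, doc_topic), the
    per-topic word counts and the per-document topic counts.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [defaultdict(int) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start by giving every token a random topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized probability of each topic for this token,
                # given the assignments of all other tokens.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word, doc_topic


def top_words(topic_word, n=3):
    """Return up to n most frequent words for each topic."""
    result = []
    for counts in topic_word:
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        result.append([w for w, c in ranked[:n] if c > 0])
    return result


# Hypothetical pre-tokenized corpus, invented for illustration.
docs = [
    ["treaty", "war", "peace", "treaty"],
    ["war", "peace", "army"],
    ["poem", "novel", "verse"],
    ["novel", "poem", "poem", "verse"],
]

topic_word, doc_topic = lda_gibbs(docs, num_topics=2, iterations=100)
for number, words in enumerate(top_words(topic_word)):
    print(f"topic {number}: {words}")
```

The "topics" that come out are only lists of words that tend to occur together. Labeling them and deciding whether they are meaningful remains an interpretive task, and results can vary with the number of topics, the parameters, and the random seed.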
