A (very) brief introduction to topic modeling


What is topic modeling?


"A statistical representation of topics within a textual corpus of documents is referred to as topic models. Typically, each topic is represented by a set of related lemmas; these are often accompanied by weights to indicate the relative prominence of each lemma within a topic."

Arnold, Taylor, and Lauren Tilton. 2015. Humanities Data in R: Exploring Networks, Geospatial Data, Images, and Text.


Topic modeling provides a suite of algorithms to discover hidden thematic structure in large collections of texts. The results of topic modeling algorithms can be used to summarize, visualize, explore, and theorize about a corpus.

Blei, David M. 2012. “Topic Modeling and Digital Humanities.” Journal of Digital Humanities.
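
One widely used algorithm in this family is latent Dirichlet allocation (LDA). The following is a minimal sketch of LDA with collapsed Gibbs sampling in plain Python, meant only to show the moving parts rather than to serve as a production implementation. The function name `lda_gibbs`, the toy documents, and the values of `alpha`, `beta`, and `iterations` are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy LDA via collapsed Gibbs sampling (illustrative, not optimized).

    docs is a list of token lists. Returns (doc_topic, topic_word):
    per-document topic counts and per-topic word counts.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start by assigning every token to a random topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


# Hypothetical toy corpus: two "landscape" documents, two "religion" documents.
toy_docs = [
    "river forest landscape mountain river".split(),
    "forest mountain landscape forest".split(),
    "church mosque saint shrine church".split(),
    "saint shrine mosque church".split(),
]
doc_topic, topic_word = lda_gibbs(toy_docs, num_topics=2)

# Show each topic as a short list of (word, relative weight) pairs.
for k, counts in enumerate(topic_word):
    total = sum(counts.values())
    top = [(w, round(c / total, 2)) for w, c in counts.most_common(3) if c > 0]
    print(f"topic {k}: {top}")
```

Each topic comes out as a weighted list of words, which matches the description of topics as sets of lemmas with weights. Because the sampler is random, results on real corpora vary with the seed, the number of topics, and the number of iterations.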


Some examples of topic modeling

A visualization of the topic modeling results for a 350-document Environmental Humanities corpus.
Source:
Digital Environmental Humanities
https://dig-eh.org/dig-eh/TopicModelling/CircularDisciplines/


A visualization of co-occurring words with LANDSCAPE in topic models
Source: https://dig-eh.org/dig-eh/TopicModelling/Landscape/#Landscape
Digital Environmental Humanities


Exploratory topic modeling of Frederick W. Hasluck's 1929 "Christianity and Islam Under the Sultans."
Project: Visual Hasluck


How is topic modeling useful?

Topic modeling is great for:

  • discovering hidden themes, patterns, and word clusters beyond what a basic keyword search reveals
  • grouping / categorizing multiple documents / corpora based on identified "topics"
  • exploring / better understanding your dataset
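
Grouping documents by topic can be sketched as follows. The example assumes a topic model has already produced a list of topic proportions for each document; the file names, the proportions, and the helper name `group_by_dominant_topic` are hypothetical.

```python
from collections import defaultdict


def group_by_dominant_topic(doc_topics):
    """Group document names under the topic with the highest proportion."""
    groups = defaultdict(list)
    for name, proportions in doc_topics.items():
        dominant = max(range(len(proportions)), key=proportions.__getitem__)
        groups[dominant].append(name)
    return dict(groups)


# Hypothetical document-topic proportions (each row sums to 1).
doc_topics = {
    "essay_a.txt": [0.70, 0.20, 0.10],
    "essay_b.txt": [0.15, 0.05, 0.80],
    "essay_c.txt": [0.60, 0.30, 0.10],
}
groups = group_by_dominant_topic(doc_topics)
print(groups)  # {0: ['essay_a.txt', 'essay_c.txt'], 2: ['essay_b.txt']}
```

A document usually mixes several topics, so filing it under a single dominant topic is a simplification that is convenient for a first pass through a corpus.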

Some problems with topic modeling

Topic modeling is perfectly designed for workshops and demonstrations, since you don’t have to start with a specific research question. A group of people with different interests can just pour a collection of texts into the computer, gather round, and see what patterns emerge. Generally, interesting patterns do emerge: topic modeling can be a powerful tool for discovery. But it would be a mistake to take this workflow as paradigmatic for text analysis. Usually researchers begin with specific research questions, and for that reason I suspect we’re often going to prefer supervised models.

Underwood, Ted. 2015. “Seven Ways Humanists Are Using Computers to Understand Text.” The Stone and the Shell (blog). June 4, 2015.


Algorithms do not interpret their own failures, but their errors generate moments of rupture for the feminist literary scholar, in whose hands error and marginality expose the fault lines encoded in predictive methods such as classifying and organizing text. () I use large poetic corpora to interrogate the assumptions of topic modeling—that documents sharing similar words likewise share thematic coherence. Where most topic-modeling results demonstrate thematic coherence, mine represent discourse coherence; subsequently, the model points to ekphrastic poems by women who share similar discourses as their male counterparts, but do so ironically.

Rhody, Lisa Marie. 2016. “46. Why I Dig: Feminist Approaches to Text Analysis.”


How can I try topic modeling with my text(s)?

Topic modeling tools online

Voyant Tools

Task:
  • Go to https://voyant-tools.org/
  • Click on "Upload"
  • Choose 10-20 plain text files from the Wikipedia movie summaries dataset and upload them
  • On Voyant Tools' main panel, click on the "Click to choose another tool for this panel location" icon and select "Corpus Tools" > "Topics"
  • Copy the results or download an image.

  • Repeat the steps above, but this time select "Visualization Tools" > "Links".

What are some of the topics identified? Did you discover any meaningful semantic links or clusters?

Task:
  • Start again at https://voyant-tools.org/
  • Click on "Upload"
  • Choose the file "EN_1884_Twain,Mark_TheAdventuresofHuckleberryFinn_Novel.txt" from the "txtlab_Novel450" dataset.
  • Visualize the text using the "Topics" and "Mandala" tools, following the steps in the previous task.
  • Copy the results or download an image.

JSTOR's Text Analyzer

  • Drag and drop "EN_1884_Twain,Mark_TheAdventuresofHuckleberryFinn_Novel.txt" from the "txtlab_Novel450" dataset.
  • Once the file is processed and analyzed, review the following sections in the left menu:
    "Prioritized Terms", "Topics", "People", "Locations."

What kind of differences do you see in the analysis, structuring and visualization results between Voyant Tools and Text Analyzer?


MALLET

  • open-source machine learning tool written in Java
  • used for building / training topic models
  • operated from the command line

GUI component available here:
https://github.com/senderle/topic-modeling-tool

A great introduction to MALLET by Shawn Graham, Scott Weingart, and Ian Milligan (2017)
https://programminghistorian.org/en/lessons/topic-modeling-and-mallet
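
MALLET's `train-topics` command can write a topic keys file (via `--output-topic-keys`) in which each line holds a topic number, a weight, and the topic's top words, separated by tabs. The sketch below reads that layout in Python. The sample text and the function name `parse_topic_keys` are made up for illustration, and the exact layout may differ between MALLET versions, so check your own output file first.

```python
def parse_topic_keys(text):
    """Parse tab-separated topic-keys lines into (topic_id, weight, words)."""
    topics = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        topic_id, weight, words = line.split("\t", 2)
        topics.append((int(topic_id), float(weight), words.split()))
    return topics


# Hypothetical sample in the assumed topic-keys layout.
sample = "0\t0.25\triver forest landscape\n1\t0.25\tchurch saint shrine\n"
for topic_id, weight, words in parse_topic_keys(sample):
    print(topic_id, weight, ", ".join(words))
```

To use it on a real run, read the contents of your topic keys file into a string and pass that string to the function.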

R

R is a programming language and free software environment for statistical computing.
https://www.r-project.org/

Python

Python is a general-purpose programming language that is widely used for text analysis.

# count-list-items-1.py

wordstring = 'it was the best of times it was the worst of times '
wordstring += 'it was the age of wisdom it was the age of foolishness'

# Split the string on whitespace into a list of words.
wordlist = wordstring.split()

# For each word, in order, count how often it occurs in the whole list.
wordfreq = []
for w in wordlist:
    wordfreq.append(wordlist.count(w))

# Print the string, the word list, the counts, and the (word, count) pairs.
print("String\n" + wordstring + "\n")
print("List\n" + str(wordlist) + "\n")
print("Frequencies\n" + str(wordfreq) + "\n")
print("Pairs\n" + str(list(zip(wordlist, wordfreq))))

An example of counting word frequencies in Python from
the tutorial "Counting Word Frequencies with Python" by
William J. Turkel and Adam Crymble.
https://programminghistorian.org/en/lessons/counting-frequencies
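
The same counts can be produced more compactly with `collections.Counter` from Python's standard library. This is an alternative sketch, not part of the tutorial cited above; it records each distinct word once instead of repeating the count for every occurrence.

```python
from collections import Counter

wordstring = 'it was the best of times it was the worst of times '
wordstring += 'it was the age of wisdom it was the age of foolishness'

# Counter tallies every distinct word in a single pass over the list.
wordfreq = Counter(wordstring.split())

# most_common(n) returns the n most frequent words with their counts.
for word, count in wordfreq.most_common(4):
    print(word, count)
```

Word counts like these are the raw material that topic modeling tools work from, so checking them first is a quick way to spot stray tokens before modeling.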