--- title: "Studio 11: Text" layout: post label: studio geometry: margin=2cm --- # CS 100: Studio 11 ### Text Analysis ##### November 30, 2022 ### Instructions Today, you will be analyzing the transcripts of all the 2016 US Presidential Primary Debates. You will be exploring the text to detect patterns in the language used by the candidates. First, you will look at all the candidates individually, and then you will separate them by party to see if any patterns emerge that distinguish the two parties. Upon completion of all tasks, a TA will give you credit for today's studio. If you do not manage to complete all the assigned work during the studio period, do not worry. You can continue to work on this assignment until Sunday, December 4 at 7 PM. Come by TA hours any time before then to show us your completed work and get credit for today's studio. ### Objectives By the end of this studio, you will know: - How to calculate word frequencies and perform sentiment analysis in R - How to visualize text data using ggplot2 and word clouds ### R Packages Please begin by installing (if you haven’t already), and then loading, the following packages: 1. For data wrangling: `dplyr`, `tidyr`, `stringr`. 2. For text analysis: `tidytext`. 3. For visualization: `wordcloud`. ### The Data Download the data [here](http://cs.brown.edu/courses/cs100/studios/data/10/primary_debates.csv). The source and documentation for the data can be found [here](https://www.kaggle.com/kinguistics/2016-us-presidential-primary-debates). ### Preparing the Data Begin by reading in the data: ~~~ raw_debates <- read.csv('http://www.cs.brown.edu/courses/cs100/studios/data/11/primary_debates.csv', stringsAsFactors = FALSE) ~~~ View the data. You’ll see that the data contain all the utterances said by the debate candidates, moderators, and audiences. We are interested only in what was said by the candidates. Discuss with your partner how you might restrict the data frame to only these utterances? Our solution appears below: ~~~ candidates <- c('Bush', 'Carson', 'Chafee', 'Christie', 'Clinton', 'Cruz', 'Fiorina', 'Gilmore', 'Graham', 'Huckabee', 'Jindal', 'Kasich', 'O\'Malley', 'Pataki', 'Paul', 'Perry', 'Rubio', 'Sanders', 'Santorum', 'Trump', 'Walker', 'Webb') debates <- raw_debates %>% filter(Speaker %in% candidates) ~~~ Note that O’Malley’s name is entered as 'O\'Malley'. The backtick is a special character in R, used to delimit strings. We tell RStudio not to read the backtick in O’Malley’s name as a backtick by inserting a backslash in front of it. How many speakers are there? And how many parties? It turns out there are three different values for "Party": "Democratic," "Republican," and "Republican Undercard." The "Republican Undercard" debates were held for the less popular Republican candidates. In this studio, we will consider "Republican" and "Republican Undercard" to be one party. Replace all instances of "Republican Undercard" with "Republican". For our initial analysis, we are only concerned with the "Party" and "Text" variables. Create a new data frame with only these two variables: ~~~ debate_text <- debates %>% select(Party, Text) ~~~ ### Tidy Text To conduct today’s text analysis, you’ll be using the `tidytext` package. The creators of this package define the [tidy text format](http://tidytextmining.com/tidytext.html) as a table with **one-token-per-row**. Once your data are in this format, you can use various functions in the package to summarize the characteristics of your text easily. 
### Tidy Text

To conduct today’s text analysis, you’ll be using the `tidytext` package. The creators of this package define the [tidy text format](http://tidytextmining.com/tidytext.html) as a table with **one-token-per-row**. Once your data are in this format, you can use various functions in the package to summarize the characteristics of your text easily.

*Note:* After this studio, if you would like to learn more about `tidytext`, check out [this](http://tidytextmining.com/) helpful website!

For this first part of the analysis, we will construct a **unigram model**. In other words, we will consider one token to be one word, so one-*word*-per-row in tidy text format. But the current table consists of one-*utterance*-per-row. We thus need to **tokenize** the utterances, meaning split them up into separate words.

To do so, you can use the `tidytext` function `unnest_tokens`. Run `help(unnest_tokens)` to see the documentation for this function. You’ll see that it allows you to split a column into tokens, exactly as desired. Let’s stick with the default option of considering each word to be a token. Invoke the `unnest_tokens` function and then check that `debate_words` indeed consists of one-word-per-row.

~~~
debate_words <- debate_text %>%
  unnest_tokens(word, Text)
~~~

Observe that `unnest_tokens` also removes all punctuation and converts all tokens (words, in our case) to lowercase.

### Word Frequencies

Now that the data are in the tidy text format, we can start analyzing! Let’s begin by calculating the frequencies of the words said by all the candidates. Run the code below to tally the occurrences of each word, and then sort the words in descending order:

~~~
debate_words %>%
  count(word, sort = TRUE)
~~~

*Note:* `count` is shorthand for `group_by` followed by `tally`, and `tally` is shorthand for `summarize`.

Take a look at the most common words. They are "the," "to," "and," etc. Unfortunately, these words aren’t providing us with any useful information! These insignificant words are what we call **stop words**, since they carry little semantic meaning. Let’s filter out the stop words, so that we are only left with more meaningful words.

The `tidytext` package preloads a data set called `stop_words`. All the words in this table can be found [here](http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop). To remove all the words in `stop_words` from your data, you can perform an `anti_join`. When applied to two data frames *X* and *Y*, `anti_join(X, Y)` returns a new data frame with all of the rows in *X* that are not in *Y*. If you’re familiar with set theory, this operation is equivalent to the *set difference* operation. Let’s `anti_join` with `stop_words` and then recompute word counts:

~~~
word_count <- debate_words %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
~~~

**Question #1: What are the five most common (non-stop) words said by the candidates?**

### Visualizing Word Frequencies

Next, let’s visualize some of the most common words said by the candidates. We’ll first create a bar plot using `ggplot`. Recall that `geom_bar` specifies that you want a bar chart, and that `coord_flip()` turns it into a horizontal bar chart. Create a bar chart depicting the 20 most common words:

~~~
head(word_count, 20) %>%
  ggplot(aes(x = word, y = n)) +
  geom_bar(stat = "identity") +
  coord_flip()
~~~

The chart would be easier to read if the words were sorted in order of their frequencies. You can achieve this effect by using the `reorder` option in `aes`:

~~~
head(word_count, 20) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_bar(stat = "identity") +
  coord_flip()
~~~

Next, let’s create a word cloud of the 100 most common words. In the word cloud, the more common a word is, the larger it will appear. The `with` function in R evaluates an expression using the columns of a data frame, so that you can refer to those columns by name.
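For example, here is a quick illustration of `with` on R’s built-in `mtcars` data frame (unrelated to the debates):

~~~
# Compute the mean of the mpg column without having to write mtcars$mpg
with(mtcars, mean(mpg))
~~~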
In this case, let’s use `with` to apply the wordcloud expression to `word_count`:

~~~
word_count %>%
  with(wordcloud(word, n, max.words = 100))
~~~

There are several ways to customize a word cloud. We’ll introduce you to a few.

Place the most common words in the center of the cloud:

~~~
word_count %>%
  with(wordcloud(word, n, max.words = 100, random.order = FALSE))
~~~

Change the color of the text:

~~~
word_count %>%
  with(wordcloud(word, n, max.words = 100, random.order = FALSE, colors = 'darkcyan'))
~~~

Use a color palette so that different words have different colors. This cloud uses the 8 colors from the `RColorBrewer` palette `Dark2`. Check out [this cheatsheet](http://bc.bojanorama.pl/wp-content/uploads/2013/04/rcolorsheet.pdf) to view the various colors and palettes you can use in R.

~~~
word_count %>%
  with(wordcloud(word, n, max.words = 100, random.order = FALSE, colors = brewer.pal(8, "Dark2")))
~~~

### Democrats vs. Republicans

Now that we’ve covered some of the basics of tidytext, and some basic text visualizations, we can analyze the utterances of the two parties, in search of similarities and differences.

Recalculate the word counts, this time separated by party, and save the result in a new data frame called `word_count_by_party`:

~~~
word_count_by_party <- debate_words %>%
  anti_join(stop_words) %>%
  count(Party, word, sort = TRUE)
~~~

Now filter this data frame by party, and then generate two bar charts, one for each party, of the 15 most common words said by members of each. Generate these bar charts as before, using `ggplot`.

**Question #2: What words are in the top 15 for Democrats but not for Republicans? How about vice-versa?**

*Hint:* If you would like to visualize both graphs side-by-side, install and load the package `gridExtra`. Then run the code below (replacing `dem_plot` and `rep_plot` with the names of your two bar charts):

~~~
grid.arrange(dem_plot, rep_plot, nrow = 1, ncol = 2)
~~~

You probably noticed that there was a decent amount of overlap between the most frequent words said by candidates in the two parties. It makes sense that words such as "people", "country", and "president" would be said frequently by all candidates. But what if we wanted to find words that highlight the differences between the concerns of the two parties?

We can do so by calculating what’s called a word’s **term frequency-inverse document frequency (tf-idf)** score, which is a measure of how important a term is in a given document, relative to its appearance in a collection of documents. Tf-idf is high if a term is common in one particular document, but rare in the entire collection. Tf-idf is low if a term is common across the collection: e.g., stop words have low tf-idf scores, since they appear frequently in all documents.

We will consider a corpus consisting of two "documents", one consisting of the Democrats’ text, and the other, the Republicans’ text. Then we will calculate tf-idf scores to see which terms are used specifically by one party and not the other.

As you probably expected by now, R provides built-in functionality to compute tf-idf scores. In the `tidytext` library, you can use the `bind_tf_idf` function, which calculates the scores and then stores them in a new column that it adds to the data frame.
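For reference, the usual formulation of tf-idf (which, to the best of our knowledge, is what `bind_tf_idf` implements) is

$$
\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{|\{d : t \in d\}|}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t),
$$

where $n_{t,d}$ is the number of times term $t$ appears in document $d$, and $N$ is the number of documents in the collection. With only two "documents" (one per party), a word used by both parties gets $\mathrm{idf} = \ln(2/2) = 0$, and hence a tf-idf score of 0, while a word used by only one party gets $\mathrm{idf} = \ln 2$.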
Use the code below to first calculate the tf-idf scores for all words, and then extract the 15 words with the highest scores by party:

~~~
debates_tf_idf_top <- word_count_by_party %>%
  bind_tf_idf(word, Party, n) %>%
  arrange(desc(tf_idf)) %>%
  group_by(Party) %>%
  top_n(15)
~~~

Next, let’s visualize these results. As above, you should create a horizontal bar chart using `ggplot`, but you can also add some new parameters:

* `fill = Party` colors the bars in the plot based on which party the word is associated with.
* `scale_fill_manual` manually sets these colors for the two different parties, which overrides R’s default coloring scheme.
* `alpha` controls the opacity of the bars; `alpha = 1` depicts fully opaque bars.
* `facet_wrap` depicts the two graphs on the same plot.

~~~
ggplot(debates_tf_idf_top, aes(x = reorder(word, tf_idf), y = tf_idf, fill = Party)) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_wrap(~Party, ncol = 2, scales = "free") +
  coord_flip() +
  scale_fill_manual(values = c("Democratic" = "blue", "Republican" = "red"))
~~~

**Question #3: For Democrats, what are the 5 words with the highest tf-idf values? What about for Republicans?**

### Bigram Model

In our previous analysis, we tokenized our corpus by word. But we can also tokenize into consecutive sequences of words, or **n-grams**. A one-word sequence is called a *unigram*, a two-word sequence a *bigram*, a three-word sequence a *trigram*, and so on.

Let’s take a look at the bigrams said by the candidates. We’ll use the `unnest_tokens` function again, but this time to tokenize our text into bigrams, like this:

~~~
debate_bigrams <- debate_text %>%
  unnest_tokens(bigram, Text, token = "ngrams", n = 2)
~~~

Next, take a look at the most common bigrams:

~~~
debate_bigrams %>%
  count(bigram, sort = TRUE)
~~~

It looks like we need to remove stop words again! But it’s a bit more complicated to remove stop words from bigrams. Our strategy will be to remove a bigram if a stop word appears as either the first or the second word in the bigram.

The following code makes use of the `separate` and `unite` functions in the `tidyr` library. Recall that `separate` splits one column in a data frame into multiple columns, and that `unite` reverses that operation: i.e., it merges multiple columns into one. We use `separate` to split the bigram column into two different columns, `word1` and `word2`. Then, we filter out bigrams where either one of the words is a stop word. Finally, we use `unite` to recreate the bigrams that don’t contain stop words.

~~~
debate_bigrams <- debate_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  unite(bigram, word1, word2, sep = " ")
~~~

Now we can calculate the bigram frequencies across both parties, and visualize the most frequent ones:

~~~
bigram_count <- debate_bigrams %>%
  count(bigram, sort = TRUE)

head(bigram_count, 20) %>%
  ggplot(aes(x = reorder(bigram, n), y = n)) +
  geom_bar(stat = "identity") +
  coord_flip()
~~~

**Question #4: What are the 5 most frequent bigrams?**

Now let’s look at the parties separately:

~~~
bigram_count_by_party <- debate_bigrams %>%
  count(Party, bigram, sort = TRUE)
~~~

Finally, calculate the tf-idf for the bigrams, just as we calculated the tf-idf for words. Then use bar charts to visually compare the bigrams with the highest tf-idf values for Democrats vs. those for Republicans. Refer back to the earlier part of this studio for more guidance as necessary.
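If you get stuck, the unigram code adapts almost directly. Here is a minimal sketch (the variable name `bigrams_tf_idf_top` is our own choice); you can then reuse the earlier `ggplot` code, swapping `word` for `bigram`:

~~~
bigrams_tf_idf_top <- bigram_count_by_party %>%
  bind_tf_idf(bigram, Party, n) %>%   # bigram = term, Party = document, n = count
  arrange(desc(tf_idf)) %>%
  group_by(Party) %>%
  top_n(15)
~~~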
**Question #5: For Democrats, what are the 3 bigrams with the highest tf-idf values? How about for Republicans?**

### Sentiment Analysis [Time Permitting]

In the last part of this studio, you will carry out a sentiment analysis of the debate text. You will do so by joining the debate text with a lexicon. A **lexicon** is like a dictionary, but instead of defining the words in a vocabulary, it ascribes to each word knowledge about its linguistic significance, which in the case of sentiment analysis is its emotional connotation.

The `bing` lexicon in the `tidytext` library categorizes words as being either positive or negative. To view this lexicon, enter `get_sentiments("bing")`. You can use this lexicon to determine whether the candidates’ words are generally positive or negative.

Run the code below to join the debate words with their sentiments, and then count the word frequencies. Note that counts are stored in a variable named "n".

~~~
debate_sentiment <- debate_words %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE)
~~~

To visualize the results, we mutate "n" to be negative if the sentiment of the word is negative, and then create a bar chart:

~~~
head(debate_sentiment, 25) %>%
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  ggplot(aes(x = reorder(word, n), y = n, fill = sentiment)) +
  geom_bar(alpha = 0.8, stat = "identity") +
  coord_flip()
~~~

**Question #6: What is the most frequently occurring positive word? What is the most frequently occurring negative word?**

The `nrc` lexicon in the `textdata` library associates words with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). To view this lexicon, install the `textdata` package, load the library, and then enter `get_sentiments("nrc")`.

Let’s find out which of the candidates’ most frequent words are associated with "joy". First, extract the "joy" words from the `nrc` lexicon:

~~~
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")
~~~

Next, we can match these words with the words from the debate, using a **semi-join**. From the dplyr documentation:

> semi_join(x, y): Return all rows from x where there are matching values in y,
> keeping just columns from x. A semi join differs from an inner join because
> an inner join will return one row of x for each matching row of y, where a
> semi join will never duplicate rows of x. This is a filtering join.

After joining, you can sort the words based on frequency:

~~~
debate_words %>%
  semi_join(nrc_joy) %>%
  count(word, sort = TRUE)
~~~

**Question #7: What is the most frequent word associated with joy?**

Now take a look at some of the other emotions expressed by the candidates.

**Question #8: What is the most frequent word associated with fear? How about with disgust?**

### End of Studio

When you are done, please call over a TA to review your work and check you off for this studio. If you do not finish within the two-hour studio period, remember to come to TA office hours to get checked off before Sunday (12/4).