# Half-truths | Truth, Fiction and Conspiracy in the 'Post-Factual Age'

German: Halb-Wahrheiten | Wahrheit, Fiktion und Konspiration im 'postfaktischen Zeitalter'

### Introduction of the research project

> Name: Ying Ying Lee
> Instructor: Prof. Dr. Cornelius Puschmann
> Field: computational linguistics
> Time: 01 Sep 2020 - 31 Dec 2020
> Project details: https://halbwahrheiten.philhist.unibas.ch/de/home/ (the site is in German)

Hello, I am Ying Ying Lee. I worked as a research assistant on this project, offered jointly by the University of Bremen (my university) and the University of Basel.

The data sets used in this project are in German. The first data set, fbslerka, contains 12 selected Facebook pages: "Alternative für Deutschland AfD", "JUNGE FREIHEIT", "Dr. Frauke Petry", "Beatrix von Storch", "KOPP Online", "unzensuriert.at", "Netzfrauen", "COMPACT-Magazin", "KenFM", "RT Deutsch", "gegenargument.at" and "PI-News". The second data set, facebook2, contains six selected Facebook pages: "alternativefuerde", "faz", "pegidaevdresden", "sz", "welt" and "zeit".

The whole analysis is done in the programming language R and is organised into three parts: general setting and preprocessing, Latent Dirichlet Allocation (LDA), and structural topic models (STM).

If you have any feedback, please feel free to contact me at yingying.lee025@gmail.com. Thank you.

## General setting and preprocessing

```r=
# library("readtext")
library("quanteda")
library("tidyverse")
library("dplyr")
library("janeaustenr")
library("tidytext")

theme_set(theme_minimal())
```

### Load data

```r=
load("fbslerka_kommentare.RData")
load("woerterbuch_quanteda2.RData")
```

### Create a corpus from fbslerka

```r=
fbslerka.kommentare.korpus <- corpus(fbslerka.kommentare,
                                     docid_field = "id",
                                     text_field = "message")
fbslerka.kommentare.korpus
docvars(fbslerka.kommentare.korpus)
```

![](https://i.imgur.com/SRYTqpM.png)

### View word frequency

```r=
fbslerka.kommentare.dfm.init <- fbslerka.kommentare.korpus %>%
  tokens(remove_url = TRUE, remove_punct = TRUE,
         remove_symbols = TRUE, remove_numbers = TRUE) %>%
  dfm(remove = c(stopwords("english"), stopwords("german")))

fbslerka.kommentare.stat <- fbslerka.kommentare.dfm.init %>%
  textstat_frequency()
head(fbslerka.kommentare.stat, 20)
```

![](https://i.imgur.com/T2ZZtbo.png)

### Plot word frequency

```r=
ggplot(fbslerka.kommentare.stat[1:20, ],
       aes(x = reorder(feature, frequency), y = frequency)) +
  ggtitle("20 most frequent words in fbslerka") +
  geom_point() +
  coord_flip() +
  labs(x = NULL, y = "Frequency")
```
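Note: the next code block uses a weighted dfm, `fbslerka.kommentare.weight`, which is not constructed anywhere in the snippets above. A minimal sketch of how it can be derived from the initial dfm, assuming quanteda's proportional weighting scheme (so that the grouped frequencies below are relative, not absolute):

```r=
# Assumption: the weighted dfm used below is the initial dfm scaled to
# within-document proportions ("relative frequency").
fbslerka.kommentare.weight <- fbslerka.kommentare.dfm.init %>%
  dfm_weight(scheme = "prop")
```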
### Calculate relative frequency by page_name

```r=
fbslerka.kommentare.stat2 <- textstat_frequency(fbslerka.kommentare.weight,
                                                n = 10, groups = "page_name")

# plot frequencies
ggplot(data = fbslerka.kommentare.stat2,
       aes(x = factor(nrow(fbslerka.kommentare.stat2):1), y = frequency)) +
  geom_point() +
  facet_wrap(~ group, scales = "free", ncol = 4) +
  coord_flip() +
  scale_x_discrete(breaks = nrow(fbslerka.kommentare.stat2):1,
                   labels = fbslerka.kommentare.stat2$feature) +
  labs(x = NULL, y = "Relative frequency")
```

![](https://i.imgur.com/nDgmXnN.png)

### Calculate the document-feature matrix and remove sentences with fewer than 40 tokens

Data set: fbslerka; trimming reduces 742,265 comments to 27,006 documents, which the dfm then groups by page_name.

```r=
docvars(fbslerka.kommentare.korpus, "monat") <- str_sub(fbslerka.kommentare.korpus$created_time,
                                                        start = 1, end = 7)

selected_words <- readLines("./selected_words_182.txt")
stoppwoerter <- c(selected_words, stopwords("german"), stopwords("english"))

# load the precomputed dfm (optional; the pipeline below recreates it)
load("fbslerka.kommentare.dfm.RData")

# remove punctuation, URLs and stopwords before constructing ngrams
fbslerka.kommentare.dfm <- fbslerka.kommentare.korpus %>%
  corpus_trim(min_ntoken = 40) %>%
  # corpus_sample(size = 50000) %>%
  tokens(remove_url = TRUE, remove_punct = TRUE,
         remove_symbols = TRUE, remove_numbers = TRUE) %>%
  tokens_remove(stoppwoerter) %>%
  # tokens_ngrams(1:3) %>%
  dfm(remove = stoppwoerter, groups = "page_name") %>%
  dfm_select(min_nchar = 3) %>%
  dfm_trim(min_termfreq = 2)

fbslerka.kommentare.dfm
head(dfm_sort(fbslerka.kommentare.dfm, decreasing = TRUE, margin = "both"),
     n = 12, nf = 10)
```

```
docs                             #comvsafd   afd merkel frau volk politik flüchtlinge europa politiker islam
Alternative für Deutschland AfD      11233  2653   1019  764  656     577         659    489       520   361
Dr. Frauke Petry                         0  1324    466  608  447     304         323    308       293   321
JUNGE FREIHEIT                           0   450    471  355  349     265         210    240       278   416
Beatrix von Storch                      44   586    196  406  157     118         121    154       116   171
KenFM                                    0    66     69   84   73      92          58     73        47    56
KOPP Online                              0    67    174   87  165     117          90    121       101    52
[ reached max_ndoc ... 6 more documents ]
```

### View word frequency after cleaning fbslerka

```r=
fbslerka.kommentare.stat <- fbslerka.kommentare.dfm %>%
  textstat_frequency()
head(fbslerka.kommentare.stat, 50)
```

![](https://i.imgur.com/HljpVtX.png)

### The second data set: facebook2

```r=
load("facebook2.RData")
load("woerterbuch_quanteda2.RData")
# summary(woerterbuchK)
```

### Create a list of words to remove using statistical methods

```r=
facebook2.korpus.dfm.init <- facebook2.korpus %>%
  tokens(remove_url = TRUE, remove_punct = TRUE,
         remove_symbols = TRUE, remove_numbers = TRUE) %>%
  dfm(remove = c(stopwords("english"), stopwords("german")))

facebook2.korpus.stat <- facebook2.korpus.dfm.init %>%
  textstat_frequency()
head(facebook2.korpus.stat, 20)
```

### Calculate relative frequency by source

```r=
# assumed step, analogous to fbslerka above: weight the initial dfm
# to relative frequencies before grouping
facebook2.korpus.weight <- facebook2.korpus.dfm.init %>%
  dfm_weight(scheme = "prop")

facebook2.korpus.stat2 <- textstat_frequency(facebook2.korpus.weight,
                                             n = 10, groups = "source")

# plot frequencies
ggplot(data = facebook2.korpus.stat2,
       aes(x = factor(nrow(facebook2.korpus.stat2):1), y = frequency)) +
  geom_point() +
  facet_wrap(~ group, scales = "free", ncol = 3) +
  coord_flip() +
  scale_x_discrete(breaks = nrow(facebook2.korpus.stat2):1,
                   labels = facebook2.korpus.stat2$feature) +
  labs(x = NULL, y = "Relative frequency")
```

![](https://i.imgur.com/JyiMc6u.png)

### Calculate the document-feature matrix and remove sentences with fewer than 40 tokens

Data set: facebook2; trimming reduces 950,647 comments to 82,426 documents, which the dfm then groups by month.

```r=
selected_words <- readLines("./selected_words_182.txt")
stoppwoerter <- c(selected_words, stopwords("german"), stopwords("english"))

load("facebook2.RData")
docvars(facebook2.korpus, "monat") <- str_sub(facebook2.stats$created_time,
                                              start = 1, end = 7)
# save(facebook2.dfm, file = "facebook2.dfm.RData")

facebook2.dfm <- facebook2.korpus %>%
  corpus_trim(min_ntoken = 40) %>%
  # corpus_sample(size = 50000) %>%
  tokens(remove_url = TRUE, remove_punct = TRUE,
         remove_symbols = TRUE, remove_numbers = TRUE) %>%
  tokens_remove(stoppwoerter) %>%
  # tokens_ngrams(1:3) %>%
  dfm(remove = stoppwoerter, groups = "monat") %>%
  dfm_select(min_nchar = 3) %>%
  dfm_trim(min_termfreq = 2)

facebook2.dfm
head(dfm_sort(facebook2.dfm, decreasing = TRUE, margin = "both"),
     n = 10, nf = 20)
```

```
time     kinder frau flüchtlinge  afd merkel politik frauen europa volk regierung
2016-01     269  714         575  585    601     386    789    247  340       299
2015-01     265  199          96   37    123     290    170    222  337       167
2015-09     360  316         802   60    288     214    194    391  229       205
2016-03     302  458         419 1075    478     372    293    342  247       203
2016-02     321  483         496  838    436     336    328    273  325       204
2015-11     242  292         478   72    319     248    193    327  250       329
[ reached max_ndoc ... 4 more documents, reached max_nfeat ... 10 more features ]
```

## Latent Dirichlet Allocation (LDA)

```r=
library("gsl")
library("topicmodels")
library("stringr")

# Assign an arbitrary number of topics
topic.count <- 20

# Convert the DFMs to topicmodels objects
fbslerka.topicmodels <- convert(fbslerka.kommentare.dfm, to = "topicmodels")
facebook2.topicmodels <- convert(facebook2.dfm, to = "topicmodels")
```
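Fixing `topic.count` at 20 is an arbitrary choice. As a hypothetical sketch of how candidate values of K could be compared (the seed and the candidate grid are my assumptions; perplexity on the training data is only a crude guide, and a held-out split would be more rigorous):

```r=
# Compare candidate topic counts by training-data perplexity (lower is better).
candidate.k <- c(10, 20, 30, 40)
fbslerka.perp <- sapply(candidate.k, function(k) {
  model <- LDA(fbslerka.topicmodels, k = k, control = list(seed = 42))
  perplexity(model)
})
plot(candidate.k, fbslerka.perp, type = "b",
     xlab = "Number of topics K", ylab = "Perplexity")
```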
### Calculate the LDA model

```r=
# Either load the precomputed models ...
load("fbslerka.lda.20.RData")
load("facebook2.lda.20.RData")
# ... or fit them from scratch (slow)
fbslerka.lda.20 <- LDA(fbslerka.topicmodels, topic.count)
facebook2.lda.20 <- LDA(facebook2.topicmodels, topic.count)

as.data.frame(terms(fbslerka.lda.20, 6))
as.data.frame(terms(facebook2.lda.20, 6))
```

Result of fbslerka (6 rows; topics 1-9 of 20 shown):

```csvpreview {header="true"}
Topic 1,Topic 2,Topic 3,Topic 4,Topic 5,Topic 6,Topic 7,Topic 8,Topic 9
wasser,österreich,petry,#comvsafd,storch,cdu,bank,#comvsafd,#comvsafd
kaufen,zahlen,parteien,parteien,platz,usw,russland,cdu,letzten
fleisch,österreicher,hätten,linke,rassisten,kirche,ken,umfrage,cdu
nestle,grünen,altparteien,cdu,grenze,grünen,central,comvsafd,fdp
tiere,wien,eigene,petry,gewalt,unseren,kenfm,sonstige,linke
essen,asylanten,versuchen,grüne,grenzen,bevölkerung,syrien,grenzen,usw
```

### Terms > Topics

The values in `@beta` describe the association of each term with each topic, stored as log probabilities (hence the negative numbers). The share of each topic in a document is stored in `@gamma`; see the next section.

```r=
head(as.data.frame(t(fbslerka.lda.20@beta), row.names = fbslerka.lda.20@terms))
head(as.data.frame(t(facebook2.lda.20@beta), row.names = facebook2.lda.20@terms))
```

```csvpreview {header="true"}
Terms,Topic 1,Topic 2,Topic 3,Topic 4,Topic 5,Topic 6,Topic 7,Topic 8,Topic 9
weinende,-602.57,-10.60,-11.71,-533.69,-208.15,-493.37,-9.62,-11.82,-660.97
zigfachen,-693.72,-701.06,-705.97,-711.44,-670.27,-702.99,-10.24,-711.12,-716.80
sexuellen,-10.55,-11.01,-9.42,-9.98,-9.60,-9.05,-7.70,-9.51,-9.12
übergriffen,-10.55,-10.60,-12.53,-11.37,-10.30,-11.13,-8.03,-10.52,-9.21
menge,-8.47,-8.23,-9.12,-8.32,-7.84,-8.36,-8.15,-8.43,-8.26
schlägereien,-654.78,-11.70,-17.39,-538.42,-11.40,-653.08,-9.31,-678.33,-104.59
```
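Since `@beta` is on the log scale, it is often easier to read as plain probabilities. A minimal sketch using `tidy()` from tidytext (loaded at the top; the helper object name is my own) to list each topic's top terms as probabilities:

```r=
# tidy() exponentiates beta and returns one row per (topic, term) pair
fbslerka.topic.terms <- tidy(fbslerka.lda.20, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 6) %>%
  ungroup() %>%
  arrange(topic, -beta)
fbslerka.topic.terms
```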
### Documents > Topics

`@gamma` holds the topic shares per document.

```r=
head(as.data.frame(fbslerka.lda.20@gamma, row.names = fbslerka.lda.20@documents))
head(as.data.frame(facebook2.lda.20@gamma, row.names = facebook2.lda.20@documents))
```

```csvpreview {header="true"}
Documents,V1,V2,V3,V4,V5,V6,V7,V8,V9
Alternative für Deutschland (AfD),9.106133e-08,9.106133e-08,7.460871e-04,1.845340e-01,9.106135e-08,9.106135e-08,9.106134e-08,1.418453e-01,2.997342e-01
Beatrix von Storch,3.477672e-07,3.477672e-07,1.711327e-02,3.477671e-07,8.963264e-01,3.477671e-07,3.477671e-07,3.477671e-07,3.477672e-07
COMPACT-Magazin,8.989153e-07,8.989153e-07,8.989153e-07,8.989153e-07,8.989153e-07,8.989153e-07,8.989153e-07,8.989153e-07,8.989153e-07
Dr. Frauke Petry,1.882329e-07,1.882329e-07,1.533403e-01,1.882329e-07,1.882329e-07,1.882329e-07,1.882329e-07,1.882329e-07,1.882329e-07
gegenargument.at,2.737598e-06,2.737598e-06,2.737598e-06,2.737598e-06,2.737598e-06,2.737598e-06,2.737598e-06,2.737598e-06,2.737598e-06
JUNGE FREIHEIT,1.995489e-07,1.995489e-07,1.028358e-04,1.995490e-07,1.995489e-07,6.937254e-01,1.995489e-07,1.995490e-07,1.995490e-07
```

### Plot the similarity among topics

```r=
fbslerka.similarity <- as.data.frame(fbslerka.lda.20@beta) %>%
  scale() %>%
  dist(method = "euclidean") %>%
  hclust(method = "ward.D2")

# par(mar = c(0, 4, 4, 2))
plot(fbslerka.similarity,
     main = "LDA topic similarity of fbslerka by features (20 topics)",
     xlab = "", sub = "")
```

![](https://i.imgur.com/V2DCJra.png)

```r=
facebook2.similarity <- as.data.frame(facebook2.lda.20@beta) %>%
  scale() %>%
  dist(method = "euclidean") %>%
  hclust(method = "ward.D2")

# par(mar = c(0, 4, 4, 2))
plot(facebook2.similarity,
     main = "LDA topic similarity of facebook2 by features (20 topics)",
     xlab = "", sub = "")
```

![](https://i.imgur.com/JMEWDeU.png)

## Structural topic models (STM)

STM makes it possible to include document metadata in the topic model, and it offers an alternative initialisation mechanism ("Spectral").

```r=
library(stm)
```

Convert the DFM (with `convert()`) and calculate the STM (with `stm()`).

### Topic proportion by page_name (fbslerka)

```r=
topic.count <- 20

fbslerka.stm <- convert(fbslerka.kommentare.dfm, to = "stm")
fbslerka.stm$meta$page_name2 <- as.factor(fbslerka.stm$meta$page_name)  # 12 levels

fbslerka.stm.20 <- stm(
  fbslerka.stm$documents,
  fbslerka.stm$vocab,
  K = topic.count,
  prevalence = ~ page_name2,
  data = fbslerka.stm$meta,
  init.type = "Spectral"
)

# labelTopics() needs the fitted model, so it must run after stm()
fbslerka.stm.labels <- labelTopics(fbslerka.stm.20, 1:topic.count)

prep.fbslerka.20 <- estimateEffect(1:topic.count ~ page_name2, fbslerka.stm.20,
                                   meta = fbslerka.stm$meta, uncertainty = "Global")
summary(prep.fbslerka.20, topics = 1)

par(mfrow = c(3, 2))
for (i in 1:20) {
  plot(prep.fbslerka.20, "page_name2", method = "pointestimate", topics = i,
       model = fbslerka.stm.20, printlegend = FALSE,
       main = paste0(fbslerka.stm.labels$prob[i, 1:3], collapse = ", "),
       xlab = "Topic Proportion", yaxt = "n", labeltype = "custom",
       custom.labels = fbslerka.stm$meta$page_name)
}
```

### Topic proportion by month (facebook2)

```r=
facebook2.stm <- convert(facebook2.dfm, to = "stm")

# Either load the precomputed model ...
load("facebook2.stm.20.RData")
# ... or fit it from scratch (slow)
facebook2.stm.20 <- stm(
  facebook2.stm$documents,
  facebook2.stm$vocab,
  K = topic.count,
  prevalence = ~ monat,
  data = facebook2.stm$meta,
  init.type = "Spectral"
)

facebook2.stm.labels <- labelTopics(facebook2.stm.20, 1:topic.count)

# numeric index for the 18 monthly documents, used as a continuous covariate
facebook2.stm$meta$monat2 <- 1:18

prep.facebook2.20 <- estimateEffect(1:topic.count ~ monat2, facebook2.stm.20,
                                    meta = facebook2.stm$meta, uncertainty = "Global")
summary(prep.facebook2.20, topics = 1)

par(mfrow = c(3, 3))
for (i in 1:20) {
  plot(prep.facebook2.20, "monat2", method = "continuous", topics = i,
       model = facebook2.stm.20, printlegend = FALSE,
       main = paste0(facebook2.stm.labels$prob[i, 1:3], collapse = ", "),
       ylab = "Topic Proportion", xaxt = "n")
  axis(side = 1, at = c(2, 10, 17), labels = c("2015-02", "2015-10", "2016-05"))
}
```

![](https://i.imgur.com/34qQYBF.png)
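As a complement to the LDA similarity dendrograms above, the stm package can also estimate correlations between the fitted topics directly. A minimal sketch (not part of the original analysis), assuming the igraph package is available for the network plot:

```r=
# Estimate and plot topic correlations for the fbslerka STM;
# positively correlated topics are connected in the resulting graph.
library("igraph")
fbslerka.stm.corr <- topicCorr(fbslerka.stm.20)
plot(fbslerka.stm.corr)
```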
### Print the terms that appear in each topic

```r=
as.data.frame(t(labelTopics(fbslerka.stm.20, n = 10)$prob))
as.data.frame(t(labelTopics(facebook2.stm.20, n = 10)$prob))
```

Excerpt of the fbslerka output (topics 9 to 14):

```
V9          V10          V11         V12         V13          V14
bank        wahrheit     petry       cdu         mitschreiben absolut
ken         freiheit     parteien    bevölkerung absolut      glauben
central     usw          cdu         grünen      invasoren    vorstellen
kenfm       wählen       altparteien kirche      gemeint      gegenüber
russland    völker       grenzen     parteien    glauben      invasoren
bevölkerung bevölkerung  wählen      usw         fazit        übelste
syrien      habt         frauke      unseren     vorstellen   nochmal
selber      compact      eigene      passiert    nochmal      schützen
frieden     lässt        bevölkerung linken      übelste      eur
movie       deutschlands grünen      asylanten   drogen       gemeint
```

### Get the share of the different topics in the overall corpus

```r=
plot(
  fbslerka.stm.20,
  type = "summary",
  text.cex = 0.7,
  main = "STM topic shares of fbslerka (20 topics)",
  xlab = "Share estimation"
)
```

![](https://i.imgur.com/r4uutxN.png)

```r=
plot(
  facebook2.stm.20,
  type = "summary",
  text.cex = 0.7,
  main = "STM topic shares of facebook2 (20 topics)",
  xlab = "Share estimation",
  xlim = c(0, .15)
)
```

### Visualize the words of a topic with a word cloud

Here is the word cloud of topic 3 in fbslerka.

```r=
if (!require("wordcloud")) {install.packages("wordcloud"); library("wordcloud")}

# select a topic you want to plot
stm::cloud(fbslerka.stm.20, topic = 3, scale = c(2.25, .5))
```

![](https://i.imgur.com/QoKnUew.png)

Here is the word cloud of topic 17 in facebook2.

```r=
# select a topic you want to plot
stm::cloud(facebook2.stm.20, topic = 17, scale = c(2.25, .5))
```

![](https://i.imgur.com/evxRvQP.png)

### Topic shares of fbslerka and facebook2

```r=
plot(fbslerka.stm.20, type = "hist", topics = sample(1:topic.count, size = 9),
     main = "Histogram of the topic shares within the documents")
```

![](https://i.imgur.com/GYiEzPq.png)

```r=
plot(facebook2.stm.20, type = "hist", topics = sample(1:topic.count, size = 9),
     main = "Histogram of the topic shares within the documents")
```

### Topic terms of fbslerka

Method 1: view the most frequent terms of topics 11, 18 and 20 in fbslerka.

```r=
plot(fbslerka.stm.20, type = "labels", topics = c(20, 11, 18), main = "Topic terms")
```

![](https://i.imgur.com/uSO1Lc9.png)

Method 2: the same information printed with `labelTopics()`.

```r=
labelTopics(fbslerka.stm.20, c(20, 11, 18))
```

![](https://i.imgur.com/fNcKHmC.png)

### Topic contrasts of fbslerka and facebook2

View the contrast between topics 4 and 9 in fbslerka.

```r=
plot(fbslerka.stm.20, type = "perspectives", topics = c(4, 9), main = "Topic contrasts")
```

![](https://i.imgur.com/wyq8Nxu.png)

View the contrast between topics 3 and 8 in facebook2.

```r=
plot(facebook2.stm.20, type = "perspectives", topics = c(3, 8), main = "Topic contrasts")
```

![](https://i.imgur.com/VTUQUVo.png)

### References

1. [A field comes of age: tracking research on the internet within communication studies, 1994 to 2018](https://www.tandfonline.com/doi/full/10.1080/24701475.2020.1749805?scroll=top&needAccess=true)
2. [Converging on a nativist core? Comparing issues on the Facebook pages of the Pegida movement and the Alternative for Germany](https://journals.sagepub.com/doi/abs/10.1177/0267323120922068)
3. [Advancing Text Mining with R and quanteda](https://www.mzes.uni-mannheim.de/socialsciencedatalab/article/advancing-text-mining/#lda)
4. [stm: R Package for Structural Topic Models](https://cran.r-project.org/web/packages/stm/vignettes/stmVignette.pdf)

###### tags: `NLP` `PuschLab` `Uni Bremen` `Uni Basel`