# Half-truths | Truth, Fiction and Conspiracy in the 'Post-Factual Age'
German title: Halbwahrheiten | Wahrheit, Fiktion und Konspiration im 'postfaktischen Zeitalter'
### Introduction to the research project
>Name: Ying Ying Lee
>Instructor: Prof. Dr. Cornelius Puschmann
>Field: computational linguistics
>Duration: 1 Sep 2020 to 31 Dec 2020
>Project details: https://halbwahrheiten.philhist.unibas.ch/de/home/
>(The data is in German)
Hello, I am Ying Ying Lee. I worked as a research assistant on this project, run jointly by the University of Bremen (my university) and the University of Basel.
The data sets used in this project are in German. The first data set, fbslerka, contains 12 selected Facebook pages: "Alternative für Deutschland AfD", "JUNGE FREIHEIT", "Dr. Frauke Petry", "Beatrix von Storch", "KOPP Online", "unzensuriert.at", "Netzfrauen", "COMPACT-Magazin", "KenFM", "RT Deutsch", "gegenargument.at" and "PI-News". The second data set, facebook2, contains 6 selected Facebook pages: "alternativefuerde", "faz", "pegidaevdresden", "sz", "welt" and "zeit".
The whole analysis is done in the programming language R and is organised into three parts: general setup and preprocessing, Latent Dirichlet Allocation (LDA) and structural topic models (STM).
If you have any feedback, please feel free to contact me at yingying.lee025@gmail.com. Thank you.
## General setup and preprocessing
```r=
#library("readtext")
library("quanteda")    # corpus construction, tokenisation, dfm, textstat_*
library("tidyverse")   # data wrangling and ggplot2
library("dplyr")
library("janeaustenr")
library("tidytext")    # tidy interfaces to text and topic models
theme_set(theme_minimal())
```
### Load data
```r=
load("fbslerka_kommentare.RData")
load("woerterbuch_quanteda2.RData")
```
### Build a corpus from fbslerka
```r=
# Build a quanteda corpus: each comment becomes a document, keyed by its id
fbslerka.kommentare.korpus <- corpus(fbslerka.kommentare, docid_field = "id", text_field = "message")
fbslerka.kommentare.korpus
docvars(fbslerka.kommentare.korpus)
```
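As a quick sanity check after building the corpus, its size and document variables can be inspected with quanteda's helpers; a minimal sketch:
```r=
# Sketch: basic corpus diagnostics.
ndoc(fbslerka.kommentare.korpus)            # number of comments in the corpus
summary(fbslerka.kommentare.korpus, n = 5)  # token/sentence counts for the first 5 documents
```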

### View word frequency
```r=
fbslerka.kommentare.dfm.init <- fbslerka.kommentare.korpus %>%
  tokens(remove_url = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE) %>%
  dfm(remove = c(stopwords("english"), stopwords("german")))
fbslerka.kommentare.stat <- fbslerka.kommentare.dfm.init %>%
  textstat_frequency()
head(fbslerka.kommentare.stat, 20)
```
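For a quick look without keeping the full frequency table, quanteda's topfeatures() returns the same top-20 view in a single call; a minimal sketch:
```r=
# Sketch: one-line alternative to textstat_frequency() for the most frequent features.
topfeatures(fbslerka.kommentare.dfm.init, n = 20)
```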

### Plot word frequency
```r=
ggplot(fbslerka.kommentare.stat[1:20, ], aes(x = reorder(feature, frequency), y = frequency)) +
  ggtitle("20 most frequent words in fbslerka") +
  geom_point() +
  coord_flip() +
  labs(x = NULL, y = "Frequency")
```
### Calculate relative frequency by page_name
```r=
# fbslerka.kommentare.weight is not defined above; create it here as a
# proportionally weighted dfm (relative frequencies, cf. reference 3)
fbslerka.kommentare.weight <- dfm_weight(fbslerka.kommentare.dfm.init, scheme = "prop")
fbslerka.kommentare.stat2 <- textstat_frequency(fbslerka.kommentare.weight, n = 10, groups = "page_name")
# plot frequencies
ggplot(data = fbslerka.kommentare.stat2, aes(x = factor(nrow(fbslerka.kommentare.stat2):1), y = frequency)) +
  geom_point() +
  facet_wrap(~ group, scales = "free", ncol = 4) +
  coord_flip() +
  scale_x_discrete(breaks = nrow(fbslerka.kommentare.stat2):1,
                   labels = fbslerka.kommentare.stat2$feature) +
  labs(x = NULL, y = "Relative frequency")
```

### Calculate the document-feature matrix and remove sentences with fewer than 40 tokens
Data set: fbslerka, grouped by page_name
(742265 -> 27006 documents)
```r=
docvars(fbslerka.kommentare.korpus, "monat") <- str_sub(fbslerka.kommentare.korpus$created_time, start = 1, end = 7)
selected_words <- readLines("./selected_words_182.txt")
stoppwoerter <- c(selected_words, stopwords("german"), stopwords("english"))
#load("fbslerka.kommentare.dfm.RData")  # alternatively, load the precomputed dfm instead of rebuilding it
# remove punctuation, URLs and stopwords before constructing the dfm
fbslerka.kommentare.dfm <- fbslerka.kommentare.korpus %>%
  corpus_trim(min_ntoken = 40) %>%
  #corpus_sample(size = 50000) %>%
  tokens(remove_url = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE) %>%
  tokens_remove(stoppwoerter) %>%
  #tokens_ngrams(1:3) %>%
  dfm(remove = stoppwoerter, groups = "page_name") %>%
  dfm_select(min_nchar = 3) %>%
  dfm_trim(min_termfreq = 2)
fbslerka.kommentare.dfm
head(dfm_sort(fbslerka.kommentare.dfm, decreasing = TRUE, margin = "both"), n = 12, nf = 10)
```
```
docs #comvsafd afd merkel frau volk politik flüchtlinge europa politiker islam
Alternative für Deutschland AfD 11233 2653 1019 764 656 577 659 489 520 361
Dr. Frauke Petry 0 1324 466 608 447 304 323 308 293 321
JUNGE FREIHEIT 0 450 471 355 349 265 210 240 278 416
Beatrix von Storch 44 586 196 406 157 118 121 154 116 171
KenFM 0 66 69 84 73 92 58 73 47 56
KOPP Online 0 67 174 87 165 117 90 121 101 52
[ reached max_ndoc ... 6 more documents ]
```
### View word frequency after cleaning fbslerka
```r=
fbslerka.kommentare.stat <- fbslerka.kommentare.dfm %>%
  textstat_frequency()
head(fbslerka.kommentare.stat, 50)
```

### The second data set: facebook2
```r=
load("facebook2.RData")
load("woerterbuch_quanteda2.RData")
# summary(woerterbuchK)
```
### Create a list of words to remove using frequency statistics
```r=
facebook2.korpus.dfm.init <- facebook2.korpus %>%
  tokens(remove_url = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE) %>%
  dfm(remove = c(stopwords("english"), stopwords("german")))
facebook2.korpus.stat <- facebook2.korpus.dfm.init %>%
  textstat_frequency()
head(facebook2.korpus.stat, 20)
```
### Calculate relative frequency by source
```r=
# facebook2.korpus.weight is not defined above; create it here as a
# proportionally weighted dfm (relative frequencies, cf. reference 3)
facebook2.korpus.weight <- dfm_weight(facebook2.korpus.dfm.init, scheme = "prop")
facebook2.korpus.stat2 <- textstat_frequency(facebook2.korpus.weight, n = 10, groups = "source")
# plot frequencies
ggplot(data = facebook2.korpus.stat2, aes(x = factor(nrow(facebook2.korpus.stat2):1), y = frequency)) +
  geom_point() +
  facet_wrap(~ group, scales = "free", ncol = 3) +
  coord_flip() +
  scale_x_discrete(breaks = nrow(facebook2.korpus.stat2):1,
                   labels = facebook2.korpus.stat2$feature) +
  labs(x = NULL, y = "Relative frequency")
```

### Calculate the document-feature matrix and remove sentences with fewer than 40 tokens
Data set: facebook2, grouped by month
(950647 -> 82426 documents)
```r=
selected_words <- readLines("./selected_words_182.txt")
stoppwoerter <- c(selected_words, stopwords("german"), stopwords("english"))
load("facebook2.RData")
docvars(facebook2.korpus, "monat") <- str_sub(facebook2.stats$created_time, start = 1, end = 7)
facebook2.dfm <- facebook2.korpus %>%
  corpus_trim(min_ntoken = 40) %>%
  #corpus_sample(size = 50000) %>%
  tokens(remove_url = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE) %>%
  tokens_remove(stoppwoerter) %>%
  #tokens_ngrams(1:3) %>%
  dfm(remove = stoppwoerter, groups = "monat") %>%
  dfm_select(min_nchar = 3) %>%
  dfm_trim(min_termfreq = 2)
#save(facebook2.dfm, file = "facebook2.dfm.RData")  # optionally cache the result
facebook2.dfm
head(dfm_sort(facebook2.dfm, decreasing = TRUE, margin = "both"), n = 10, nf = 20)
```
```
docs kinder frau flüchtlinge afd merkel politik frauen europa volk regierung
2016-01 269 714 575 585 601 386 789 247 340 299
2015-01 265 199 96 37 123 290 170 222 337 167
2015-09 360 316 802 60 288 214 194 391 229 205
2016-03 302 458 419 1075 478 372 293 342 247 203
2016-02 321 483 496 838 436 336 328 273 325 204
2015-11 242 292 478 72 319 248 193 327 250 329
[ reached max_ndoc ... 4 more documents, reached max_nfeat ... 10 more features ]
```
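Since the facebook2 dfm is grouped by month, overall activity per month can be read off as the dfm's row sums. A minimal sketch, assuming the facebook2.dfm object built above:
```r=
# Sketch: total token counts per month in the cleaned, grouped dfm.
monat.tokens <- rowSums(facebook2.dfm)
barplot(monat.tokens[order(names(monat.tokens))], las = 2, cex.names = 0.7,
        main = "Tokens per month in facebook2 (after cleaning)")
```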
## Latent Dirichlet Allocation (LDA)
```r=
library("gsl")
library("topicmodels")
library("stringr")
# Assign an arbitrary number of topics
topic.count <- 20
# Convert DFM to a topicmodels object
fbslerka.topicmodels <- convert(fbslerka.kommentare.dfm, to = "topicmodels")
facebook2.topicmodels <- convert(facebook2.dfm, to = "topicmodels")
```
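Since topic.count = 20 is an arbitrary choice, one way to sanity-check it is to compare model perplexity across several candidate values of k. This is only a rough guide (lower perplexity does not guarantee more interpretable topics), and refitting is slow; a minimal sketch, assuming the fbslerka.topicmodels object from above:
```r=
# Sketch: compare LDA fit across candidate topic counts via perplexity.
candidate.k <- c(10, 20, 30)
perplexity.by.k <- sapply(candidate.k, function(k) {
  model <- LDA(fbslerka.topicmodels, k = k, control = list(seed = 1234))
  perplexity(model)  # training-data perplexity; lower indicates better fit
})
data.frame(k = candidate.k, perplexity = perplexity.by.k)
```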
### Calculate the LDA model
```r=
#load('fbslerka.lda.20.RData')   # alternatively, load the precomputed models
#load('facebook2.lda.20.RData')  # instead of refitting them below
fbslerka.lda.20 <- LDA(fbslerka.topicmodels, k = topic.count)
facebook2.lda.20 <- LDA(facebook2.topicmodels, k = topic.count)
as.data.frame(terms(fbslerka.lda.20, 6))
as.data.frame(terms(facebook2.lda.20, 6))
```
Result for fbslerka:
```csvpreview {header="true"}
Topic 1,Topic 2,Topic 3,Topic 4,Topic 5,Topic 6,Topic 7,Topic 8,Topic 9
wasser,österreich,petry,#comvsafd,storch,cdu,bank,#comvsafd,#comvsafd
kaufen,zahlen,parteien,parteien,platz,usw,russland,cdu,letzten
fleisch,österreicher,hätten,linke,rassisten,kirche,ken,umfrage,cdu
nestle,grünen,altparteien,cdu,grenze,grünen,central,comvsafd,fdp
tiere,wien,eigene,petry,gewalt,unseren,kenfm,sonstige,linke
essen,asylanten,versuchen,grüne,grenzen,bevölkerung,syrien,grenzen,usw
```
(6 rows; topics 1-9 of 20 shown)
### Terms > Topics
The values in the beta matrix are the logged probabilities of a term's association with a topic (hence the large negative numbers); the gamma matrix in the next section gives the share of each topic in a document.
```r=
head(as.data.frame(t(fbslerka.lda.20@beta), row.names = fbslerka.lda.20@terms))
head(as.data.frame(t(facebook2.lda.20@beta), row.names = facebook2.lda.20@terms))
```
```csvpreview {header="true"}
Terms,Topic 1,Topic 2,Topic 3,Topic 4,Topic 5,Topic 6,Topic 7,Topic 8,Topic 9
weinende,-602.57,-10.60,-11.71,-533.69,-208.15,-493.37,-9.62,-11.82,-660.97
zigfachen,-693.72,-701.06,-705.97,-711.44,-670.27,-702.99,-10.24,-711.12,-716.80
sexuellen,-10.55,-11.01,-9.42,-9.98,-9.60,-9.05,-7.70,-9.51,-9.12
übergriffen,-10.55,-10.60,-12.53,-11.37,-10.30,-11.13,-8.03,-10.52,-9.21
menge,-8.47,-8.23,-9.12,-8.32,-7.84,-8.36,-8.15,-8.43,-8.26
schlägereien,-654.78,-11.70,-17.39,-538.42,-11.40,-653.08,-9.31,-678.33,-104.59
```
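Because beta stores logged probabilities, exponentiating recovers per-topic term probabilities; tidytext's tidy() does this automatically. A minimal sketch listing the top five terms per topic, assuming the fitted fbslerka.lda.20 model:
```r=
# Sketch: top terms per topic from the (exponentiated) beta matrix.
tidy(fbslerka.lda.20, matrix = "beta") %>%  # returns probabilities, not logs
  group_by(topic) %>%
  slice_max(beta, n = 5) %>%
  arrange(topic, desc(beta))
```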
### Documents > Topics
```r=
head(as.data.frame(fbslerka.lda.20@gamma, row.names = fbslerka.lda.20@documents))
head(as.data.frame(facebook2.lda.20@gamma, row.names = facebook2.lda.20@documents))
```
```csvpreview {header="true"}
Documents, V1, V2,V3,V4,V5,V6,V7,V8,V9
Alternative für Deutschland (AfD), 9.106133e-08, 9.106133e-08, 7.460871e-04, 1.845340e-01, 9.106135e-08, 9.106135e-08, 9.106134e-08, 1.418453e-01, 2.997342e-01
Beatrix von Storch, 3.477672e-07, 3.477672e-07, 1.711327e-02, 3.477671e-07, 8.963264e-01, 3.477671e-07, 3.477671e-07, 3.477671e-07, 3.477672e-07
COMPACT-Magazin, 8.989153e-07, 8.989153e-07, 8.989153e-07, 8.989153e-07, 8.989153e-07, 8.989153e-07, 8.989153e-07, 8.989153e-07, 8.989153e-07
Dr. Frauke Petry, 1.882329e-07, 1.882329e-07, 1.533403e-01, 1.882329e-07, 1.882329e-07, 1.882329e-07, 1.882329e-07, 1.882329e-07, 1.882329e-07
gegenargument.at, 2.737598e-06, 2.737598e-06, 2.737598e-06, 2.737598e-06, 2.737598e-06, 2.737598e-06, 2.737598e-06, 2.737598e-06, 2.737598e-06
JUNGE FREIHEIT, 1.995489e-07, 1.995489e-07, 1.028358e-04, 1.995490e-07, 1.995489e-07, 6.937254e-01, 1.995489e-07, 1.995490e-07, 1.995490e-07
```
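Each row of gamma sums to one across the 20 topics, so the dominant topic of each (grouped) document can be read off directly; a minimal sketch, assuming the fitted fbslerka.lda.20 model:
```r=
# Sketch: dominant topic for each grouped document.
data.frame(
  document = fbslerka.lda.20@documents,
  dominant_topic = apply(fbslerka.lda.20@gamma, 1, which.max)
)
```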
### Plot the similarity among topics
```r=
fbslerka.similarity <- as.data.frame(fbslerka.lda.20@beta) %>%
  scale() %>%
  dist(method = "euclidean") %>%
  hclust(method = "ward.D2")
#par(mar = c(0, 4, 4, 2))
plot(fbslerka.similarity,
     main = "LDA topic similarity of fbslerka by features (20 topics)",
     xlab = "",
     sub = "")
```

```r=
facebook2.similarity <- as.data.frame(facebook2.lda.20@beta) %>%
  scale() %>%
  dist(method = "euclidean") %>%
  hclust(method = "ward.D2")
#par(mar = c(0, 4, 4, 2))
plot(facebook2.similarity,
     main = "LDA topic similarity of facebook2 by features (20 topics)",
     xlab = "",
     sub = "")
```
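To turn a dendrogram into discrete groups of similar topics, the tree can be cut at a chosen number of clusters; a minimal sketch, assuming the fbslerka.similarity object from above (the choice of 5 clusters is arbitrary):
```r=
# Sketch: cut the hierarchical clustering into 5 groups of similar topics.
topic.clusters <- cutree(fbslerka.similarity, k = 5)
split(paste0("Topic ", seq_along(topic.clusters)), topic.clusters)
```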

## Structural topic models (STM)
STM makes it possible to include document metadata in the topic model, and it offers an alternative initialisation mechanism ("Spectral").
```r=
library(stm)
```
Convert the DFM with convert(), then fit the STM with stm().
### Topic proportion by page_name (fbslerka)
```r=
topic.count <- 20
fbslerka.stm <- convert(fbslerka.kommentare.dfm, to = "stm")
fbslerka.stm$meta$page_name2 <- as.factor(fbslerka.stm$meta$page_name) # 12 levels
fbslerka.stm.20 <- stm(
  fbslerka.stm$documents,
  fbslerka.stm$vocab,
  K = topic.count,
  prevalence = ~ page_name,
  data = fbslerka.stm$meta,
  init.type = "Spectral"
)
# labelTopics() needs the fitted model, so it comes after stm()
fbslerka.stm.labels <- labelTopics(fbslerka.stm.20, 1:topic.count)
prep.fbslerka.20 <- estimateEffect(1:topic.count ~ page_name2, fbslerka.stm.20,
                                   meta = fbslerka.stm$meta, uncertainty = "Global")
summary(prep.fbslerka.20, topics = 1)
par(mfrow = c(3, 2))
for (i in 1:20) {
  plot(prep.fbslerka.20, "page_name2", method = "pointestimate", topics = i,
       model = fbslerka.stm.20, printlegend = FALSE,
       main = paste0(fbslerka.stm.labels$prob[i, 1:3], collapse = ", "),
       xlab = "Topic Proportion", yaxt = "n", labeltype = "custom",
       custom.labels = fbslerka.stm$meta$page_name)
}
```
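As with LDA, K = 20 is an arbitrary choice; stm's searchK() compares held-out likelihood, residuals and semantic coherence across candidate values. A sketch, assuming the converted fbslerka.stm object from above (this is computationally expensive):
```r=
# Sketch: compare candidate topic counts with stm::searchK().
fbslerka.searchk <- searchK(
  fbslerka.stm$documents,
  fbslerka.stm$vocab,
  K = c(10, 20, 30),
  prevalence = ~ page_name,
  data = fbslerka.stm$meta,
  init.type = "Spectral"
)
plot(fbslerka.searchk)  # held-out likelihood, residuals and coherence by K
```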
### Topic proportion by month (facebook2)
```r=
facebook2.stm <- convert(facebook2.dfm, to = "stm")
#load('facebook2.stm.20.RData')  # alternatively, load the precomputed model instead of refitting
facebook2.stm.20 <- stm(
  facebook2.stm$documents,
  facebook2.stm$vocab,
  K = topic.count,
  prevalence = ~ monat,
  data = facebook2.stm$meta,
  init.type = "Spectral"
)
facebook2.stm.labels <- labelTopics(facebook2.stm.20, 1:topic.count)
facebook2.stm$meta$monat2 <- 1:18   # encode the 18 months as a numeric covariate
prep.facebook2.20 <- estimateEffect(1:topic.count ~ monat2, facebook2.stm.20,
                                    meta = facebook2.stm$meta, uncertainty = "Global")
summary(prep.facebook2.20, topics = 1)
par(mfrow = c(3, 3))
for (i in 1:20) {
  plot(prep.facebook2.20, "monat2", method = "continuous", topics = i,
       model = facebook2.stm.20, printlegend = FALSE,
       main = paste0(facebook2.stm.labels$prob[i, 1:3], collapse = ", "),
       ylab = "Topic Proportion", xaxt = "n")
  axis(side = 1, at = c(2, 10, 17), labels = c("2015-02", "2015-10", "2016-05"))
}
```

### Print the terms that appear in each topic
```r=
as.data.frame(t(labelTopics(fbslerka.stm.20, n = 10)$prob))
as.data.frame(t(labelTopics(facebook2.stm.20, n = 10)$prob))
```
```
V9 V10 V11 V12 V13 V14
bank wahrheit petry cdu mitschreiben absolut
ken freiheit parteien bevölkerung absolut glauben
central usw cdu grünen invasoren vorstellen
kenfm wählen altparteien kirche gemeint gegenüber
russland völker grenzen parteien glauben invasoren
bevölkerung bevölkerung wählen usw fazit übelste
syrien habt frauke unseren vorstellen nochmal
selber compact eigene passiert nochmal schützen
frieden lässt bevölkerung linken übelste eur
movie deutschlands grünen asylanten drogen gemeint
```
### Get the share of each topic in the overall corpus
```r=
plot(
  fbslerka.stm.20,
  type = "summary",
  text.cex = 0.7,
  main = "STM topic shares of fbslerka (20 topics)",
  xlab = "Share estimation"
)
```

```r=
plot(
  facebook2.stm.20,
  type = "summary",
  text.cex = 0.7,
  main = "STM topic shares of facebook2 (20 topics)",
  xlab = "Share estimation",
  xlim = c(0, .15)
)
```
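Beyond per-topic shares, stm can also estimate which topics tend to co-occur within the same documents; a minimal sketch, assuming the fitted fbslerka.stm.20 model (plotting the graph requires the igraph package):
```r=
# Sketch: estimate and plot correlations between topics.
fbslerka.corr <- topicCorr(fbslerka.stm.20)
plot(fbslerka.corr)  # nodes are topics; edges link positively correlated topics
```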
### Visualize the words of a topic with a word cloud
Here's the word cloud of topic 3 in fbslerka.
```r=
if (!require("wordcloud")) {install.packages("wordcloud"); library("wordcloud")}
# select a topic you want to plot
stm::cloud(fbslerka.stm.20,
           topic = 3,
           scale = c(2.25, .5))
```

Here's the word cloud of topic 17 in facebook2.
```r=
# select a topic you want to plot
stm::cloud(facebook2.stm.20,
           topic = 17,
           scale = c(2.25, .5))
```

### Topic shares of fbslerka, facebook2
```r=
plot(fbslerka.stm.20, type = "hist", topics = sample(1:topic.count, size = 9), main = "Histogram of the topic shares within the documents")
```

```r=
plot(facebook2.stm.20, type = "hist", topics = sample(1:topic.count, size = 9), main = "Histogram of the topic shares within the documents")
```
### Topic terms of fbslerka
Method 1
View the most frequent terms of topics 11, 18 and 20 in fbslerka.
```r=
plot(fbslerka.stm.20, type = "labels", topics = c(20, 11, 18), main = "Topic terms")
```

Method 2
An alternative, text-based view of the same topics.
```r=
labelTopics(fbslerka.stm.20,c(20, 11, 18))
```
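Note that labelTopics() reports four term rankings: prob (highest probability), frex (a frequency-exclusivity trade-off), lift and score. The FREX terms are often the most interpretable; a minimal sketch:
```r=
# Sketch: FREX terms weight a word's frequency against its exclusivity to the topic.
labelTopics(fbslerka.stm.20, c(20, 11, 18))$frex
```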

### Topic contrasts of fbslerka, facebook2
View the contrast between topics 4 and 9 in fbslerka.
```r=
plot(fbslerka.stm.20, type = "perspectives", topics = c(4,9), main = "Topic contrasts")
```

View the contrast between topics 3 and 8 in facebook2.
```r=
plot(facebook2.stm.20, type = "perspectives", topics = c(3,8), main = "Topic contrasts")
```

### References
1. [A field comes of age: tracking research on the internet within communication studies, 1994 to 2018](https://www.tandfonline.com/doi/full/10.1080/24701475.2020.1749805?scroll=top&needAccess=true)
2. [Converging on a nativist core? Comparing issues on the Facebook pages of the Pegida movement and the Alternative for Germany](https://journals.sagepub.com/doi/abs/10.1177/0267323120922068)
3. [Advancing Text Mining with R and quanteda](https://www.mzes.uni-mannheim.de/socialsciencedatalab/article/advancing-text-mining/#lda)
4. [stm: R Package for Structural Topic Models](https://cran.r-project.org/web/packages/stm/vignettes/stmVignette.pdf)
###### tags: `NLP` `PuschLab` `Uni Bremen` `Uni Basel`