Abbreviations used throughout
- minimal (M)
- typical (T)
- extended (E)
- conspiracy theories (CT)
## Motivation, Implications, Gap, Novelty
## Research Questions
1. M Which sub-groups of conspiracy theory users emerge from their interactions?
2. M How do such groups differ in terms of the sociodemographic attributes of the users?
3. T How do they differ in terms of the conspiracy theories they discuss?
4. E To what degree are the interactions between conspiracy theory users statistically related to their sociodemographic and topical homophily?
## Assumptions and Hypotheses
## Data
Data sources: ConvoKit and Waller cover 2005--2018
- M r/conspiracy, January--June 2018
- T 56 subreddits from the list in [[@phadke2021]], January--June 2018
- E 2005--2018 or August 2021, plus exploratively add subreddits that might fit
## Features
### Sociodemographics
- M age, gender, affluence, political leaning from [[@waller2021]]
- T
- E + correlates of conspiracy thinking (academic education, ...) ==todo== review other indicators that are supposedly associated with CT
### Topics
- M
- T BERTopic topics
- E Named Entities/Subject-Verb-Object triples
## Analyses
- M
- reply interaction network
- community detection on the interaction network
- sensitivity analysis for the community detection
- topic extraction
- topic interpretation
- distribution of topics per user
- distribution of topics per cluster
- statistical analyses of the differences in topics per cluster
- T
- reply network by subreddit
- BERTopic topics
- E
- validation of the newly extracted dimensions with self-disclosures and correlation with external ground truth on locations
- dimensionality reduction of the bipartite user-NE network
- logistic regression models
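The plan does not yet name a test for "statistical analyses of the differences in topics per cluster". One candidate, assumed here purely for illustration, is a Pearson chi-square test of independence on the cluster-by-topic count table. The sketch below computes only the statistic and its degrees of freedom; the cluster labels, topic labels, and counts are hypothetical toy values, not results.

```python
from collections import Counter


def topic_counts_per_cluster(assignments, clusters, topics):
    """Build a cluster x topic count table from (cluster, topic) pairs."""
    counts = Counter(assignments)
    return [[counts[(c, t)] for t in topics] for c in clusters]


def chi_square_statistic(table):
    """Return (Pearson chi-square statistic, degrees of freedom).

    `table` is a list of rows (clusters), each a list of topic counts.
    Assumes every row total and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of cluster and topic.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical toy data: one (cluster, topic) pair per comment.
assignments = (
    [("A", "topic_0")] * 30 + [("A", "topic_1")] * 10
    + [("B", "topic_0")] * 10 + [("B", "topic_1")] * 30
)
table = topic_counts_per_cluster(assignments, ["A", "B"], ["topic_0", "topic_1"])
stat, dof = chi_square_statistic(table)
print(table, stat, dof)
```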
----
## Steps
1. collect reddit comment data via subreddits
- minimally: r/conspiracy
- standard: list 56 subreddits
- extended: exploratively add subreddits that might fit
3. extract reply network from comments (minimally: for 1 month in 2018; standard: for 2005--2018)
1. filter users down to $n$ users with the following thresholds
- at least x (25) submissions or comments in subreddits
- submissions to at least 5 different subreddits in a given year
- to exclude bots, a user must
- have made comments, not only submissions
- not belong to the list of known reddit bots
- not have the string 'bot' in the username
- not have posted in more than 50 subreddits each month
2. create the $n \times n$ adjacency matrix of the directed graph
- estimated time: 3 weeks for all reply networks
- speculation:
- growing number of edges as conspiracy theories gain more popularity? (in [[@monti2023]] the number of edges varies, both increasing and decreasing)
- number of nodes with around 5 to 6 digits (only r/conspiracy) or even 7 digits
4. (optional) compute other socio-demographic dimensions on community level
1. integrate dimensions from previous project (without necessarily using them, but to get insight)
2. compute education status dimension
3. integrate activity level $\sum_s N_{u,s}$ for every user
5. compute socio-demographic scores for users (for each user)
1. get $N_{u,s}$ number of submissions of user $u$ in subreddit $s$
2. simultaneously add all subreddits of this user to the subreddit list $S$
3. when finished with all users, save all community scores
- maybe more performant:
- first of two matrices: all (4-7) community scores of every subreddit in which at least one user posted at least one submission over the whole time span (all years), of size $|dim| \times |s|$
- second of two matrices: number of submissions per subreddit and user, of size $|s| \times |u|$
- compute by multiplying the two matrices and taking row sums
- minimally: 4 dimensions from [[@waller2021]] (age, gender, affluence, political leaning)
- standard: minimally + 2 dimensions (education, social)
- extended:
- including user meta data (activity level, variety of subreddits)
- for social dimension **or**
- validation
- validation with external/content data
- NLP: language use, variety of words, length of posts, emotional tone
- validation of *education* (indicated through language use and self-disclosures of profession)
- validation of *online social interaction* (comparison with social dimension from previous project and meta data)
6. identify clusters within reply network
- ==todo==: compare methods (Louvain, Girvan-Newman, other)
1. apply the Louvain method (modularity optimization) for clustering based only on network structure
- detect socio-demographic classes within or simultaneously to clusters? ==todo==
7. find classes of characteristic socio-demographic attributes / look at the distribution of socio-demographic scores for each cluster
- compute means and optionally distributions of socio-demographic scores within clusters
- to get a first impression on influence of status homophily
8. create SD-model
- ==todo== workload and what should be actually modified for our use-case?
9. create SD+t-model
1. classify content with named entity recognition ==todo==
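The matrix formulation in step 5 can be sketched as follows. The notes only say "multiplying and row sums", so the exact aggregation is an assumption here: each user's score on dimension $d$ is taken to be the activity-weighted mean of the community scores, $\sum_s c_{d,s} N_{u,s} / \sum_s N_{u,s}$, where the denominator is the activity level from step 4. The function name and all numbers are hypothetical.

```python
def user_scores(community_scores, submission_counts):
    """Activity-weighted mean community score per user (assumed aggregation).

    community_scores:  |dim| x |s| matrix, one row per dimension.
    submission_counts: |s| x |u| matrix, entry [s][u] = N_{u,s}.
    Returns (|dim| x |u| score matrix, per-user activity levels).
    """
    n_users = len(submission_counts[0])
    # Activity level of each user: sum of N_{u,s} over all subreddits s.
    activity = [sum(row[u] for row in submission_counts) for u in range(n_users)]
    scores = []
    for dim_scores in community_scores:
        row = []
        for u in range(n_users):
            weighted = sum(
                score * counts[u]
                for score, counts in zip(dim_scores, submission_counts)
            )
            # Users without any submissions get no score.
            row.append(weighted / activity[u] if activity[u] else None)
        scores.append(row)
    return scores, activity


# Hypothetical toy input: 2 dimensions, 3 subreddits, 2 users.
community_scores = [
    [-1.0, 1.0, 0.0],
    [0.5, 0.5, -0.5],
]
submission_counts = [
    [3, 0],
    [1, 2],
    [0, 2],
]
scores, activity = user_scores(community_scores, submission_counts)
print(scores, activity)
```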
| Steps | Planned time period |
| ---------------------------------------------------------------------- | ------------------- |
| 1. collect r/conspiracy comment data | 1/2 to 1 week |
| 2. extract reply network | 1-2 weeks |
| 3. compute socio-demographic scores for users | 1-2 weeks |
| 4. identify clusters within reply network | 1 week |
| 5. view socio-demographic means and distributions per cluster | 1 week |
| 6. logistic regression of interaction and SE characteristic attributes | 2-3 weeks |
| 7. generate SD model | ==todo== |
| 8. classify content | ==todo== |
| 9. generate SD+t model | ==todo== |
| 10. writing | 4 weeks |
| 11. correcting | 1 week |
- **17 weeks in total since formal registration** (including 1-2 weeks of Christmas holidays)
# Notes
### Extra extra extra
- temporal dimension
- test-retest pre-post covid
- cross-country
### reddit data
- 56 subreddits [[@phadke2021]]
- subreddits vary in how similar they are to *r/science* or to *r/conspiracy*
- fallback: *r/conspiracy* (currently 2 million members)
- timeline
- proposal: 2018 - 2022
- would capture the timeline over covid and growth of conspiracy theories
- also 5 year span like in [[@monti2023]] about r/news
- dependent on availability
- sources and availability
- convokit: data via subreddit
- https://convokit.cornell.edu/documentation/subreddit.html
- from inception until October 2018 (but uploaded October 2019)
- speaker-level
- utterance-level
- conversation-level
- corpus-level (list of subreddits and other number)
- correctness
- completeness not guaranteed (beta version release)
- reply-to ID may not match any utterance that exists
- ~~praw (python package?)~~
- Pushshift datasets from 2005-06 to 2023-03 from [reddit post](https://www.reddit.com/r/pushshift/comments/146r0dx/historical_data_torrents_all_in_one_place/)
- in zst-files
- from 2005-06 to 2022-12 via [academic torrents](https://academictorrents.com/details/7c0645c94321311bb05bd879ddee4d0eba08aaee)
- "**License:** No license specified, the work may be protected by copyright."
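Because a reply-to ID may not match any existing utterance, the reply network extraction has to tolerate dangling parents. The sketch below counts directed reply edges from newline-delimited JSON comments. It assumes Pushshift-style fields (`id`, `author`, `parent_id` with a `t1_` prefix for comment parents and `t3_` for submission parents) and assumes edges point from the replier to the parent's author. Decompressing the zst files is left out, and the sample records are hypothetical.

```python
import json
from collections import Counter


def reply_edges(comment_lines):
    """Count directed reply edges (replier -> parent author).

    Returns (edge counter, number of comments whose parent is missing).
    """
    comments = [json.loads(line) for line in comment_lines]
    author_by_id = {c["id"]: c["author"] for c in comments}
    edges = Counter()
    dangling = 0
    for c in comments:
        parent = c["parent_id"]
        if not parent.startswith("t1_"):
            continue  # top-level comment: the parent is a submission (t3_)
        parent_author = author_by_id.get(parent[3:])
        if parent_author is None:
            dangling += 1  # parent comment is absent from the data
            continue
        src, dst = c["author"], parent_author
        if "[deleted]" in (src, dst) or src == dst:
            continue  # skip deleted accounts and self-replies
        edges[(src, dst)] += 1
    return edges, dangling


# Hypothetical sample records, one JSON object per line.
lines = [
    '{"id": "c1", "author": "alice", "parent_id": "t3_s1"}',
    '{"id": "c2", "author": "bob", "parent_id": "t1_c1"}',
    '{"id": "c3", "author": "alice", "parent_id": "t1_c2"}',
    '{"id": "c4", "author": "bob", "parent_id": "t1_c1"}',
    '{"id": "c5", "author": "carol", "parent_id": "t1_missing"}',
    '{"id": "c6", "author": "[deleted]", "parent_id": "t1_c1"}',
]
edges, dangling = reply_edges(lines)
print(dict(edges), dangling)
```

The dangling count gives a rough indication of how incomplete a given dump is for the chosen subreddits and time window.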
### socio-demographic dimensions
- ideas
- proposal:
- minimally: **age, gender, affluence, political leaning** from [[@waller2021]]
- extra:
- test how do the three social dimensions from previous project affect
- extraction of
1. **academic education** level (construct based on Waller & Anderson)
- similar to previous project but reduction to academic status (sociologically: institutionalized cultural capital)
- validation
- maybe not so necessary because we orient ourselves on subreddits of elite universities and subreddits of academic discourse
- otherwise: similar NLP validation as in previous project (for cultural capital)
2. social interaction
- from previous project
- otherwise: only online social capital
- whether the person is in subreddits about being lonely, social anxiety etc. (oriented on previous project)
- oriented on activity and interaction level of user
- volume of postings, variety in subreddits
- maybe: interaction (in form of amount of upvotes and comments)
- on the one hand: socio-demographic dimensions on which conspiracy theorists are often labelled
- age?
- gender=male, education=low, political leaning=conservative, affluence=low, family status=unmarried, social interaction=weak/low, ethnicity=member of an ethnic minority group [[@freeman2017]]
- plus: education, family status (self-disclosures?), social interaction
- on the other hand: which socio-demographic attributes in terms of status-homophily?
- explorative
- strong status homophily in age and affluence in *r/news* [[@monti2023]] (from 4 dimensions: age, gender, affluence, political leaning of [[@waller2021]])
- availability:
- 4 dimensions [[@waller2021]]: age, gender, affluence, political leaning
- maybe from previous project: cultural and social capital
- **optional**: other non-socio-demographic dimensions from meta-data
- activity-level of individual user
- social reddit capital: how much do people upvote/comment on the user's submissions
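The optional metadata dimensions could be derived in a single pass over a user's posts. This is a minimal sketch under stated assumptions: each record has `author`, `subreddit`, and `score` fields, the mean `score` (net votes) stands in for "social reddit capital", and the number of replies received is not covered. The function name and sample records are hypothetical.

```python
from collections import defaultdict


def user_metadata(posts):
    """Per-user activity level, subreddit variety, and mean received score."""
    activity = defaultdict(int)
    subreddits = defaultdict(set)
    score_sum = defaultdict(int)
    for post in posts:
        user = post["author"]
        activity[user] += 1                      # volume of postings
        subreddits[user].add(post["subreddit"])  # variety in subreddits
        score_sum[user] += post["score"]         # net votes received
    return {
        user: {
            "activity": activity[user],
            "variety": len(subreddits[user]),
            "mean_score": score_sum[user] / activity[user],
        }
        for user in activity
    }


# Hypothetical sample records.
posts = [
    {"author": "alice", "subreddit": "conspiracy", "score": 4},
    {"author": "alice", "subreddit": "science", "score": 2},
    {"author": "alice", "subreddit": "conspiracy", "score": 0},
    {"author": "bob", "subreddit": "conspiracy", "score": -1},
]
features = user_metadata(posts)
print(features)
```

The activity counts are the same quantities as $\sum_s N_{u,s}$ in the steps, so they can be reused as weights when computing the user scores.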