Abbreviations used throughout

- minimal (M)
- typical (T)
- extended (E)
- CT: conspiracy theories

## Motivation, Implications, Gap, Novelty

## Research Questions

1. M Which sub-groups of conspiracy theory users emerge from their interactions?
2. M How do such groups differ in terms of the sociodemographic attributes of the users?
3. T How do they differ in terms of the conspiracy theories they discuss?
4. E To what degree are the interactions between conspiracy theory users statistically related to their sociodemographic and topical homophily?

## Assumptions and Hypotheses

## Data

Data sources: convokit and waller cover 2005--2018

- M r/conspiracy, January--June 2018
- T 56 subreddits from the list in [[@phadke2021]], January--June 2018
- E 2005--2018 or August 2021, plus exploratively add subreddits that might fit

## Features

### Sociodemographics

- M age, gender, affluence, political leaning from [[@waller2021]]
- T
- E + correlates of conspiracy thinking (academic education, ...) ==todo== review other indicators that are supposedly associated with CT

### Topics

- M
- T BERTopic
- E Named Entities/Subject-Verb-Object triples

## Analyses

- M
    - reply interaction network
    - community detection on the interaction network
    - sensitivity analysis for the community detection
    - topic extraction
    - topic interpretation
    - distribution of topics per user
    - distribution of topics per cluster
    - statistical analyses of the differences of topics per cluster
- T
    - reply network by subreddit
    - BERTopic
- E
    - validation of the newly extracted dimensions with self-disclosures and correlation with external ground truth on locations
    - dimensionality reduction of the bipartite user-NE network
    - logistic regression models

----

## Steps

1. collect reddit comment data via subreddits
    - minimally: r/conspiracy
    - standard: list of 56 subreddits
    - extended: exploratively add subreddits that might fit
2. extract reply network from comments (minimally: for 1 month in 2018; standard: for 2005--2018)
    1. filter users down to $n$ users with the following thresholds
        - at least x (25) submissions or comments in subreddits
        - submissions to at least 5 different subreddits in a given year
        - to exclude bots
            - need to make comments, not only submissions
            - not belonging to the list of known reddit bots
            - not having the string 'bot' in the user name
            - not having posted in more than 50 subreddits each month
    2. create matrix for directed graph: $n \times n$ matrix
    - estimated time: 3 weeks for all reply networks
    - speculation:
        - growing number of edges as conspiracy theories gain more popularity? (in [[@monti2023]] varying number of edges; increasing and decreasing)
        - around 5 to 6 (only r/conspiracy) or even 7 digit number of nodes
3. (optional) compute other socio-demographic dimensions on community level
    1. integrate dimensions from previous project (without necessarily using them, but to get insight)
    2. compute education status dimension
    3. integrate activity level $\sum_s N_{u,s}$ for every user
4. compute socio-demographic scores for users (for each user)
    1. get $N_{u,s}$, the number of submissions of user $u$ in subreddit $s$
    2. simultaneously add all subreddits of this user to subreddit list $S$
    3. when finished with all users, save all community scores
    - maybe more performant:
        - first of two matrices with all (4-7) community scores of all subreddits where at least one user posted one submission in the whole time span (for all years): $|\text{dim}| \times |S|$ matrix
        - second of two matrices with the number of submissions: $|S| \times n$ matrix (subreddits by users)
        - compute by multiplying and row sums
    - minimally: 4 dimensions from [[@waller2021]] (age, gender, affluence, political leaning)
    - standard: minimally + 2 dimensions (education, social)
    - extended:
        - including user meta data (activity level, variety of subreddits)
            - for social dimension **or**
            - validation
        - validation with external/content data
            - NLP: language use, variety of words, length of posts, emotional tone
            - validation of *education* (indicated through language use and self-disclosures of profession)
            - validation of *online social interaction* (comparison with social dimension from previous project and meta data)
5. identify clusters within reply network
    - ==todo==: comparing methods (Louvain, Girvan-Newman, other)
    1. apply Louvain method (modularity optimization) for clustering only on network structure
    - detect socio-demographic classes within or simultaneously to clusters? ==todo==
6. find classes with certain socio-demographic attributes / look at the distribution of socio-demographic scores for each cluster
    - compute means and optionally distributions of socio-demographic scores within clusters
    - to get a first impression of the influence of status homophily
7. create SD-model
    - ==todo== workload and what should actually be modified for our use-case?
8. create SD+t-model
    1. classify content with named entity recognition ==todo==

| Steps                                                                   | Planned time period |
| ----------------------------------------------------------------------- | ------------------- |
| 1. collect r/conspiracy comment data                                    | 1/2 to 1 week       |
| 2. extract reply network                                                | 1-2 weeks           |
| 3. compute socio-demographic scores for users                           | 1-2 weeks           |
| 4. identify clusters within reply network                               | 1 week              |
| 5. view socio-demographic mean and distribution clusters                | 1 week              |
| 6. logistic regression of interaction and SE characteristic attributes  | 2-3 weeks           |
| 7. Generate SD model                                                    | ==todo==            |
| 8. Classify content                                                     | ==todo==            |
| 9. Generate SD+t model                                                  | ==todo==            |
| 10. Writing                                                             | 4 weeks             |
| 11. Correcting                                                          | 1 week              |

- **17 weeks in total since formal registration** (including 1-2 weeks of Christmas holidays)

# Notes

### Extra extra extra

- temporal dimension
- test-retest pre-post covid
- cross-country

### reddit data

- 56 subreddits [[@phadke2021]]
    - they vary in how similar they are to *r/science* or *r/conspiracy*
    - fallback: *r/conspiracy* (currently 2 million members)
- timeline
    - proposal: 2018 - 2022
        - would capture the timeline over covid and the growth of conspiracy theories
        - also a 5 year span like in [[@monti2023]] about r/news
    - dependent on availability
- sources and availability
    - convokit: data via subreddit
        - https://convokit.cornell.edu/documentation/subreddit.html
        - from inception till October 2018 (but uploaded October 2019)
            - speaker-level
            - utterance-level
            - conversation-level
            - corpus-level (list of subreddits and other number)
        - correctness
            - completeness not guaranteed (beta version release)
            - reply-to ID may not match any utterance that exists
    - ~~praw (python package?)~~
    - pushshift datasets from 2005-06 till 2023-03 from [reddit post](https://www.reddit.com/r/pushshift/comments/146r0dx/historical_data_torrents_all_in_one_place/)
        - in zst-files
        - from 2005-06 to 2022-12 via [academic torrents](https://academictorrents.com/details/7c0645c94321311bb05bd879ddee4d0eba08aaee)
            - "**License:** No license specified, the work may be protected by copyright."

### socio-demographic dimensions - ideas

- proposal:
    - minimally: **age, gender, affluence, political leaning** from [[@waller2021]]
    - extra:
        - test how the three social dimensions from the previous project affect
        - extraction of
            1. **academic education** level (construct based on Waller & Anderson)
                - similar to previous project but reduced to academic status (sociologically: institutionalized cultural capital)
                - validation
                    - maybe not so necessary because we orient ourselves on subreddits of elite universities and subreddits of academic discourse
                    - otherwise: similar NLP validation as in previous project (for cultural capital)
            2. social interaction
                - from previous project
                - otherwise: only online social capital
                    - whether the person is in subreddits about being lonely, social anxiety etc. (oriented on previous project)
                    - oriented on activity and interaction level of user
                        - volume of postings, variety in subreddits
                        - maybe: interaction (in form of amount of upvotes and comments)
- on the one hand: socio-demographic dimensions on which conspiracy theorists are often labelled
    - age?
    - gender=male, education=low, political leaning=conservative, affluence=low, family status=unmarried, social interaction=weak/low, ethnicity=member of ethnic minority group [[@freeman2017]]
    - plus: education, family status (self-disclosures?), social interaction
- on the other hand: which socio-demographic attributes in terms of status-homophily?
    - explorative
    - strong status homophily in age and affluence in *r/news* [[@monti2023]] (from 4 dimensions: age, gender, affluence, political leaning of [[@waller2021]])
- availability:
    - 4 dimensions [[@waller2021]]: age, gender, affluence, political leaning
    - maybe from previous project: cultural and social capital
- **optional**: other non-socio-demographic dimensions from meta-data
    - activity-level of individual user
    - social reddit capital: how much do people upvote/comment on the user's submissions
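
A minimal sketch of the "maybe more performant" variant from the Steps section (community-score matrix times submission-count matrix). It assumes that a user's score on a dimension is the submission-weighted mean of the subreddit scores, which is only one possible reading of "multiplying and row sums". The function name `user_scores`, the nested-dict layout and the toy numbers are hypothetical; plain dicts stand in for the two matrices.

```python
from collections import defaultdict


def user_scores(community_scores, submission_counts):
    """Return ({user: {dimension: score}}, {user: activity level}).

    community_scores:  {dimension: {subreddit: score}}  (the dim x S matrix)
    submission_counts: {subreddit: {user: N_us}}         (the S x n matrix)
    """
    # Activity level per user: sum over subreddits s of N_{u,s}.
    activity = defaultdict(int)
    for per_user in submission_counts.values():
        for user, n in per_user.items():
            if n > 0:
                activity[user] += n

    scores = {user: {} for user in activity}
    for dim, score_of in community_scores.items():
        # One row of the matrix product: sum_s score[dim, s] * N[s, u].
        weighted = defaultdict(float)
        for subreddit, per_user in submission_counts.items():
            for user, n in per_user.items():
                weighted[user] += score_of[subreddit] * n
        # Assumption: normalise by the user's total number of submissions.
        for user, total in activity.items():
            scores[user][dim] = weighted[user] / total
    return scores, dict(activity)


# Toy values, not real community scores.
community_scores = {
    "age": {"r/a": 1.0, "r/b": -1.0},
    "gender": {"r/a": 0.5, "r/b": 0.5},
}
submission_counts = {
    "r/a": {"u1": 3, "u2": 1},
    "r/b": {"u1": 1},
}
scores, activity = user_scores(community_scores, submission_counts)
```

With real data the same computation would be a single sparse matrix product. The dict version only makes the indices explicit.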
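
A minimal sketch of the reply-network extraction and user filtering described in the Steps section. Comments are assumed to be dicts with the hypothetical keys `id`, `author` and `reply_to`. Only the activity threshold, the known-bot list and the 'bot' string filter are shown. Replies whose parent is missing are skipped, since the convokit notes mention that a reply-to ID may not match any existing utterance.

```python
from collections import Counter


def build_reply_edges(comments, min_posts=25, known_bots=frozenset()):
    """Return a Counter {(replier, replied_to): number of replies}."""
    author_of = {c["id"]: c["author"] for c in comments}
    posts = Counter(c["author"] for c in comments)

    def keep(user):
        # Thresholds from the notes: minimum activity, not a known bot,
        # and no 'bot' substring in the user name.
        return (
            posts[user] >= min_posts
            and user not in known_bots
            and "bot" not in user.lower()
        )

    edges = Counter()
    for c in comments:
        parent_author = author_of.get(c["reply_to"])
        if parent_author is None:
            continue  # top-level comment or dangling reply-to ID
        if keep(c["author"]) and keep(parent_author):
            edges[(c["author"], parent_author)] += 1
    return edges


# Toy comments, not real data; min_posts is lowered for the example.
comments = [
    {"id": "c1", "author": "alice", "reply_to": None},
    {"id": "c2", "author": "bob", "reply_to": "c1"},
    {"id": "c3", "author": "alice", "reply_to": "c2"},
    {"id": "c4", "author": "AutoModBot", "reply_to": "c1"},
    {"id": "c5", "author": "bob", "reply_to": "missing"},
    {"id": "c6", "author": "bob", "reply_to": "c1"},
]
edges = build_reply_edges(comments, min_posts=1)
```

The edge counter is the sparse form of the directed $n \times n$ matrix. With a 5 to 7 digit number of nodes, a dense matrix would likely not fit in memory.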