CS1460 Final Project: Reddit Depression Detection

Project Overview

This project reimplements the paper "Detecting Symptoms of Depression on Reddit". The goal is to classify Reddit posts into symptom-related categories (e.g., anger, anxiety) or a control category, using two feature representations:

- LDA: Topic modeling to represent posts as distributions over topics.
- RoBERTa: Pre-trained transformer model for contextual embeddings of posts.

The implementation lives in a single Jupyter Notebook, Reddit Depression.ipynb. Intermediate outputs (e.g., the LDA model and the RoBERTa embeddings) are saved to disk, mostly as .pkl files, so that later runs can reload them instead of recomputing them.
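
The save-and-reuse pattern described above could look roughly like the sketch below. This is not the notebook's actual code: the helper name `load_or_compute` and the `compute` callback are hypothetical, and only the standard-library `pickle` module is assumed.

```python
import pickle
from pathlib import Path


def load_or_compute(path, compute):
    """Return the object pickled at `path`; if the file is missing,
    call `compute()`, pickle its result to `path`, and return it.

    `load_or_compute` is a hypothetical helper, not a function from the notebook.
    """
    path = Path(path)
    if path.exists():
        # Cached result found: skip the expensive step entirely.
        with path.open("rb") as f:
            return pickle.load(f)
    result = compute()
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result
```

A call such as `load_or_compute("combined_df.pkl", build_combined_df)` would then run the preprocessing only on the first execution, where `build_combined_df` stands in for whatever function produces the DataFrame.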

Links

Here is a video walkthrough of the project.

Here is a link to my notebook.

Structure of the Notebook:

Data Preprocessing:
- Tokenization, mapping subreddits to symptoms, and stop-word removal.
- Output: combined_df.pkl.
LDA Model Training:
- Trains LDA to represent posts as topic distributions.
- Output: lda_model.model, df_with_topics.pkl.
RoBERTa Embedding Generation:
- Generates contextual embeddings for each post using RoBERTa.
- Output: df_with_roberta.pkl.
Evaluation:
- Uses Random Forest and 5-fold cross-validation to evaluate both LDA and RoBERTa features.
- Output: final_results.pkl.
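
As an illustration of the Data Preprocessing step listed above, here is a minimal standard-library sketch of tokenization, subreddit-to-symptom mapping, and stop-word removal. The mapping entries, the stop-word list, the regex tokenizer, and the choice to label unmapped subreddits as control are all assumptions made for the example; the notebook may do each of these differently.

```python
import re

# Hypothetical mapping for illustration; the notebook's real mapping is not shown here.
SUBREDDIT_TO_SYMPTOM = {
    "Anger": "anger",
    "Anxiety": "anxiety",
    "insomnia": "sleep problem",
}

# Tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "i", "is", "it", "my", "the", "to"}

# Lowercase words, digits, and apostrophes count as token characters.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def preprocess(subreddit, text):
    """Tokenize a post, drop stop words, and attach a symptom label."""
    tokens = [t for t in TOKEN_RE.findall(text.lower()) if t not in STOP_WORDS]
    # Assumption: any subreddit missing from the mapping is treated as control.
    symptom = SUBREDDIT_TO_SYMPTOM.get(subreddit, "control")
    return {"symptom": symptom, "tokens": tokens}
```

Applying `preprocess` to every post and collecting the results into a DataFrame would give a table similar in spirit to combined_df.pkl.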

Outputs

Output File	Description
`combined_df.pkl`	Preprocessed data with symptoms mapped
`lda_model.model`	Trained LDA model
`df_with_topics.pkl`	Posts represented as LDA topic distributions
`df_with_roberta.pkl`	Posts represented as RoBERTa embeddings
`final_results.pkl`	ROC AUC scores for each symptom and method

Results

Performance (ROC AUC per symptom, from the Random Forest evaluation):

Symptom	LDA	RoBERTa
anger	0.819849	0.915685
anhedonia	0.946179	0.956880
anxiety	0.884495	0.956304
disordered eating	0.915986	0.957470
loneliness	0.808022	0.908328
sad mood	0.788084	0.911753
self-loathing	0.864496	0.938799
sleep problem	0.933253	0.976046
somatic complaint	0.875075	0.934806
worthlessness	0.697055	0.902052
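
To make the scores in the table easier to interpret, the sketch below computes ROC AUC from scratch, assuming each symptom is scored as a binary problem (1 = symptom post, 0 = control post). It is a plain pairwise implementation written for clarity, not the routine used in the notebook, and the function name `roc_auc` is chosen only for this example.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one.
    Tied scores count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    # Compare every positive score against every negative score.
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Under this reading, a score of 0.5 corresponds to random ranking and 1.0 to a perfect separation of symptom posts from control posts.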

Observations:
- RoBERTa embeddings outperform LDA on all ten symptoms. The gap is largest for worthlessness (0.902 vs. 0.697) and smallest for anhedonia (0.957 vs. 0.946).