CS1460 Final Project: Reddit Depression Detection

Project Overview

This project reimplements the paper Detecting Symptoms of Depression on Reddit. The goal is to classify Reddit posts into symptom-related categories (e.g., anger, anxiety) or a control class, using two feature representations:

  • LDA: Topic modeling to represent posts as distributions over topics.
  • RoBERTa: Pre-trained transformer model for contextual embeddings of posts.

The implementation lives in a single Jupyter Notebook, Reddit Depression.ipynb. Intermediate outputs (e.g., LDA models, embeddings) are saved to disk, mostly as .pkl files, so that later stages can reuse them instead of recomputing them.
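
The save-and-reuse pattern described above can be sketched as follows. This is a minimal illustration using only the standard library; the helper name `load_or_compute` is hypothetical and not taken from the notebook.

```python
import pickle
from pathlib import Path


def load_or_compute(path, compute):
    """Return the object cached at `path`, computing and saving it on a miss.

    Hypothetical helper for illustration; not the notebook's own code.
    """
    path = Path(path)
    if path.exists():
        # Cache hit: skip the expensive step entirely.
        with path.open("rb") as f:
            return pickle.load(f)
    # Cache miss: run the expensive step once and store its result.
    result = compute()
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result
```

A call such as `load_or_compute("combined_df.pkl", build_dataframe)` would then run `build_dataframe` (a placeholder name for whatever builds the preprocessed data) only on the first execution.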

Here is a video walkthrough of the project.

Here is a link to my notebook.

Structure of the Notebook:

  • Data Preprocessing:
    • Tokenization, mapping subreddits to symptoms, and stop-word removal.
    • Output: combined_df.pkl.
  • LDA Model Training:
    • Trains LDA to represent posts as topic distributions.
    • Output: lda_model.model, df_with_topics.pkl.
  • RoBERTa Embedding Generation:
    • Generates contextual embeddings for each post using RoBERTa.
    • Output: df_with_roberta.pkl.
  • Evaluation:
    • Uses Random Forest and 5-fold cross-validation to evaluate both LDA and RoBERTa features.
    • Output: final_results.pkl.
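
The evaluation step can be sketched with scikit-learn as below. This is a hedged illustration, not the notebook's code: the feature matrix is synthetic (from `make_classification`), and the binary symptom-vs-control framing, the stratified shuffled folds, and all hyperparameters are assumptions. In the real pipeline, `X` would hold either the LDA topic distributions or the RoBERTa embeddings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for post features (assumption): 1 = symptom, 0 = control.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Hyperparameters here are placeholders, not the notebook's settings.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One ROC AUC score per fold; the mean summarizes the feature set.
scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
mean_auc = scores.mean()
print(f"mean ROC AUC over 5 folds: {mean_auc:.3f}")
```

Running the same loop once per symptom and once per feature type would yield one score for each symptom and method.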

Outputs

Output File          Description
combined_df.pkl      Preprocessed data with symptoms mapped
lda_model.model      Trained LDA model
df_with_topics.pkl   Posts represented as LDA topic distributions
df_with_roberta.pkl  Posts represented as RoBERTa embeddings
final_results.pkl    ROC AUC scores for each symptom and method
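
For reference, ROC AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The pure-Python sketch below computes it by direct pairwise comparison. The function name `roc_auc` and the toy labels and scores are illustrative only; in practice a library routine would normally be used.

```python
def roc_auc(labels, scores):
    """ROC AUC by pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example (made-up values): 1 = symptom post, 0 = control post.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
auc = roc_auc(labels, scores)
print(auc)  # 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect separation.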

Results

  • Performance:

    Symptom            LDA       RoBERTa
    anger              0.819849  0.915685
    anhedonia          0.946179  0.956880
    anxiety            0.884495  0.956304
    disordered eating  0.915986  0.957470
    loneliness         0.808022  0.908328
    sad mood           0.788084  0.911753
    self-loathing      0.864496  0.938799
    sleep problem      0.933253  0.976046
    somatic complaint  0.875075  0.934806
    worthlessness      0.697055  0.902052
  • Observations:

    • RoBERTa embeddings outperform LDA on all ten symptoms, with the largest gain on worthlessness (0.697 vs. 0.902).
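
The per-symptom gap between the two methods can be checked directly from the table. The sketch below copies the reported scores and ranks symptoms by RoBERTa minus LDA; the variable names are illustrative.

```python
# Scores copied from the results table: symptom -> (LDA, RoBERTa).
results = {
    "anger": (0.819849, 0.915685),
    "anhedonia": (0.946179, 0.956880),
    "anxiety": (0.884495, 0.956304),
    "disordered eating": (0.915986, 0.957470),
    "loneliness": (0.808022, 0.908328),
    "sad mood": (0.788084, 0.911753),
    "self-loathing": (0.864496, 0.938799),
    "sleep problem": (0.933253, 0.976046),
    "somatic complaint": (0.875075, 0.934806),
    "worthlessness": (0.697055, 0.902052),
}

# Improvement of RoBERTa over LDA for each symptom.
gaps = {symptom: roberta - lda for symptom, (lda, roberta) in results.items()}

# Print symptoms from largest to smallest improvement.
for symptom, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{symptom:<18} {gap:+.3f}")

largest = max(gaps, key=gaps.get)
smallest = min(gaps, key=gaps.get)
```

Sorting by gap makes it easy to see where contextual embeddings help most and where topic features are already close.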