# CS 4641 Proposal
## Introduction
Members of online communities frequently discuss their interests through chat messages. Characteristics of these messages---such as the topics discussed, grammar, and style---can relate to a person's background and personality. We would like to infer characteristics of online users by analyzing their chat messages. Past work in natural language processing focuses, for example, on classifying whether a message is positive or negative, a task called sentiment analysis [^1]. A common technique for such classification is to use _embeddings_, that is, to first preprocess text into a fixed-length numeric representation. Once generated, the embeddings can serve as input features to a fully supervised deep neural network trained end-to-end. SentEval [^1] provides benchmarks and common datasets to measure the performance of such embeddings.
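To illustrate the embedding idea, the following sketch maps messages of arbitrary length to fixed-length vectors. The hashing scheme here is only a toy stand-in for a learned sentence encoder; the function name and dimension are illustrative, not part of any particular library.

```python
import numpy as np

def embed(message: str, dim: int = 16) -> np.ndarray:
    """Toy stand-in for a sentence embedding: hash each token into a
    fixed-length vector. A real pipeline would use a pre-trained encoder."""
    v = np.zeros(dim)
    tokens = message.lower().split()
    for token in tokens:
        v[hash(token) % dim] += 1.0
    return v / max(len(tokens), 1)

# Every message, regardless of length, maps to the same fixed dimension,
# so the embeddings can feed a standard supervised classifier.
e1 = embed("I love machine learning")
e2 = embed("short")
assert e1.shape == e2.shape == (16,)
```

The key property is the fixed output dimension: downstream models can then treat text as ordinary feature vectors.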
In this project, we will analyze Discord messages from the public Georgia Tech Discord server to relate the style and content of messages to the interests of their authors. Specifically, we seek to answer binary classification questions about the author based on facts about them and their interests, such as whether the author is a computer science major or watches anime. Conveniently, the platform gives users the ability to label themselves by major and interest---for example, through tags, their self-written biography, and other context. We can therefore automatically annotate messages based on the known major and interests of the author without manually labeling individual messages.
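The automatic annotation step can be sketched as follows. The user records, field names, and the example classification question are all hypothetical; in practice they would come from Discord role tags and biographies.

```python
# Hypothetical self-reported user profiles; field names are illustrative.
users = {
    "alice": {"major": "CS", "interests": {"anime"}},
    "bob": {"major": "ME", "interests": {"hiking"}},
}
messages = [
    {"author": "alice", "text": "just finished my ML homework"},
    {"author": "bob", "text": "anyone up for a trail run?"},
]

def label(msg: dict, question) -> dict:
    """Attach a binary label derived from the author's profile,
    so no message needs to be labeled by hand."""
    return {"text": msg["text"], "label": question(users[msg["author"]])}

# Example binary question: is the author a computer science major?
dataset = [label(m, lambda u: u["major"] == "CS") for m in messages]
assert [d["label"] for d in dataset] == [True, False]
```

Because the label is a function of the author rather than the message text, any binary question about the author's tags or interests yields a labeled dataset for free.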
## Problem Definition
We would like to classify users by their interests based on the messages they send on online chat platforms. While this particular formulation is novel, the problem touches on real-world applications: constructing demographic profiles of online users is a rapidly growing field, and big tech companies like Google and Facebook collect vast amounts of user data to personalize their advertising campaigns for specific users.
## Methods
Often, state-of-the-art results on text data use pre-trained so-called *foundation* models, for example, OpenAI's GPT-3[^2] and DeepMind's Chinchilla[^3]. These models are trained on huge text corpora of general data scraped from Wikipedia, Reddit, news articles, and other internet sources. We plan to exploit transfer learning, that is, use such foundation models to project text sequence data into a lower-dimensional embedding in the ambient real vector space. We can then train a simple fully connected neural network from the low-dimensional embedding to the final prediction, possibly using a sigmoid or softmax output layer to learn a probability distribution for the desired binary classification.
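A minimal sketch of the classification head, assuming the foundation model's embedding is already computed and frozen. The layer sizes and random weights below are placeholders; a real head would be trained with gradient descent on the labeled messages.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_head(embedding, w1, b1, w2, b2):
    """Small fully connected head on a frozen embedding: one hidden
    layer with ReLU, then a sigmoid for a binary class probability."""
    h = np.maximum(embedding @ w1 + b1, 0.0)  # hidden layer
    logit = h @ w2 + b2
    return 1.0 / (1.0 + np.exp(-logit))       # probability in (0, 1)

d, hdim = 32, 8  # illustrative embedding and hidden dimensions
w1, b1 = rng.normal(size=(d, hdim)), np.zeros(hdim)
w2, b2 = rng.normal(size=hdim), 0.0

p = mlp_head(rng.normal(size=d), w1, b1, w2, b2)
assert 0.0 < p < 1.0  # valid probability for the binary question
```

Keeping the foundation model frozen and training only this small head is what makes the transfer-learning approach cheap relative to training end-to-end.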
## Potential Results and Discussion
Because we have a huge source of labeled messages, we can directly assess the accuracy of our model's predictions on unseen data. Additionally, we would like to evaluate the extent to which our model generalizes to messages in other platforms: Reddit has separate communities dedicated to specific topics, which would give us additional labeled data.
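The primary evaluation metric above is straightforward to compute; a one-function sketch, with hypothetical model outputs:

```python
def accuracy(predictions, labels):
    """Fraction of held-out messages classified correctly."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical model outputs vs. ground-truth labels on unseen messages.
preds = [True, False, True, True]
truth = [True, False, False, True]
assert accuracy(preds, truth) == 0.75
```

The same function applies unchanged to the cross-platform experiment: predictions on Reddit messages are scored against labels derived from the subreddit's topic.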
## Timeline and Responsibilities
Find our Gantt chart [here][1].
[1]: https://docs.google.com/spreadsheets/d/1xF0QfyKEMA8j1hSQX0l0A4f8lE0F2F2ChQXe_TfYFhU/edit?usp=sharing
## Contribution Table
| Group member |Tasks |
|---------------|-------------------------------------------------|
| Stephen Huan | Literature review and methods; website; video. |
| Daniel Lu | Introduction; timeline / Gantt chart; video. |
| Dae Seon Park | Literature review; timeline / Gantt chart.      |
| Adam Zamlynny | Introduction; methods; timeline / Gantt chart. |
| Max Zhang | Problem definition; potential results; website. |
Additionally, each member worked on the contribution table.
## References
[^1]: A. Conneau and D. Kiela, "SentEval: An Evaluation Toolkit for Universal Sentence Representations." arXiv, Mar. 14, 2018. Accessed: Oct. 08, 2022. [Online]. Available: http://arxiv.org/abs/1803.05449
[^2]: T. B. Brown et al., "Language Models are Few-Shot Learners." arXiv, Jul. 22, 2020. Accessed: Oct. 08, 2022. [Online]. Available: http://arxiv.org/abs/2005.14165
[^3]: J. Hoffmann et al., "Training Compute-Optimal Large Language Models." arXiv, Mar. 29, 2022. Accessed: Oct. 07, 2022. [Online]. Available: http://arxiv.org/abs/2203.15556