# CS 4641 Proposal
## Introduction
Members of online communities frequently discuss their interests through chat messages. Characteristics of these messages---such as the topics discussed, grammar, and style---can relate to a person's background and personality. We would like to infer characteristics of online users by analyzing their chat messages. Past work in natural language processing focuses, for example, on classifying whether a message is positive or negative, a task called sentiment analysis [^1]. A common technique for such classification is to use _embeddings_, that is, to first preprocess text into a fixed-length numeric representation. Once generated, the embeddings can serve as input features to a fully supervised deep neural network trained end-to-end. SentEval [^1] provides benchmarks and common datasets to measure the performance of such embeddings.
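To illustrate the embedding idea, the following sketch maps messages of arbitrary length to fixed-length vectors. The hashing scheme here is only a toy stand-in for a learned sentence encoder; the function name and dimension are illustrative, not part of any particular library.

```python
import numpy as np

def embed(message: str, dim: int = 16) -> np.ndarray:
    """Toy stand-in for a sentence embedding: hash each token into a
    fixed-length vector. A real pipeline would use a pre-trained encoder."""
    v = np.zeros(dim)
    tokens = message.lower().split()
    for token in tokens:
        v[hash(token) % dim] += 1.0
    return v / max(len(tokens), 1)

# Every message, regardless of length, maps to the same fixed dimension,
# so the embeddings can feed a standard supervised classifier.
e1 = embed("I love machine learning")
e2 = embed("short")
assert e1.shape == e2.shape == (16,)
```

The key property is the fixed output dimension: downstream models can then treat text as ordinary feature vectors.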
In this project, we will analyze Discord messages from the public Georgia Tech Discord server to relate the style and content of messages to the interests of their authors. Specifically, we seek to answer binary classification questions about the author based on facts about them and their interests, such as whether the author is a computer science major or watches anime. Conveniently, the platform gives users the ability to label themselves by major and interest---for example, through tags, their self-written biography, and other context. We can therefore automatically annotate messages based on the known major and interests of the author without manually labeling individual messages.
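The automatic annotation step can be sketched as follows. The user records, field names, and the example classification question are all hypothetical; in practice they would come from Discord role tags and biographies.

```python
# Hypothetical self-reported user profiles; field names are illustrative.
users = {
    "alice": {"major": "CS", "interests": {"anime"}},
    "bob": {"major": "ME", "interests": {"hiking"}},
}
messages = [
    {"author": "alice", "text": "just finished my ML homework"},
    {"author": "bob", "text": "anyone up for a trail run?"},
]

def label(msg: dict, question) -> dict:
    """Attach a binary label derived from the author's profile,
    so no message needs to be labeled by hand."""
    return {"text": msg["text"], "label": question(users[msg["author"]])}

# Example binary question: is the author a computer science major?
dataset = [label(m, lambda u: u["major"] == "CS") for m in messages]
assert [d["label"] for d in dataset] == [True, False]
```

Because the label is a function of the author rather than the message text, any binary question about the author's tags or interests yields a labeled dataset for free.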
## Problem Definition
We would like to classify users by their interests based on the messages they send on online chat platforms. While this particular formulation is novel, the problem touches on real-world applications: constructing demographic profiles of online users is a rapidly growing field, and big tech companies like Google and Facebook collect vast amounts of user data to personalize their advertising campaigns for specific users.
## Methods
Often, state-of-the-art results on text data use pre-trained so-called *foundation* models, for example, OpenAI's GPT-3[^2] and DeepMind's Chinchilla[^3]. These models are trained on huge text corpora of general data scraped from Wikipedia, Reddit, news articles, and other internet sources. We plan to exploit transfer learning, that is, use such foundation models to project text sequence data into a lower-dimensional embedding in the ambient real vector space. We can then train a simple fully connected neural network from the low-dimensional embedding to the final prediction, possibly using a sigmoid or softmax output layer to learn a probability distribution for the desired binary classification.
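A minimal sketch of the classification head, assuming the foundation model's embedding is already computed and frozen. The layer sizes and random weights below are placeholders; a real head would be trained with gradient descent on the labeled messages.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_head(embedding, w1, b1, w2, b2):
    """Small fully connected head on a frozen embedding: one hidden
    layer with ReLU, then a sigmoid for a binary class probability."""
    h = np.maximum(embedding @ w1 + b1, 0.0)  # hidden layer
    logit = h @ w2 + b2
    return 1.0 / (1.0 + np.exp(-logit))       # probability in (0, 1)

d, hdim = 32, 8  # illustrative embedding and hidden dimensions
w1, b1 = rng.normal(size=(d, hdim)), np.zeros(hdim)
w2, b2 = rng.normal(size=hdim), 0.0

p = mlp_head(rng.normal(size=d), w1, b1, w2, b2)
assert 0.0 < p < 1.0  # valid probability for the binary question
```

Keeping the foundation model frozen and training only this small head is what makes the transfer-learning approach cheap relative to training end-to-end.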
## Potential Results and Discussion
Because we have a huge source of labeled messages, we can directly assess the accuracy of our model's predictions on unseen data. Additionally, we would like to evaluate the extent to which our model generalizes to messages in other platforms: Reddit has separate communities dedicated to specific topics, which would give us additional labeled data.
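The primary evaluation metric above is straightforward to compute; a one-function sketch, with hypothetical model outputs:

```python
def accuracy(predictions, labels):
    """Fraction of held-out messages classified correctly."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical model outputs vs. ground-truth labels on unseen messages.
preds = [True, False, True, True]
truth = [True, False, False, True]
assert accuracy(preds, truth) == 0.75
```

The same function applies unchanged to the cross-platform experiment: predictions on Reddit messages are scored against labels derived from the subreddit's topic.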
## Timeline and Responsibilities
Find our Gantt chart [here][1].
[1]: https://docs.google.com/spreadsheets/d/1xF0QfyKEMA8j1hSQX0l0A4f8lE0F2F2ChQXe_TfYFhU/edit?usp=sharing
## Contribution Table
| Group member |Tasks |
|---------------|-------------------------------------------------|
| Stephen Huan | Literature review and methods; website; video. |
| Daniel Lu | Introduction; timeline / Gantt chart; video. |
| Dae Seon Park | Literature review; timeline / Gantt chart.      |
| Adam Zamlynny | Introduction; methods; timeline / Gantt chart. |
| Max Zhang | Problem definition; potential results; website. |
Additionally, each member worked on the contribution table.
## References
[^1]: A. Conneau and D. Kiela, "SentEval: An Evaluation Toolkit for Universal Sentence Representations." arXiv, Mar. 14, 2018. Accessed: Oct. 08, 2022. [Online]. Available: http://arxiv.org/abs/1803.05449
[^2]: T. B. Brown et al., "Language Models are Few-Shot Learners." arXiv, Jul. 22, 2020. Accessed: Oct. 08, 2022. [Online]. Available: http://arxiv.org/abs/2005.14165
[^3]: J. Hoffmann et al., "Training Compute-Optimal Large Language Models." arXiv, Mar. 29, 2022. Accessed: Oct. 07, 2022. [Online]. Available: http://arxiv.org/abs/2203.15556