NLP Final Report

### Problem Statement
- Stage 1: Tweet sentiment analysis
    - Bullish (buy the stock), Bearish (sell the stock), Neutral (do nothing).
- Stage 2: Stock price movement
    - ![image](https://hackmd.io/_uploads/Hyp04uIdyg.png)
- Perhaps only work on Tesla

### Method
- Tweet sentiment analysis:
    - FinBERT:
        - The positive label corresponds to buy and the negative label corresponds to sell, not necessarily positive and negative in the traditional sense.
        - Fine-tuned on the "[zeroshot/twitter-financial-news-sentiment](https://huggingface.co/datasets/zeroshot/twitter-financial-news-sentiment)" dataset
    - Twitter-roBERTa-base for Sentiment Analysis
    - BERT
- Stock price movement
    - Features:
        - Adjusted closing price for days t-n to t-1 (from yfinance)
        - Weighted sentiment scores from stage 1 for days t-n to t-1 (n: lookback window size)
            - -1: Bearish, 0: Neutral, 1: Bullish
            - Aggregate the sentiment scores of all tweets in a day, weighted by their number of retweets, and normalize
    - Model: Random Forest Classifier

### Result
- Tweet sentiment analysis
    - ![image](https://hackmd.io/_uploads/HkJbsu8Oke.png)
    - The syntax and language of tweets differ noticeably from other financial text, such as news articles and other online sources. Therefore, preprocessing and fine-tuning are important.
- Stock price movement
    - Only the fine-tuned FinBERT is used, since it has the best performance on stage 1
    - ![image](https://hackmd.io/_uploads/BkgssdUdkx.png)
    - The weighted sentiment idea parallels attention mechanisms: tweets with more retweets contribute more to the daily score.
- Analysis of lookback window size
    - ![image](https://hackmd.io/_uploads/SyaJ3OIuJg.png)
    - They eventually chose n = 14

### How can we improve
- Use an LSTM to handle the lookback window
- Combine different financial text datasets to create more features
- Feature importance
- Use this [dataset](https://www.kaggle.com/datasets/utkarshxy/stock-markettweets-lexicon-data/data)

$\text{score} = (-1) \cdot P(\text{negative}) + 1 \cdot P(\text{positive})$
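
As a rough illustration of the sentiment features described under Method, the sketch below computes the per-tweet score from the formula above, aggregates the scores of a day weighted by retweets, and builds lookback-window feature rows. It is a minimal sketch under stated assumptions, not the report's actual implementation: the function names are hypothetical, "normalize" is read as dividing by the total weight, and each tweet gets a weight of `retweets + 1` so that tweets without retweets still count.

```python
from collections import defaultdict


def sentiment_score(p_negative: float, p_positive: float) -> float:
    """score = (-1) * P(negative) + 1 * P(positive); the neutral class contributes 0."""
    return -1.0 * p_negative + 1.0 * p_positive


def daily_weighted_sentiment(tweets):
    """Aggregate per-tweet scores into one score per day.

    tweets: iterable of (day, score, retweets) tuples.
    Assumption: weight = retweets + 1, normalized by the day's total weight.
    """
    weighted_sum = defaultdict(float)
    total_weight = defaultdict(float)
    for day, score, retweets in tweets:
        weight = retweets + 1  # assumption: keep zero-retweet tweets in the average
        weighted_sum[day] += weight * score
        total_weight[day] += weight
    return {day: weighted_sum[day] / total_weight[day] for day in weighted_sum}


def lookback_features(prices, sentiments, n):
    """For each day t >= n, concatenate prices[t-n:t] and sentiments[t-n:t]."""
    return [
        list(prices[t - n:t]) + list(sentiments[t - n:t])
        for t in range(n, len(prices))
    ]


if __name__ == "__main__":
    # Illustrative values only, not data from the report.
    tweets = [
        ("day1", sentiment_score(0.1, 0.7), 9),
        ("day1", sentiment_score(0.6, 0.2), 0),
        ("day2", sentiment_score(0.2, 0.2), 4),
    ]
    print(daily_weighted_sentiment(tweets))
    print(lookback_features([1, 2, 3, 4], [0.1, 0.2, 0.3, 0.4], n=2))
```

Each row returned by `lookback_features` could serve as one input sample for a classifier such as the Random Forest mentioned above, with the direction of the price movement on day t as the label.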