# AI Final Project 2021

By: Chan Luo Qi (1002983), Ku Wee Tiong (1002713), Seow Xu Liang (1003324), Yuri Kim (1002334)

## Executive Summary

This report presents our solution for the COVID-19 Retweet Prediction Challenge. The aim of the challenge is to predict the number of times a COVID-19 related tweet will be retweeted. We explored models from both statistical and deep learning (DL) approaches. Experiments with incremental sets of features confirm their usefulness. We found that the Light Gradient Boosting Machine (LightGBM) model outperforms our more complex DL models despite using fewer features. Our best-performing LightGBM model, LGB-B, achieved a mean squared logarithmic error (MSLE) of **0.18206** on our test set, while our best DL model, DL-C, achieved an MSLE of **0.2365** on the same test set. Finally, we incorporated our best-performing LightGBM model into a GUI and demonstrate its ease of use.

## 1 Task Description

Twitter is an online social network where users can follow each other and share information using short text posts called tweets. Twitter provides a retweet function, which reposts a tweet to one's followers without any change, thus amplifying the spread of the original content. Predicting retweets is a crucial task when studying information spreading processes and is useful for many applications such as political audience design or fake news spreading and tracking. Therefore, understanding and modeling retweet behavior has been an active research area and might be particularly helpful during times of crisis, such as the current COVID-19 pandemic. In this project, the task is to predict the number of times a tweet will be retweeted (#retweets), given the set of features for the tweet from TweetsCOV19.

## 2 TweetsCOV19 Dataset Description

The TweetsCOV19 dataset provided in the [CIKM2020 AnalytiCup COVID-19 Retweet Prediction Challenge](https://data.gesis.org/covid19challenge/) is a collection of a large number of COVID19-related tweets. These tweets are drawn from the large, anonymised and annotated TweetsKB corpus using an initial list of 268 COVID19-related keywords. TweetsCOV19 contains all tweets related to COVID19 from September 2019 to December 2020. The total number of tweets in the dataset is approximately 20 million. For each tweet, the dataset provides the user of the tweet, the time of the tweet, metadata such as #followers, #favorites and #friends (see Table 2-1 for descriptions of features and Table 2-2 for statistics), and some text information about the tweet, but not the tweet text itself. The text information of a tweet is divided into entities, hashtags, mentions and URLs.

##### Table 2-1: Description of features in dataset

| Feature    | Description                                                                          |
| ---------- | ------------------------------------------------------------------------------------ |
| Tweet Id   | Unique tweet identifier.                                                             |
| Username   | Unique anonymised username of tweeter.                                               |
| Timestamp  | Timestamp of tweet.                                                                  |
| #Followers | Number of followers the tweeter has.                                                 |
| #Friends   | Number of friends (accounts followed) the tweeter has.                               |
| #Retweets  | Number of retweets the tweet has. The target variable to predict.                    |
| #Favorites | Number of favourites the tweet has.                                                  |
| Entities   | A string of entities and scores found in the tweet, as produced by the FEL library.  |
| Sentiment  | A string of sentiment scores produced by SentiStrength.                              |
| Mentions   | A string of mentions of other users in the tweet.                                    |
| Hashtags   | A string of hashtags in the tweet.                                                   |
| URLs       | A string of URLs in the tweet.                                                       |
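For reference, a minimal sketch of loading the challenge data with pandas. The tab-separated layout, the column order (following Table 2-1) and the file name are assumptions for illustration; adjust them to the actual file provided by the challenge.

```python
import pandas as pd

# Column order assumed to follow Table 2-1; adjust to the actual file layout.
COLUMNS = [
    "tweet_id", "username", "timestamp", "followers", "friends",
    "retweets", "favourites", "entities", "sentiment",
    "mentions", "hashtags", "urls",
]

df = pd.read_csv(
    "tweets_cov19.tsv",  # illustrative file name
    sep="\t",
    names=COLUMNS,
    header=None,
)
print(df[["followers", "friends", "retweets", "favourites"]].describe())
```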
##### Table 2-2: Five number summary of core numerical features

|       | followers   | friends     | retweets    | favourites  | pos_sentiment | neg_sentiment |
| ----- | ----------- | ----------- | ----------- | ----------- | ------------- | ------------- |
| count | 1.99696e+07 | 1.99696e+07 | 1.99696e+07 | 1.99696e+07 | 1.99696e+07   | 1.99696e+07   |
| mean  | 248644      | 3343.71     | 39.6864     | 137.861     | 1.58782       | -1.62244      |
| std   | 2.23162e+06 | 18528.9     | 568.209     | 2353.04     | 0.755425      | 0.992326      |
| min   | 0           | 0           | 0           | 0           | 1             | -5            |
| 25%   | 178         | 181         | 0           | 0           | 1             | -2            |
| 50%   | 1133        | 567         | 0           | 0           | 1             | -1            |
| 75%   | 10049       | 1708        | 8           | 17          | 2             | -1            |
| max   | 1.27387e+08 | 4.48656e+06 | 308686      | 1.06099e+06 | 5             | -1            |

Some highlights of the feature statistics in Table 2-2:

1. `followers`, `friends`, `retweets` and `favourites` appear to follow long-tail distributions. These can be better represented via log-transformation.
2. `retweets` and `favourites` are very skewed, with at least 50% of tweets having 0 of either. There may be value in training a classifier to first predict whether a tweet will gain 0 `retweets` before using a regressor.

## 3 Approach Overview

### 3.1 Exploration of Methods

For this regression problem, the team tried both statistical machine learning and deep learning approaches. We compare the performance of these two approaches to determine the final model used.

![](https://i.imgur.com/l0fddzJ.png)
<center><i>Figure 3-1: Two-pronged approach to the project</i></center>

#### 3.1.1 Statistical Machine Learning Methods

We explored several Statistical Machine Learning (SML) methods, such as Linear Regression (LR), Support Vector Machine (SVM) and Light Gradient Boosting Machine (LightGBM). Table 3-1 records the results from preliminary tests. The SVM model failed to complete training in reasonable time, indicating its unsuitability for datasets of several million rows. LR provided a baseline MSLE of 0.24059 and an R<sup>2</sup> score of 0.25091, suggesting a poor linear fit. Examining the LR coefficients of the features (Figure 3-2) reveals that `favourites`, which has a strong linear correlation with the target `retweets`, is the key variable that explains the target, while the other variables hold little linear explanatory power. To explore the value of the other variables, which may have non-linear or interacting effects, we decided to focus our efforts on LightGBM (details in Section 4).

##### Table 3-1: Preliminary tests with simple statistical models

| Model | MSLE     | Remarks                                        |
| ----- | -------- | ---------------------------------------------- |
| LR    | 0.24059  | R<sup>2</sup> score of 0.25091                 |
| SVM   | n.a.     | Failed to complete training in reasonable time |

<center><img src="https://i.imgur.com/AlEqrtJ.png"></center>
<center><i>Figure 3-2: The coefficients of features used in the Linear Regression model</i></center><br>

#### 3.1.2 Deep Learning Approaches

We evaluate the performance of several deep learning models, which comprise different input features and different numbers of layers. Later, we also attempted a two-stage classifier-then-regressor model that yielded better performance (Section 5).

### 3.2 Feature Engineering

To aid training, we created more features (Table 3-2) from the information provided in the dataset. They include the conversion of non-numerical features (strings) to a reasonable numerical representation, as illustrated in the sketch below. The five number summary of the first three numerical features is provided in Table 3-3 and indicates that at least 50% of tweets do not contain hashtags, urls or mentions.
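A minimal sketch of this string-to-number conversion. The `null;` placeholder for empty fields and the whitespace separation of items are assumptions about the raw format; treat this as illustrative rather than our exact pipeline.

```python
import pandas as pd

def add_length_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add mentions_length, urls_length and hashtags_length (Table 3-2)."""
    def count_tokens(field) -> int:
        # A 'null;' placeholder (assumed) or NaN marks an empty field;
        # otherwise items are assumed whitespace-separated.
        if pd.isna(field) or field == "null;":
            return 0
        return len(str(field).split())

    for col in ("mentions", "urls", "hashtags"):
        df[f"{col}_length"] = df[col].apply(count_tokens)

    # Example of one of the binary indicator flags in Table 3-2:
    df["hashtags_covid19"] = (
        df["hashtags"].str.contains("covid19", case=False, na=False).astype(int)
    )
    return df
```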
##### Table 3-2: Description of features engineered

| Feature                    | Description                                                                       |
| -------------------------- | --------------------------------------------------------------------------------- |
| `mentions_length`          | The absolute number of mentions contained in a tweet.                             |
| `urls_length`              | The absolute number of urls attached to a tweet.                                  |
| `hashtags_length`          | The absolute number of hashtags used in a tweet.                                  |
| `mentions_realdonaldtrump` | A binary that indicates the presence of a mention of `@realdonaldtrump` in tweet. |
| `mentions_jaketapper`      | A binary that indicates the presence of a mention of `@jaketapper` in tweet.      |
| `mentions_joebiden`        | A binary that indicates the presence of a mention of `@joebiden` in tweet.        |
| `mentions_narendramodi`    | A binary that indicates the presence of a mention of `@narendramodi` in tweet.    |
| `mentions_pmoindia`        | A binary that indicates the presence of a mention of `@pmoindia` in tweet.        |
| `hashtags_covid19`         | A binary that indicates the presence of the hashtag `#covid19` in tweet.          |
| `hashtags_coronavirus`     | A binary that indicates the presence of the hashtag `#coronavirus` in tweet.      |
| `hashtags_covid_19`        | A binary that indicates the presence of the hashtag `#covid_19` in tweet.         |
| `hashtags_china`           | A binary that indicates the presence of the hashtag `#china` in tweet.            |
| `hashtags_covid--19`       | A binary that indicates the presence of the hashtag `#covid--19` in tweet.        |
| `urls_twitter.com`         | A binary that indicates the presence of the domain `twitter.com` in tweet.        |
| `urls_instagram.com`       | A binary that indicates the presence of the domain `instagram.com` in tweet.      |
| `urls_youtube.com`         | A binary that indicates the presence of the domain `youtube.com` in tweet.        |
| `urls_nytimes.com`         | A binary that indicates the presence of the domain `nytimes.com` in tweet.        |
| `urls_theguardian.com`     | A binary that indicates the presence of the domain `theguardian.com` in tweet.    |

##### Table 3-3: Five number summary of additional numerical features

|       | urls_length | mentions_length | hashtags_length |
| :---: | :---------: | :-------------: | :-------------: |
| count | 1.99696e+07 | 1.99696e+07     | 1.99696e+07     |
| mean  | 0.242886    | 1.09021         | 0.717347        |
| std   | 0.445667    | 20.3961         | 31.365          |
| min   | 0           | 0               | 0               |
| 25%   | 0           | 0               | 0               |
| 50%   | 0           | 0               | 0               |
| 75%   | 0           | 1               | 1               |
| max   | 10          | 41109           | 78102           |

Several features were derived from the timestamp of tweets, which in itself is too fine-grained to provide useful information. They are (see the sketch after Figure 3-4):

1. `hour_of_day` - may provide some indication of the geographical region of the tweeter and correlate with the availability of other users to see and retweet the tweet
2. `day_of_week` - may correlate with the availability of other users to see and retweet the tweet
3. `day_of_month`
4. `month_of_year` - may correlate with large, recurring events such as holidays
5. `months_from_start` - the index of each month period from the start of the dataset to its end; may indicate changing general sentiments or the involvement of influential parties over time

Their distribution is shown in Figure 3-3. To visualise the correlation between `retweets` and some engineered features, we log-transformed `retweets` using base 10, categorised each magnitude of retweets with the letters A-F, and sampled the same number of tweets from each category. A parallel plot of this sample was then drawn with selected variables (see Figure 3-4).

Figure 3-4 shows that tweets with more than 100,000 retweets only started appearing near the midpoint of the dataset in terms of `months_from_start`. Figure 3-4 also indicates the lack of obvious correlation between certain engineered features and the magnitude of retweets.

<center><img src="https://ifh.cc/g/DTx70R.jpg"></center>
<center><i>Figure 3-3: Histogram of selected one-hot categorical features engineered from the tweet timestamp</i></center><br>

<center><img src="https://i.imgur.com/3eXrpje.png"></center>
<center><i>Figure 3-4: Parallel coordinate plot of log-transformed retweets of sampled tweets against selected variables</i></center><br>
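For reference, the timestamp-derived features listed above can be computed with pandas. A minimal sketch, assuming the `timestamp` column parses with `pd.to_datetime`:

```python
import pandas as pd

def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive the five timestamp features described in Section 3.2."""
    ts = pd.to_datetime(df["timestamp"])
    df["hour_of_day"] = ts.dt.hour
    df["day_of_week"] = ts.dt.dayofweek
    df["day_of_month"] = ts.dt.day
    df["month_of_year"] = ts.dt.month
    # Index of each month period from the start of the dataset.
    month_period = ts.dt.to_period("M")
    df["months_from_start"] = (month_period - month_period.min()).apply(lambda p: p.n)
    return df
```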
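To make the two definitions above concrete, a minimal numpy sketch of log-normalisation and the MSLE. We use `log1p` here so that zero counts stay finite, which is a practical stand-in for the plain log in the formula; treat this as a sketch, not our exact `dataloader_all.py` code.

```python
import numpy as np

def lognorm(x: np.ndarray, log_mean: float, log_std: float) -> np.ndarray:
    # Log-normalise an unbounded feature; the mean and standard deviation
    # must be computed on the same log-transformed scale (train set only).
    return (np.log1p(x) - log_mean) / log_std

def msle(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Both arrays are unnormalised, untransformed retweet counts.
    return float(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))
```

For example, `msle(np.array([0, 8, 100]), np.array([1, 10, 90]))` scores predictions by the ratio of predicted to true counts rather than their absolute difference.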
## 4 Statistical Machine Learning Approach

### 4.1 About LightGBM

[LightGBM](https://lightgbm.readthedocs.io/en/latest/index.html) is a free and open-source gradient boosting framework developed by Microsoft that uses tree-based learning algorithms. Gradient boosting is an algorithm for developing ensemble models for structured tasks such as regression on tabular data and is often a winning solution in machine learning competitions.

LightGBM offers several advantages over the more popular XGBoost algorithm, making it our preferred implementation of gradient boosting:

* Accepts categorical data
* Faster training speed
* Lower memory usage
* Support for parallel, distributed and GPU learning

However, LightGBM cannot accept input data as flexibly as deep learning methods: the model accepts numerical and one-hot categorical features, but not embeddings or multi-hot vectors.

### 4.2 Hyperparameter Settings and Experiments

A validation set was further split from the common train set mentioned in Section 3.4. This validation set is 1/7 of the common train set, or 10% of the full dataset.

Three versions of this model were tested with different features before one model was selected for hyperparameter tuning:

1. **Core model (Model LGB-O).** Only the core numerical features were used as a baseline test with default parameters: `followers`, `friends`, `favourites`, `pos_sentiment`, `neg_sentiment`.
2. **Basic model (Model LGB-A).** Additional numerical features were added to the core model LGB-O: `urls_length`, `mentions_length` and `hashtags_length`. This model is comparable with Model DL-A in Section 5.1.
3. **With added features (Model LGB-B).** One-hot categorical features derived from feature engineering (Section 3.2) were added to the basic model and represented as categorical data in a pandas DataFrame. They are: `day_of_month`, `day_of_week`, `months_from_start`, `hour_of_day` and `month_of_year`.

The variables used and the statistics used for standard normalisation are summarised in Table 4-1 and Table 4-2 respectively. The statistics are calculated after log-transformation of the relevant variables. A raw input must first be transformed and normalised before prediction with the model; outputs of the model are then reverse-scaled and inverse-transformed for MSLE scoring.

##### Table 4-1: Variables used with LightGBM

| Variable Type      | Variables                                                                           | Remarks                                                                                      |
| ------------------ | ----------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------- |
| Target (Numerical) | `retweets`                                                                          |                                                                                              |
| Core Numerical     | `followers`, `friends`, `favourites`, `pos_sentiment`, `neg_sentiment`             | The first three variables are log-transformed. All numerical variables are standard scaled.  |
| Basic Numerical    | `urls_length`, `mentions_length`, `hashtags_length`                                |                                                                                              |
| Categorical        | `hour_of_day`, `day_of_week`, `day_of_month`, `month_of_year`, `months_from_start` | Indicated in the pandas DataFrame as a category variable.                                    |

##### Table 4-2: Statistics of train dataset for standard scaling

| Variable          | Mean    | Deviation |
| ----------------- | ------- | --------- |
| `retweets`        | 1.2249  | 1.7205    |
| `followers`       | 7.2478  | 3.1604    |
| `friends`         | 6.2661  | 1.9458    |
| `favourites`      | 1.5132  | 2.1553    |
| `pos_sentiment`   | 1.5877  | 0.75536   |
| `neg_sentiment`   | -1.6224 | 0.99236   |
| `urls_length`     | 0.24281 | 0.44564   |
| `mentions_length` | 1.0895  | 20.666    |
| `hashtags_length` | 0.71730 | 32.963    |
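To make the preprocessing concrete, a sketch of applying the transformations in Table 4-1 with the train-set statistics in Table 4-2, and of marking the categorical columns for LightGBM. The use of `log1p` (natural log) is an assumption consistent with the statistics above; this is illustrative, not our exact `dataloader_all.py` code.

```python
import numpy as np
import pandas as pd

# Train-set statistics from Table 4-2 (computed after log-transformation
# of retweets, followers, friends and favourites).
STATS = {
    "retweets":        (1.2249,  1.7205),
    "followers":       (7.2478,  3.1604),
    "friends":         (6.2661,  1.9458),
    "favourites":      (1.5132,  2.1553),
    "pos_sentiment":   (1.5877,  0.75536),
    "neg_sentiment":   (-1.6224, 0.99236),
    "urls_length":     (0.24281, 0.44564),
    "mentions_length": (1.0895,  20.666),
    "hashtags_length": (0.71730, 32.963),
}
LOG_COLS = {"retweets", "followers", "friends", "favourites"}
CAT_COLS = ["hour_of_day", "day_of_week", "day_of_month",
            "month_of_year", "months_from_start"]

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    for col, (mean, std) in STATS.items():
        if col not in df:
            continue  # e.g. no target column at inference time
        x = np.log1p(df[col]) if col in LOG_COLS else df[col]
        df[col] = (x - mean) / std
    for col in CAT_COLS:
        if col in df:
            df[col] = df[col].astype("category")  # LightGBM handles these natively
    return df
```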
#### 4.2.1 Performance of Models with Default Hyperparameters (LGB-O vs LGB-A vs LGB-B)

Each model was initialised with the same default hyperparameters as stated in Table 4-4 and yielded the validation and test MSLEs recorded in Table 4-3.

```python
import lightgbm as lgb

# Default hyperparameters used for LGB-O, LGB-A and LGB-B
bst = lgb.LGBMRegressor(
    num_leaves=31,
    learning_rate=0.1,
    n_estimators=100,
    reg_alpha=0,
    reg_lambda=0,
    objective='regression',
    random_state=42,
    max_depth=-1,
    silent=False)
```

##### Table 4-3: Performance of default LightGBM models

| Model     | Validation MSLE | Test MSLE   | Improvement from Previous |
| :-------: | :-------------: | :---------: | :-----------------------: |
| LGB-O     | 0.21500         | 0.22173     | n.a.                      |
| LGB-A     | 0.19969         | 0.19988     | 0.02185                   |
| **LGB-B** | 0.19486         | **0.19516** | 0.00472                   |

Each set of variables added from LGB-O to LGB-A and then to LGB-B improved the test MSLE, as stated in Table 4-3. The number of hashtags, mentions and urls appears to provide more useful information for predicting the number of retweets than the various features engineered from the timestamp. One limitation of this comparison is that LGB-B is incremental upon LGB-A and may not reflect the true incremental value of the timestamp-derived features. LGB-B was chosen to proceed to hyperparameter tuning in the next section.

#### 4.2.2 Hyperparameter Tuning with Bayesian Optimisation

Hyperparameter tuning was conducted on Model LGB-B with the [Bayesian Optimisation package](https://github.com/fmfn/BayesianOptimization). Table 4-4 describes the hyperparameters tuned and their respective tuning boundaries. The optimiser was set to maximise the negative MSLE of the validation set (i.e. minimise MSLE). The optimiser was also set to initialise at 10 random points before beginning 50 optimising iterations.

```python
from bayes_opt import BayesianOptimization

# set tuning bounds
pbounds = {
    'num_leaves': (31, 100),
    'min_data_in_leaf': (1, 1000),
    'max_bin': (255, 2550),
    'learning_rate': (0.001, 1),
    'n_estimators': (100, 1000),
    'reg_alpha': (0, 10),
    'reg_lambda': (0, 10)}

# initialise optimizer
optimizer = BayesianOptimization(
    f=train_lgb,  # black-box training function that returns negative validation MSLE
    pbounds=pbounds,
    random_state=42,
)

# run optimizer
optimizer.maximize(
    init_points=10,
    n_iter=50,
)
```

The final tuned parameter values are stated in Table 4-4 and yielded a validation MSLE of **0.18183**, an improvement of 0.01303 (6.67%) over the untuned model.

##### Table 4-4: LightGBM hyperparameters tuned, their descriptions, tuning boundaries and final tuned values

| Hyperparameter     | Description                                                  | Default | Tuning Bounds | Tuned Value |
| ------------------ | ------------------------------------------------------------ | ------- | ------------- | ----------- |
| `num_leaves`       | Max number of leaves in one tree.                            | 31      | (31, 100)     | 93          |
| `min_data_in_leaf` | Minimal number of data in one leaf.                          | 20      | (1, 1000)     | 783         |
| `max_bin`          | Max number of bins that feature values will be bucketed in.  | 255     | (255, 2550)   | 2468        |
| `learning_rate`    | Shrinkage rate.                                              | 0.1     | (0.001, 1)    | 0.38672     |
| `n_estimators`     | Number of boosting iterations.                               | 100     | (100, 1000)   | 932         |
| `reg_alpha`        | L1 regularization coefficient.                               | 0.0     | (0, 10)       | 4.0996      |
| `reg_lambda`       | L2 regularization coefficient.                               | 0.0     | (0, 10)       | 3.1099      |

The optimisation performed has limitations. While the search space is finite, it is multi-dimensional with known interactivity (e.g. a low `learning_rate` and a high `n_estimators` together tend to improve accuracy). 10 random initialisation points and 50 optimising iterations may be insufficient to adequately cover the search space. However, an examination of the range of validation MSLEs reported during the optimisation process indicates that it may not be fruitful to extend the search, as improvements to the loss were rare and the variance of the losses was low.
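For completeness, a sketch of the kind of black-box objective that `train_lgb` above could be. All names (`X_train`, `y_val`, the inverse-transform helper `unscale`) are illustrative, and integer-valued hyperparameters must be rounded since the optimiser proposes floats.

```python
import numpy as np
import lightgbm as lgb
from sklearn.metrics import mean_squared_log_error

def train_lgb(num_leaves, min_data_in_leaf, max_bin,
              learning_rate, n_estimators, reg_alpha, reg_lambda):
    """Train on the fixed split and return the negative validation MSLE.

    X_train, y_train, X_val, y_val and the inverse transform `unscale`
    are assumed to be in scope (illustrative names).
    """
    model = lgb.LGBMRegressor(
        num_leaves=int(round(num_leaves)),
        min_child_samples=int(round(min_data_in_leaf)),  # sklearn alias of min_data_in_leaf
        max_bin=int(round(max_bin)),                     # passed through to the booster
        learning_rate=learning_rate,
        n_estimators=int(round(n_estimators)),
        reg_alpha=reg_alpha,
        reg_lambda=reg_lambda,
        objective='regression',
        random_state=42,
    )
    model.fit(X_train, y_train)
    preds = unscale(model.predict(X_val))  # back to raw retweet counts
    preds = np.clip(preds, 0, None)        # MSLE requires non-negative values
    return -mean_squared_log_error(unscale(y_val), preds)
```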
### 4.3 Evaluation of Model Training Process

The train and validation L2 loss curves (Figure 4-1) of the tuned model show model convergence (note that both losses are very similar and hence appear to overlap). Although the validation loss had largely converged within the first 50 tree iterations, early stopping was not triggered as the validation loss kept decreasing. The final ensemble model does not appear to have overfitted, as the difference between the train and validation losses did not increase with the iterations.

<center><img src="https://i.imgur.com/D9XPBxc.png"></center>
<center><i>Figure 4-1: The validation L2 loss of the tuned LGB-B (lgb_14_bo_31rt)</i></center><br>

The model achieved a test MSLE of **0.18206**, an increase of 4.42% over the train MSLE. The loss results for the model are summarised in Table 4-5. Total inference time for 5,990,870 samples was 104 seconds.

##### Table 4-5: MSLE for tuned LightGBM model on train, validation and test sets

| Dataset    | MSLE    |
| ---------- | ------- |
| Train      | 0.17436 |
| Validation | 0.18183 |
| Test       | 0.18206 |

Figure 4-2 and Figure 4-3 show the feature importance by split and by gain respectively. The more times a feature is used to split in the model, the higher its split value, while its gain value is the total gain of the splits that use the feature. The top two features used to split are `friends` and `followers`, while the top feature by gain, by far, is `favourites`, which is ranked third in frequency of split usage. `favourites` is clearly the essential feature for predicting the target `retweets`, while `friends` and `followers` provide further nuance. The digraph plot of the best tree of the best ensemble model can be found in Appendix Figure 9-1.

<center><img src="https://i.imgur.com/faupEbX.png"></center>
<center><i>Figure 4-2: Feature importance by splits of tuned LGB-B (lgb_14_bo_31rt)</i></center>

<center><img src="https://i.imgur.com/ZRxkVIa.png"></center>
<center><i>Figure 4-3: Feature importance by gains of tuned LGB-B (lgb_14_bo_31rt)</i></center>

## 5 Deep Learning Approach

In our exploration, we attempted two different model architectures: first, a basic linear model with varying numbers of input features; next, a two-stage regression model (more details in Section 5.2). We present the performance of the models in Section 5.3.

#### 5.1 Regression Model

We implemented a simple regression model using linear layers as described below. Note that the number of `in_features` in each layer changes as we vary the number of input features to the model.

```
Regression(
  (layers): Sequential(
    (0): Linear(in_features=113, out_features=56, bias=True)
    (1): LeakyReLU(negative_slope=0.01)
    (2): Linear(in_features=56, out_features=28, bias=True)
    (3): LeakyReLU(negative_slope=0.01)
    (4): Linear(in_features=28, out_features=14, bias=True)
    (5): LeakyReLU(negative_slope=0.01)
    (6): Linear(in_features=14, out_features=7, bias=True)
    (7): LeakyReLU(negative_slope=0.01)
    (8): Linear(in_features=7, out_features=1, bias=True)
  )
)
```

Two iterations of this model were tested with different features:

1. **Basic model (Model DL-A).** Only numerical features were used as a baseline test: `followers`, `friends`, `favourites`, `pos_sentiment`, `neg_sentiment`, `urls_length`, `mentions_length` and `hashtags_length`.
2. **With added features (Model DL-B).** Features derived from feature engineering (Section 3.2) were concatenated to the basic model. Categorical data such as `day_of_month`, `day_of_week`, `months_from_start`, `hour_of_day` and `month_of_year` were represented as one-hot vectors (see the sketch below).
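A sketch of the one-hot step for Model DL-B, with `pd.get_dummies` standing in for whatever encoding the dataloader performs (`df` and `NUMERICAL_COLS` are illustrative names):

```python
import pandas as pd

DATE_COLS = ["day_of_month", "day_of_week", "months_from_start",
             "hour_of_day", "month_of_year"]

# One-hot encode the date-time categories and concatenate them with the
# numerical features to form the model input vector.
one_hot = pd.get_dummies(df[DATE_COLS].astype("category"), columns=DATE_COLS)
X = pd.concat([df[NUMERICAL_COLS], one_hot], axis=1)
```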
This model is generally comparable with our final LightGBM model from Section 4.2 in terms of the feature set used, after excluding `urls_length`, `mentions_length` and `hashtags_length`.

Surprisingly, both Model DL-A and Model DL-B converge at the same training **MSLE loss of 4.4611**. The addition of the categorical date-time features did not help the model learn better.

#### 5.2 Two-Stage Learning (Model DL-C)

In our analysis of the dataset, we realised that 53.5% of tweets had zero retweets. This confirms our hypothesis that most tweets do not have any retweets at all, while only a handful go viral. Thus, we postulated that learning might be more efficient if we first identify tweets that are unlikely to gather any retweets at all, and then perform regression only on the complementary set of tweets. Figure 5-1 illustrates our idea.

<center><img src=https://i.imgur.com/H90deIT.png></center>
<center><i>Figure 5-1: Illustration of two-stage model</i></center><br>

##### 5.2.1 Classifier Training

The first stage of this two-step process is binary classification: Class 0 indicates 0 retweets, while Class 1 indicates > 0 retweets. For this task, we implemented a simple deep learning model with three layers as shown below.

```
Classifier(
  (layers): Sequential(
    (0): Linear(in_features=113, out_features=128, bias=True)
    (1): ReLU()
    (2): Linear(in_features=128, out_features=64, bias=True)
    (3): ReLU()
    (4): Linear(in_features=64, out_features=1, bias=True)
  )
)
```

The classifier was trained for 10 epochs using the Binary Cross Entropy loss function. Only 10 epochs were required for the model to converge, as the training dataset is very large.

##### 5.2.2 Regressor Training

To enhance training efficacy, we implemented *feature forcing* in the training of the regressor model. That is, the classifier was assumed to have 100% accuracy at training time, and the regressor was trained on the full set of Class 1 entries. The architecture of the model used is similar to that described in Section 5.1.

##### 5.2.3 Training Loss

Figure 5-2 illustrates the training loss of both models. As expected, the classifier converges at a lower loss and a faster rate than the regressor, as it solves an objectively easier task.

<center><img src=https://i.imgur.com/laRE0wP.png></center>
<center><i>Figure 5-2: Classifier and regression training loss</i></center><br>

The classifier BCE loss converges at 0.1408, while the regression MSLE loss converges at 0.7321.

##### 5.2.4 Evaluation of Model

Finally, the model was evaluated by chaining the classifier output into the regression model (a sketch of this chaining follows below). This "chained" loss was recorded as the final model loss and was computed for both the train and test sets. The train and test losses were plotted against epoch. As the losses at each epoch were very similar (differences appear only from the second decimal place), the two curves in Figure 5-3 appear to overlap, when in reality the test loss sits slightly above the train loss.

<center><img src=https://i.imgur.com/QnJOSzH.png></center>
<center><i>Figure 5-3: Final train and test loss of two-step classifier</i></center><br>

The final MSLE loss on the test set stands at **0.2365**. The closeness of the train and test losses suggests that the model generalises well beyond the training dataset. In fact, there might be room for improvement by increasing the complexity of the model, or of the features, to encourage better training.
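A sketch of the chained inference described in Section 5.2.4: the classifier gates which tweets reach the regressor, and gated-out tweets are assigned the prediction for zero retweets (shown as 0 here for simplicity). We assume the classifier outputs logits (hence the sigmoid); the glue code and the 0.5 threshold are illustrative.

```python
import torch

@torch.no_grad()
def predict_two_stage(classifier, regressor, x, threshold=0.5):
    """Stage 1 gates zero-retweet tweets; stage 2 regresses the rest."""
    prob_nonzero = torch.sigmoid(classifier(x)).squeeze(-1)
    preds = torch.zeros_like(prob_nonzero)
    mask = prob_nonzero > threshold  # Class 1: predicted to have > 0 retweets
    if mask.any():
        preds[mask] = regressor(x[mask]).squeeze(-1)
    return preds  # still in the normalised log space of the target
```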
#### 5.2.5 Potential Improvements

Increasing input feature complexity was attempted in favour of increasing model complexity -- we were already using five regression layers with only 113 input features. This was done using the pretrained Twitter GloVe embeddings. However, we faced resource limitations (Figure 5-4) in our implementation, as we did not have enough RAM for the training to complete.

<center><img src=https://i.imgur.com/bLFzdf3.png></center>
<center><i>Figure 5-4: Screenshot of Torch telling us to buy more RAM</i></center><br>

### 5.3 Comparison of Model Performances

##### Table 5-1: Performance of DL models

| Model | Description                     | MSLE Loss on test set | Time taken  |
| ----- | ------------------------------- | --------------------- | ----------- |
| DL-A  | Regression using basic features | 4.4654                | 1min 35secs |
| DL-B  | Regression using all features   | 4.4654                | 1min 53secs |
| DL-C  | Two-stage regression            | **0.2365**            | 1min 59secs |

Table 5-1 summarises the performance of the DL models. Overall, Model DL-C gives the best performance on the test set. The effect of the classifier is obvious, given the significantly lower MSLE loss on the test set. By first going through a binary classifier, we carry out the much easier task of identifying tweets that have zero retweets. The subsequent regression task is also easier, as it operates on only a subset of the data compared to models DL-A and DL-B.

## 6 Results Discussion

Our best model is the tuned LGB-B, achieving a test MSLE of 0.18206. This is the model we use with our GUI in Section 7.

**LightGBM outperformed DL approaches.** All LightGBM models outperformed the DL models (compare Tables 4-3, 4-5 and 5-1) in our tests. For example, the untuned Model LGB-O with only core numerical features achieved a test MSLE of 0.22173, better than the best-performing Model DL-C's 0.2365. This is a surprising result, as we had expected the more complex DL models incorporating more features to outperform. However, this may be a case of limited training opportunity for the DL models.

**Limited training for DL approaches.** As noted in Section 5.2.4, the train and test losses of the DL models were very close. This suggests room for improvement by increasing model complexity. We attempted to do this but were unsuccessful due to hardware resource limitations. With better hardware and time, the DL models may be able to outperform.

**LightGBM may be more suitable for this dataset and challenge if simplicity is important.** From our tests, LightGBM models provide better accuracy, model simplicity and faster inference than the DL methods. This may be due to the nature of the TweetsCOV19 dataset. While it is a large dataset in terms of row entries, its dimensionality is relatively low. While there are some natural language processing (NLP) opportunities, the lack of the tweet's actual text content means that much semantic information is lost. Basic NLP features such as the positive and negative sentiments of the tweet turned out to have the least predictive power (see Figures 3-2, 4-2 and 4-3). The relative lack of NLP opportunities suggests that this problem is better solved with a statistical approach than a DL approach.

**Engineered features are useful.** Our tests with the LGB-X models also confirm that our engineered features, such as the number of hashtags, mentions and URLs used and the timestamp-derived features, add information useful for improving retweet predictions.

**Comparison with top models from the challenge.** Takehara (2020), author of the [third top-performing model](http://ceur-ws.org/Vol-2881/paper3.pdf) in the CIKM2020 COVID-19 Retweet Prediction Challenge, used a relatively complex DL approach that incorporated multiple forms of numerical transformations and embeddings of categorical and multi-hot categorical features. Takehara's tests show that the greatest improvements came from transformations on numerical features (e.g. log-transformations), while complex user modeling involving NLP-type features had limited potential, improving MSLE by at most 0.017. Takehara's final model achieved a test MSLE of 0.136239; the top-performing team achieved a test MSLE of 0.120551. Compared to the model in Takehara (2020), our LGB-X models are much simpler and hence easier to use and interpret in practical situations.

## 7 GUI

Although the original intent of the published dataset was to analyse tweets in the COVID-19 context, we unfortunately found that features specific to COVID-19 did not contribute much to the retweet value of a tweet in general. Thus, our user interface aims to provide a quick and fun tool for users to predict the number of retweets a tweet might get. Following this design direction, we implemented a simple form-based UI. To extract the tweet features we need at inference time, we include quirky, relatable descriptions that appeal to a young age group. The form is designed to feel like a BuzzFeed quiz.

As concluded in Section 6, the tuned LGB-B has the lowest test loss and is the model we used for the GUI. The LightGBM model requires the following features from user input:

1. `followers`
2. `friends`
3. `favourites`
4. `pos_sentiment`
5. `neg_sentiment`
6. `hashtags_length`
7. `mentions_length`
8. `urls_length`

We incorporate these into our UI.
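Under the hood, prediction reduces to assembling these eight features, applying the same preprocessing as in Section 4.2, and inverting the target transform on the model output. A sketch with illustrative names (`preprocess` is an assumed helper applying the Table 4-1/4-2 transforms; any date-time categoricals the model expects could be derived from the submission time and are omitted here for brevity):

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

# Illustrative: load the tuned model saved after training.
model = lgb.Booster(model_file="lgb_14_bo_31rt.txt")

def predict_retweets(followers, friends, favourites, pos_sentiment,
                     neg_sentiment, hashtags_length, mentions_length,
                     urls_length):
    row = pd.DataFrame([{
        "followers": followers, "friends": friends, "favourites": favourites,
        "pos_sentiment": pos_sentiment, "neg_sentiment": neg_sentiment,
        "urls_length": urls_length, "mentions_length": mentions_length,
        "hashtags_length": hashtags_length,
    }])
    row = preprocess(row)  # same transforms and scaling as training (assumed helper)
    y_scaled = float(model.predict(row)[0])
    # Invert the standard scaling and log transform on the target (Table 4-2).
    y_log = y_scaled * 1.7205 + 1.2249
    return max(0, int(round(np.expm1(y_log))))
```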
<center><img src="https://i.imgur.com/livloNM.png"></center>
<center><i>Figure 7-1: Step 1 of 5 - entering the number of followers and friends, as well as an estimate of the average number of favourites the user receives</i></center><br>

The UI has a single page, and the information required from the user is immediately obvious and clear. As annotated in Figures 7-1 and 7-2, users are required to fill out step 1 and step 2, which are the input features required by our model. In step 3, radio buttons are used to ensure exclusivity of the options (Figure 7-3).

<center><img src=https://i.imgur.com/1UmxUXK.png></center>
<center><i>Figure 7-2: Step 2 of 5 - entering the number of hashtags, mentions and URLs used in the tweet</i></center><br>

<center><img src="https://i.imgur.com/D8zgPWv.png"></center>
<center><i>Figure 7-3: Step 3 of 5 - indicating tweet sentiments</i></center><br>

After filling in the relevant fields, the user can click the "Predict Tweet" button (Figure 7-4).

<center><img src="https://i.imgur.com/RoJwRzr.png"></center>
<center><i>Figure 7-4: Step 4 of 5 - click on the button to run the prediction model</i></center><br>

Shortly after, the prediction will appear at the bottom of the screen (Figure 7-5).

<center><img src="https://i.imgur.com/YmNxqTu.jpg"></center>
<center><i>Figure 7-5: Completed prediction process</i></center><br>

The final predictions for the tweet are shown at the bottom of the window. To edit or re-predict, the user can simply hit the predict button again to get a new prediction.

## 8 References

Takehara, D. (2020). Feature Extraction for Deep Neural Networks: A Case Study on the COVID-19 Retweet Prediction Challenge. http://ceur-ws.org/Vol-2881/paper3.pdf

## 9 Appendix

Figure 9-1 can be viewed at its hosted source in full resolution.

<center><img src="https://i.imgur.com/57PKNjQ.jpg"></center>
<center><i>Figure 9-1: Best iteration tree digraph of tuned LGB-B (lgb_14_bo_31rt)</i></center>
