# 112-2 DataMining Final
## Q1
### Describe how you solve this problem. Details including preprocessing, embeddings, model selection, and hyperparameters should be provided.
- Preprocessing and Embeddings:
1. Reading data and converting it into text embeddings:
Use pandas to read behaviors.tsv and news.tsv. The behaviors.tsv includes the user_id, the user's clicked_news history, and the impressions. The news.tsv includes news_id, category, subcategory, and title.
Combine the news information (news_id, category, subcategory, and title) into a single string for subsequent processing.
Use AutoModel and AutoTokenizer from the transformers library to perform text embeddings on the news information. The model selected is Alibaba-NLP/gte-large-en-v1.5.
2. Training data generation:
Obtain the text embeddings of the news information clicked by each user, and calculate the average of these embedding vectors as the final user representation.
For each exposed news item in the user's behavior data, generate the corresponding embedding vector and generate training labels based on whether the exposed news was clicked (click label).
The final training data includes: user embedding vectors (average of the embedding vectors of the clicked news) and embedding vectors of each exposed news item, along with the corresponding click labels.
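The preprocessing steps above can be sketched as follows. This is a toy illustration: a small inline DataFrame stands in for the real news.tsv, and random vectors stand in for the gte-large-en-v1.5 text embeddings (which are 1024-dimensional in practice); all IDs and values here are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for news.tsv; the real pipeline reads the file with pandas.
news = pd.DataFrame({
    "news_id": ["N1", "N2", "N3", "N4"],
    "category": ["sports"] * 4,
    "subcategory": ["football_nfl"] * 4,
    "title": ["t1", "t2", "t3", "t4"],
})
# Combine news_id, category, subcategory, and title into one string; this is
# the text that would be fed to the gte-large-en-v1.5 tokenizer and model.
news["text"] = news[["news_id", "category", "subcategory", "title"]].agg(" ".join, axis=1)

# Random 8-d vectors stand in for the real text embeddings.
rng = np.random.default_rng(0)
emb = {nid: rng.normal(size=8) for nid in news["news_id"]}

clicked = ["N1", "N2"]                # user's click history
impressions = [("N3", 1), ("N4", 0)]  # (impression news_id, click label)

# User representation: average of the clicked-news embeddings.
user_emb = np.mean([emb[n] for n in clicked], axis=0)

# One training row per impression: user embedding concatenated with the
# impression-news embedding, plus the click label.
X = np.stack([np.concatenate([user_emb, emb[n]]) for n, _ in impressions])
y = np.array([label for _, label in impressions])
print(X.shape, y.shape)
```

Each training row is therefore twice the embedding dimension wide: the first half encodes the user, the second half the candidate news item.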
- Model Selection and Hyperparameters
1. Catboost
- Strengths: Often performs well on tabular data, robust to outliers, handles categorical features well.
- Hyperparameters to consider:
- `iterations`: Number of boosting rounds (trees). More iterations can improve accuracy but increase training time. **500**
- `depth`: Depth of each tree. Deeper trees can capture complex relationships but risk overfitting. **6**
- `learning_rate`: Controls the step size in each iteration. Smaller values make the model learn slower but can lead to a better optimum. **0.1**
2. Neural Network
- Strengths: Can learn complex non-linear relationships, flexible for a wide range of data types.
- Hyperparameters to consider:
- `epoch`: Number of passes over the entire training dataset. More epochs might improve accuracy but can overfit. **100**
- `batch_size`: Number of samples processed in one training iteration. Larger batch sizes can speed up training but may require more memory. **128**
- `learning_rate`: Similar to CatBoost, controls the step size during optimization. **0.0005**
- `hidden_size`: Number of neurons in the hidden layers. Larger hidden sizes increase model capacity but can overfit. **4096**
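A minimal PyTorch sketch using these hyperparameters. The exact layer layout is an assumption, and the input dimension is shrunk from the real concatenated embedding size (2 × 1024 for gte-large) to keep the example small; only a few optimizer steps are run in place of the full 100 epochs.

```python
import torch
from torch import nn

input_dim, hidden_size = 32, 4096  # real input_dim would be 2 x 1024

# Simple MLP that maps [user_emb, news_emb] to a click probability.
model = nn.Sequential(
    nn.Linear(input_dim, hidden_size),
    nn.ReLU(),
    nn.Linear(hidden_size, 1),
    nn.Sigmoid(),  # output in [0, 1]
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)
loss_fn = nn.BCELoss()

x = torch.randn(128, input_dim)              # one batch (batch_size=128)
y = torch.randint(0, 2, (128, 1)).float()    # toy click labels
for _ in range(3):                           # stand-in for 100 epochs
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
print(model(x).shape)
```

In the real pipeline, batches of this shape would come from a PyTorch `DataLoader` over the full training set, which is what keeps memory usage bounded.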
3. Cosine Similarity
- Strengths: Simple and efficient, directly measures the similarity between user and news embeddings.
- Hyperparameters: This approach doesn't have typical hyperparameters.
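The cosine-similarity scoring can be sketched directly in NumPy; the vectors here are toy values, not real embeddings.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between a user embedding and a news embedding."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

user_emb = np.array([1.0, 2.0, 0.0])
news_emb = np.array([2.0, 4.0, 0.0])  # same direction as the user vector
score = cosine_similarity(user_emb, news_emb)
print(round(score, 4))  # → 1.0
```

The score is used directly as the predicted click probability ranking, with no training step involved.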
## Q2
### Choose a variable (e.g. different model, different approach) excluding hyperparameters and compare their performance.
- In the previous part, we described three different approaches to predicting the probability that an impression news item is clicked.
- The following table shows the comparison of their performance.
| Model Type | Validation AUC | Public Score |
| -------- | -------- | -------- |
| Catboost | 0.7283 | 0.6895 |
| Neural Network | 0.75 | 0.6939 |
| Cosine Similarity | 0.612 | 0.6062 |
### Explain what causes the difference of performance or why.
- During experiments with different parameter settings and dataset sizes, we found that CatBoost achieved a better AUC score on smaller datasets. However, CatBoost requires loading the entire training set into memory at once, whereas the Neural Network can bound memory usage through PyTorch's DataLoader and batching mechanisms. On our machine (128 GB RAM), CatBoost could therefore train on at most about 1 million examples, while the Neural Network could train on the full dataset of over 4 million examples. This is why the Neural Network showed stronger performance on the test dataset.
- The Cosine Similarity approach, by contrast, cannot be optimized with training data: it simply compares the similarity between user embeddings and news embeddings in the test dataset to predict click probabilities. Because it does not exploit the training data at all, its performance ultimately fell short of both CatBoost and the Neural Network.
## Q3
### Do some error analysis or case study. Is there anything worth mentioning while checking the mispredicted data? Share with us.
- Because we keep separate training and validation datasets, we can run our trained model on the validation set to look for mispredicted data.
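This check can be sketched with scikit-learn's `roc_auc_score` plus a filter for confident mistakes; the labels and scores below are toy values, not real validation data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy validation labels and model scores (illustrative only).
labels = np.array([0, 1, 0, 1, 0])
scores = np.array([0.90, 0.80, 0.10, 0.07, 0.20])

auc = roc_auc_score(labels, scores)
# Confident mistakes: high score but label 0, or low score but label 1.
false_positives = np.where((labels == 0) & (scores > 0.5))[0]
false_negatives = np.where((labels == 1) & (scores < 0.5))[0]
print(round(auc, 3), false_positives.tolist(), false_negatives.tolist())
```

The mispredicted rows flagged this way are the ones inspected case by case below.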
- We found that for the impression `236945 U1589325 11/9/2019 12:34:44 PM`, the user's clicked news history is as follows:
| News ID | Category + Subcategory + Title |
|-----------|---------------------------------|
| N447001 | sports + football_nfl + NFL winners, losers: Cowboys need to rebound, Patriots keep winning |
| N120577 | sports + football_nfl + NFL insider on Mahomes knee injury: The Chiefs' QB will likely miss at least 3 weeks |
| N389720 | sports + football_nfl + Matt Moore takes the reins of the Chiefs offense in win over Denver |
| N461666 | sports + football_nfl + Chiefs sound like they're taking Patrick Mahomes injury in stride |
| N595037 | sports + football_nfl + Should the Chiefs trade for Marcus Mariota? |
| N311964 | sports + football_nfl + Arrowheadlines: Mecole Hardman is getting open |
| N323612 | sports + football_nfl + Arrowheadlines: Travis Kelce was 'disgusted' with how he played vs. Packers |
| N434030 | sports + football_nfl + Five things we learned in the Chiefs' 31-24 loss to the Packers |
| N537399 | sports + football_nfl + Rounding up the latest on Chiefs injuries ahead of Vikings week |
| N603451 | lifestyle + lifestylehomeandgarden + 17 photos that show the ugly truth of living in a tiny house |
| N1304 | sports + football_nfl + Chiefs final injury report vs. Vikings: Patrick Mahomes is questionable |
- When we examined the model's predictions for this impression, one news item stood out: `N873063 sports football_nfl Chiefs Market Movers heading into Sunday's Titans game`. The user appears interested in this topic, and our model predicted a very high score of `0.90`, but the actual label is `0`. The model correctly learned that this user likes news about sports, the NFL, and the Chiefs; in practice, however, users do not necessarily click even highly relevant articles, which illustrates a core challenge of recommendation systems.
- Besides predicted clicks that did not happen, we also found the opposite case: for `N441261 autos autossports How the Mid-Engine Corvette Helped Chevy Make a Better Small Block`, the model predicted `0.074972056`, but the correct label was `1`.
- Our analysis suggests this happened because the user's past clicks were all football-related. Although the mispredicted article is also sports-related, our training method has difficulty capturing such implicit correlations, since it only models the relationship between a user's past clicked articles and a candidate article. Capturing these correlations may require more complex prediction models, such as graph-based methods.