# Natural Language Processing with Disaster Tweets
[Competition Page](https://www.kaggle.com/competitions/nlp-getting-started/overview)
[GitHub Link](https://github.com/andrew76214/Kaggle-NLP_with_Disaster_Tweets)
## Table of Contents
- [Introduction](#introduction)
- [Dataset Overview](#dataset-overview)
- [Dataset EDA](#dataset-eda)
- [Data Preprocessing](#data-preprocessing)
- [Pipeline](#pipeline)
- [Models Implemented](#models-implemented)
- [Evaluation](#evaluation)
- [Experimental Record](#experimental-record)
- [Next Assignment](#next-assignment)
## Introduction
The **Kaggle-NLP_with_Disaster_Tweets** project aims to build a machine learning model that predicts which tweets are about real disasters and which ones aren't.
Key characteristics of the dataset include:
- id: a unique identifier for each tweet
- text: the text of the tweet
- location: the location the tweet was sent from (may be blank)
- keyword: a particular keyword from the tweet (may be blank)
- target: in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)
## Dataset Overview
### Files
- train.csv: the training set
- test.csv: the test set
- sample_submission.csv: a sample submission file in the correct format
## Dataset EDA
Text in the training set is analysed by computing the length and the word count of each tweet.
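Below is a minimal sketch of how these statistics can be computed and plotted, assuming the training set is loaded from `train.csv` with pandas and visualised with seaborn; the column names follow the dataset description above, and the bin count is an illustrative choice.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the training set (file name from the Dataset Overview above).
train = pd.read_csv("train.csv")

# Character length and word count per tweet.
train["text_len"] = train["text"].str.len()
train["word_count"] = train["text"].str.split().str.len()

# Distribution of tweet length, split by the target label.
sns.histplot(data=train, x="text_len", hue="target", bins=50)
plt.title("Tweet length by class")
plt.show()
```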
### Text Length and Word Count Analysis
#### Histplot

#### Scatterplot

#### Boxplot and Violin Plot

### Text Content Analysis
#### Word Cloud
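A minimal sketch of generating per-class word clouds with the `wordcloud` package, reusing the `train` DataFrame from the EDA sketch above; the figure sizes and colours are illustrative.

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# One word cloud per class; `train` comes from the EDA sketch above.
for label in (0, 1):
    text = " ".join(train.loc[train["target"] == label, "text"])
    wc = WordCloud(width=800, height=400, background_color="white").generate(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"Word cloud for target = {label}")
    plt.show()
```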

## Data Preprocessing
1. **Step 1: Text Cleaning**
   - Removed all URLs, HTML tags, and acronyms.
   - Converted all text to lowercase.
   - Removed punctuation.
2. **Step 2: Handling Missing Values**
   - Applied `dropna` to ensure data completeness.
3. **Step 3: Text Vectorization**
   - Applied `CountVectorizer` or `TfidfVectorizer` to convert the cleaned text into numerical features for the machine learning models (a sketch of all three steps follows this list).
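The following is a minimal sketch of Steps 1-3, assuming pandas and scikit-learn. The regular expressions, the `max_features` value, the choice of `TfidfVectorizer` over `CountVectorizer`, and the column passed to `dropna` are illustrative assumptions; acronym handling is omitted for brevity.

```python
import re
import string
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def clean_text(text: str) -> str:
    """Step 1: strip URLs and HTML tags, lowercase, and drop punctuation."""
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # URLs
    text = re.sub(r"<.*?>", "", text)                  # HTML tags
    text = text.lower()
    return text.translate(str.maketrans("", "", string.punctuation))

train = pd.read_csv("train.csv")
train = train.dropna(subset=["text"])  # Step 2 (which columns to drop on is an assumption)
train["clean_text"] = train["text"].apply(clean_text)

# Step 3: TF-IDF features for the classical models below.
vectorizer = TfidfVectorizer(max_features=10000)
X = vectorizer.fit_transform(train["clean_text"])
y = train["target"]
```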
## Pipeline
```mermaid
flowchart TD
A[Dataset] --> B[Handling Data]
B --> C[Data Preprocessing Completed]
C --> D[Deep Learning Models]
C --> E[Machine Learning Models 1]
C --> F[Machine Learning Models 2]
D --> G{Validation Dataset}
E --> G
F --> G
G --> H[Test Dataset]
H --> I[Final Evaluation]
```
## Models Implemented
### Machine Learning Models 1
- **Linear Models** (a baseline sketch follows this list):
  - Ridge Classifier
  - Logistic Regression
- **Decision Trees and Ensemble Models**:
  - Decision Tree
  - Random Forest
  - XGBoost
  - Voting Classifier
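As a quick illustration, here is a minimal sketch of cross-validating the linear baselines, assuming `X` and `y` come from the preprocessing sketch above; the fold count and scoring choice are illustrative.

```python
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import cross_val_score

# Cross-validate the linear baselines on the TF-IDF features.
for clf in (RidgeClassifier(), LogisticRegression(max_iter=1000)):
    scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
    print(f"{type(clf).__name__}: mean F1 = {scores.mean():.3f}")
```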
### Machine Learning Models 2
- **Decision Trees and Ensemble Models** (an ensemble sketch follows this list):
  - XGBoost
  - Random Forest
  - LightGBM
  - Stacking Classifier
  - Voting Classifier (soft and hard voting)
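A minimal sketch of the stacking and voting ensembles with scikit-learn, XGBoost, and LightGBM; the base learners and hyperparameters are illustrative assumptions, and `X`, `y` come from the preprocessing sketch above.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Base learners shared by both ensembles (hyperparameters are illustrative).
estimators = [
    ("xgb", XGBClassifier(n_estimators=200, eval_metric="logloss")),
    ("rf", RandomForestClassifier(n_estimators=200)),
    ("lgbm", LGBMClassifier(n_estimators=200)),
]

# Stacking: a logistic regression combines the base learners' predictions.
stack = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())

# Voting: "soft" averages predicted probabilities, "hard" takes a majority vote.
vote_soft = VotingClassifier(estimators=estimators, voting="soft")
vote_hard = VotingClassifier(estimators=estimators, voting="hard")

stack.fit(X, y)  # X, y from the preprocessing sketch above
```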
### Deep Learning Models
- RNN (a minimal Keras sketch follows below)
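Below is a minimal sketch of an RNN classifier in Keras; the vocabulary size, embedding dimension, and hidden size are illustrative assumptions, and the tweets are assumed to be already tokenized and padded to `MAX_LEN`.

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, MAX_LEN = 10000, 50  # illustrative; tune to the actual vocabulary

# Simple recurrent classifier: embedding -> RNN -> sigmoid output.
model = tf.keras.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, 64),
    layers.SimpleRNN(64),
    layers.Dense(1, activation="sigmoid"),  # 1 = real disaster, 0 = not
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```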
## Evaluation
We use the F1 score as our performance metric.
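F1 is the harmonic mean of precision and recall, which makes it more informative than accuracy on imbalanced classes:

$$
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$

In scikit-learn it is available as `sklearn.metrics.f1_score`.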

## Experimental Record

## Next Assignment
Text classification with:
- ML: Logistic Regression, Random Forest, SVM, and Gradient Boosting Machines such as XGBoost, LightGBM, and CatBoost
- DL: RNN, LSTM, and GRU

Objective: Classify IMDB text reviews for sentiment (positive or negative) using RNN, LSTM, and GRU architectures, and compare model performance.

What you should provide:
1. **Data Preprocessing**
   - Tokenize and pad the sequences so that each input sequence has the same length.
   - Use an embedding layer (e.g., word embeddings) to convert the text data into a format suitable for the models.
2. **Model Implementation (RNN, LSTM, and GRU)**
   - Implement RNN, LSTM, and GRU models for text classification (a minimal Keras sketch follows this list).
   - Adjust the hidden layer size and number of layers so that each model has a similar configuration, enabling a fair comparison.
   - Provide brief descriptions of each model's structure and the rationale for your design choices.
3. **Evaluation and Comparison**
   - Evaluate each model using accuracy, F1-score, and, if possible, additional relevant metrics such as precision and recall.
   - Conduct a deeper comparative analysis in your report, discussing the advantages and limitations of each model for text classification.
   - Explain which model performs best under these conditions and why.
4. **Report**
   - Summarize your model choices and discuss the differences in performance between RNN, LSTM, and GRU. Highlight which model you would recommend for similar text classification tasks and provide supporting reasons.
5. **Kaggle Submission**
   - Submit to the Kaggle competition and include a screenshot of the leaderboard showing your name, score, and ranking.
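A minimal sketch of items 1 and 2 in Keras, using the built-in IMDB dataset, which ships pre-tokenized as integer sequences; the vocabulary size, sequence length, hidden size, and training schedule are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

VOCAB_SIZE, MAX_LEN = 10000, 200  # illustrative sizes

# IMDB reviews arrive as integer word indices; pad to a fixed length.
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=VOCAB_SIZE)
x_train = pad_sequences(x_train, maxlen=MAX_LEN)
x_test = pad_sequences(x_test, maxlen=MAX_LEN)

def build_model(recurrent_layer):
    """Same embedding and output head for each variant, so the comparison stays fair."""
    return tf.keras.Sequential([
        layers.Embedding(VOCAB_SIZE, 64),
        recurrent_layer,
        layers.Dense(1, activation="sigmoid"),
    ])

# One model per architecture, with matching hidden sizes.
models = {
    "rnn": build_model(layers.SimpleRNN(64)),
    "lstm": build_model(layers.LSTM(64)),
    "gru": build_model(layers.GRU(64)),
}

for name, model in models.items():
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=3, batch_size=64,
              validation_split=0.2, verbose=0)
    loss, acc = model.evaluate(x_test, y_test, verbose=0)
    print(f"{name}: test accuracy = {acc:.3f}")
```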