# Assignment 1 Report: Text Categorization

Group 30: Yuang Yuan, Feiyang Sun

**Overview:** In this report, we benchmark three commonly used classifiers and three vectorizers for text categorization on the complete ```'20newsgroups'``` dataset, which contains nearly 20,000 newsgroup documents from 20 different newsgroups. We then compare every classifier/feature pair on Accuracy, Precision, Recall, and F1-score. Finally, we experiment with different parameter values for ```CountVectorizer``` and try to find the settings that convert the raw text into a structured form without losing its original meaning.

## Methods

We compare three classifiers: ```MultinomialNB``` from Naïve Bayes, and ```SGDClassifier``` and ```LogisticRegression``` from ```linear_model```. Intuitively, a text-categorization dataset like this one suits the SVM and Naïve Bayes models that are most commonly used for the task, but other classifiers such as Logistic Regression are also worth trying, since it produces probability estimates rather than an uncalibrated score.

#### Classifiers

- **```MultinomialNB``` from Naïve Bayes**

  A Naïve Bayes classifier in which the $p(f_i \mid c)$ terms follow a multinomial distribution.

- **```SGDClassifier``` from linear_model**

  A linear classifier trained with stochastic gradient descent: the gradient of the loss is estimated one sample at a time and the model is updated along the way with a specified learning rate. With the default hinge loss it is equivalent to a linear Support Vector Machine, which is how it appears in the results table below.

- **```LogisticRegression``` from linear_model**

  Also a linear model for classification. The probabilities describing the possible outcomes of a single trial are modeled with a logistic function.

#### Features

- **```lowercase```**: Lowercasing everything simplifies the vocabulary, especially for words at the beginning of a sentence, and we can assume that letter case does not significantly change the meaning of a word.
- **```stop_words```**: ```'english'``` should clearly outperform ```None```, since ```'20newsgroups'``` is in English.
- **```analyzer```**: Because the classifiers scale linearly with the number of features, ```'word'``` units are a better choice than ```'char'```.
- **```max_features```**: When set to an integer ```n```, only the ```n``` most frequent terms are kept. This narrows the feature space into a denser matrix, which seems preferable to ```None```.

## Results

- **Comparison of the 9 classifier/feature combinations on ```Precision```, ```Recall```, and ```F1-score```.**

| *Precision / Recall / F1-score*   | Counts           | Tf               | Tf-idf               |
| --------------------------------- | ---------------- | ---------------- | -------------------- |
| **Naïve Bayes**                   | 0.76/ 0.70/ 0.75 | 0.79/ 0.71/ 0.69 | 0.82/ 0.77/ 0.77     |
| **Linear Support Vector Machine** | 0.77/ 0.75/ 0.75 | 0.81/ 0.81/ 0.81 | **0.85/ 0.85/ 0.85** |
| **Logistic Regression**           | 0.79/ 0.79/ 0.79 | 0.73/ 0.73/ 0.72 | 0.83/ 0.83/ 0.83     |

When experimenting with ```LogisticRegression``` we encountered:

```
ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT
```

The lbfgs solver is better suited to relatively small problems; on a large set such as the complete ```'20newsgroups'```, this classifier cannot converge within the default 100 iterations.
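The exact training script is not included in this report; the following is a minimal sketch of how the 3×3 benchmark above could be reproduced with scikit-learn pipelines. The ```loss='hinge'```, ```max_iter=1000```, and default ```CountVectorizer``` settings are our assumptions here, not tuned choices.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Load the complete '20newsgroups' data set (train/test split is built in).
train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

classifiers = {
    'Naive Bayes': MultinomialNB(),
    'Linear SVM (SGD)': SGDClassifier(loss='hinge'),            # hinge loss -> linear SVM
    'Logistic Regression': LogisticRegression(max_iter=1000),   # raised to avoid ConvergenceWarning
}
features = {
    'Counts': [('vect', CountVectorizer())],
    'Tf': [('vect', CountVectorizer()), ('tf', TfidfTransformer(use_idf=False))],
    'Tf-idf': [('vect', CountVectorizer()), ('tfidf', TfidfTransformer())],
}

# Fit and evaluate every classifier/feature combination.
for f_name, steps in features.items():
    for c_name, clf in classifiers.items():
        model = Pipeline(steps + [('clf', clf)])
        model.fit(train.data, train.target)
        pred = model.predict(test.data)
        print(f_name, c_name)
        print(classification_report(test.target, pred, digits=2))
```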
---

- **Comparison of different parameter choices for ```CountVectorizer```**

As the comparison above shows, ```LogisticRegression``` performs best among the three classifiers when paired with plain ```CountVectorizer``` counts. We therefore ran a grid search over the four ```CountVectorizer``` parameters listed in the Methods section (a reproduction sketch is given after the Discussion). With ```lowercase=True```, ```stop_words='english'```, ```analyzer='word'```, and ```max_features=None```, the result beats the default configuration:

```
              precision    recall  f1-score
weighted avg       0.81      0.80      0.80
```

But with ```analyzer='char'``` and ```max_features=5000```, the model performs much worse:

```
              precision    recall  f1-score
weighted avg       0.27      0.23      0.20
```

Thus tuning the vectorizer parameters is also crucial to model performance.

## Discussion

```Linear Support Vector Machine``` with ```Tf-idf``` achieves the best ```Precision```, ```Recall```, and ```F1-score``` among all 9 combinations on the complete ```'20newsgroups'``` dataset, and an even higher score should be possible with proper parameter tuning. For relatively small datasets, we believe ```LogisticRegression``` with ```Tf-idf``` can also perform well. Training time is also worth measuring, although we did not discuss it in this report.
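For reference, the ```CountVectorizer``` parameter comparison from the Results section can be reproduced with a grid search along the lines of the sketch below. The candidate values shown illustrate the four parameters we varied; they are assumptions, not the exact grid behind the reported numbers.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

train = fetch_20newsgroups(subset='train')

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LogisticRegression(max_iter=1000)),  # assumed max_iter to avoid ConvergenceWarning
])

# Candidate values for the four CountVectorizer parameters discussed above
# (illustrative grid, not the exact one used for the reported scores).
param_grid = {
    'vect__lowercase': [True, False],
    'vect__stop_words': ['english', None],
    'vect__analyzer': ['word', 'char'],
    'vect__max_features': [None, 5000],
}

search = GridSearchCV(pipeline, param_grid, scoring='f1_weighted', cv=3, n_jobs=-1)
search.fit(train.data, train.target)
print(search.best_params_)
print(search.best_score_)
```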