---
tags: BigData-BS
title: Simple Sentiment Classification with Spark. Handout for Assignment 2.
---

# Simple Sentiment Classification with Spark

Sentiment Classification is, in most cases, a binary classification task. Usually the goal is to discern negative comments from neutral or positive ones. There are variations where several classes are defined, e.g. negative, neutral, and positive comments. As with most classification tasks, the idea is very simple, and yet it is not obvious how to implement it.

<p>
<img src="https://i.imgur.com/O8eLaxX.png" alt="drawing" width="350"/>
</p>

Below, several techniques for feature extraction are described for the Spark ML DataFrames API.

## Unigram Features

<p>
<img src="https://i.imgur.com/yBkrWse.png" alt="drawing" width="550"/>
</p>

The simplest feature extraction from text includes tokenization and vectorization. The classical approach is to split the string on whitespace and treat each word as a token. Then, a simple TF/IDF vectorization can be applied. Let's go through the feature extraction stages in more detail (a sketch of the whole unigram pipeline follows this list).

1. Preprocessing

   This stage involves some text normalization. People write in very different styles, and when it comes to understanding the meaning of a text snippet, it is easier to build a model when this variation is removed. Consider a simple example from the [Twitter Sentiment dataset](https://www.kaggle.com/c/twitter-sentiment-analysis2/data).

   `Juuuuuuuuuuuuuuuuussssst Chillin!!`

   The meaning and sentiment are easy to understand for a human, but computers can struggle with this example. What we ideally need to do is to extract the semantic meaning from this text, along with the style. One possible normalization is

   ```['just chillin', style=['exclamation', 'repetitive characters']]```

   alternatively, punctuation can be treated as a special token

   ```['just chillin !', style=['repetitive characters']]```

   When you work with Spark's DataFrame, this processing requires transforming one of the columns, which can be done with the help of *User Defined Functions* (**UDF**, [ref 1](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-udfs.html), [ref 2](https://medium.com/@mrpowers/spark-user-defined-functions-udfs-6c849e39443b)). Alternatively, you can implement your own pipeline stage. Remember that the final choice of features is up to you.

2. Tokenization

   Spark ML provides two pipeline stages for tokenization: [Tokenizer and RegexTokenizer](https://spark.apache.org/docs/latest/ml-features.html#tokenizer). The first converts the input string to lowercase and then splits it on whitespace; the second extracts tokens using the provided regex pattern, either by matching the pattern or by splitting on it (the `gaps` parameter).

3. Stop Words Removal

   Spark ML also provides a [pipeline stage for stop words removal](https://spark.apache.org/docs/latest/ml-features.html#stopwordsremover).

4. TF/IDF

   Spark ML provides several stages for document vectorization. Two of them are [TF/IDF](https://spark.apache.org/docs/latest/ml-features.html#tf-idf) and [CountVectorizer](https://spark.apache.org/docs/latest/ml-features.html#countvectorizer). Both return a numerical vector that can be treated as input features for a classifier. [Read more about the difference between these stages in the documentation.](https://spark.apache.org/docs/latest/ml-features.html#tf-idf) Be aware of the size of your vocabulary: it grows very fast with the amount of training data, and each new term adds parameters to your model. A reasonable way to constrain dimensionality is to keep only the N most frequent tokens; refer to the CountVectorizer documentation (the `vocabSize` parameter) for details.
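To make the stages above concrete, here is a minimal sketch of the unigram pipeline in Scala. The toy input rows, all column names (`text`, `clean_text`, `tokens`, `filtered_tokens`, `features`), and all parameter values are illustrative assumptions, not requirements of the assignment.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{CountVectorizer, IDF, RegexTokenizer, StopWordsRemover}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().appName("unigram-features").getOrCreate()

// Toy data with the schema (id, text, label); replace with the assignment dataset.
val raw = spark.createDataFrame(Seq(
  (0L, "Juuuuuuuuuuuuuuuuussssst Chillin!!", 1.0),
  (1L, "this was so boring and bad", 0.0)
)).toDF("id", "text", "label")

// 1. Preprocessing: a simple normalization UDF that lowercases the text and
//    collapses characters repeated three or more times ("Juuusst" -> "juust").
val normalize = udf { s: String =>
  s.toLowerCase.replaceAll("(.)\\1{2,}", "$1$1")
}
val cleaned = raw.withColumn("clean_text", normalize(col("text")))

// 2. Tokenization: split on runs of non-word characters.
val tokenizer = new RegexTokenizer()
  .setInputCol("clean_text").setOutputCol("tokens").setPattern("\\W+")

// 3. Stop words removal.
val remover = new StopWordsRemover()
  .setInputCol("tokens").setOutputCol("filtered_tokens")

// 4. Vectorization: term counts capped at the N most frequent tokens, then IDF.
val countVectorizer = new CountVectorizer()
  .setInputCol("filtered_tokens").setOutputCol("tf")
  .setVocabSize(10000) // constrain the vocabulary size
val idf = new IDF().setInputCol("tf").setOutputCol("features")

val featurePipeline = new Pipeline().setStages(Array(tokenizer, remover, countVectorizer, idf))
val features = featurePipeline.fit(cleaned).transform(cleaned)
features.select("filtered_tokens", "features").show(truncate = false)
```

The resulting `features` column is what would be passed to a classifier later on.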
## Including NGrams

<p>
<img src="https://i.imgur.com/NLgewhb.png" alt="drawing" width="550"/>
</p>

Sometimes, including N-grams can improve model accuracy. One possible architecture is shown in the figure above. Some auxiliary stages can be added, such as stop words removal (although preserving stop words can be beneficial for N-grams; this should be decided with cross-validation).

1. N-Gram Extraction and Vectorization

   Once again, Spark ML has the [necessary methods in place](https://spark.apache.org/docs/latest/ml-features.html#n-gram). Keep in mind that the number of possible N-grams grows very fast. Given the limitations on the available memory and the complexity of the model, you should limit the size of the N-gram vocabulary. Otherwise, your model can easily overfit.

2. Feature Assembler

   Spark ML's pipeline stage [VectorAssembler](https://spark.apache.org/docs/latest/ml-features.html#vectorassembler) helps to concatenate several vectors together. Thus, you can concatenate vectorized tokens and vectorized N-grams, creating a single feature vector that you are going to pass to the classifier (a combined sketch appears at the end of the Classification section below).

## Word Embeddings

<p>
<img src="https://i.imgur.com/0CBSj46.png" alt="drawing" width="550"/>
</p>

As an alternative to token-based vectorization, one can resort to word embeddings. Spark ML provides a [pipeline stage](https://spark.apache.org/docs/latest/ml-features.html#word2vec) for converting an array of tokens into a vector using embeddings. Leaving aside the details of how exactly embeddings work, consider how the pipeline stage operates:

1. Given the training data (arrays of tokens), the stage trains word embeddings for each token that occurs more than five times (the default value) in the dataset. The quality of the vectors depends partly on the number of iterations specified for the stage (usually, the more, the better).
2. After the vectors are trained, each entry (array of tokens) is combined into a single vector. Spark ML uses simple averaging over the word vectors (check the method `transform(dataset: Dataset[_])`).
3. Now each entry is represented by a vector of predefined dimensionality (the dimensionality is a parameter of the Word2Vec stage; refer to the documentation for details), and these vectors are passed directly to the classifier.

Since this feature extraction model involves training word vectors, it can take significantly longer to train.

## Classification

Spark ML implements many classification methods. Since we discussed how to extract numerical features, it is recommended to use methods that can be optimized with gradient descent. The simplest model is [LogisticRegression](https://spark.apache.org/docs/latest/ml-classification-regression.html#logistic-regression). The documentation has a thorough usage example. It can also be applied to the multi-class case.

In some cases, regularization can be useful for the model. Spark ML provides two types of regularization for Logistic Regression (LR). We recommend using ElasticNet, configured with `setElasticNetParam(0.8)`.

Always evaluate your model after training. Feed some data to the model and look at the output. For LR, the output is the `probability` and `prediction` columns of the DataFrame. If the probabilities are the same or almost the same for every input, it means that your model did not learn.
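To tie the feature-extraction and classification stages together, below is a hedged sketch of a combined pipeline: unigram and bigram count features, concatenated with `VectorAssembler` and fed into Logistic Regression with elastic-net regularization. It assumes the `cleaned` DataFrame (columns `clean_text`, `label`) from the unigram sketch above; the vocabulary sizes and regularization values are arbitrary starting points, not recommended settings.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{CountVectorizer, NGram, RegexTokenizer, StopWordsRemover, VectorAssembler}

val tokenizer = new RegexTokenizer()
  .setInputCol("clean_text").setOutputCol("tokens").setPattern("\\W+")
val remover = new StopWordsRemover()
  .setInputCol("tokens").setOutputCol("filtered_tokens")

// Unigram counts (vocabulary capped to limit the number of model parameters).
val unigramVectorizer = new CountVectorizer()
  .setInputCol("filtered_tokens").setOutputCol("unigram_tf").setVocabSize(10000)

// Bigram extraction and vectorization; note the much smaller vocabulary cap,
// since the number of possible N-grams grows very fast.
val bigrams = new NGram().setN(2)
  .setInputCol("filtered_tokens").setOutputCol("bigrams")
val bigramVectorizer = new CountVectorizer()
  .setInputCol("bigrams").setOutputCol("bigram_tf").setVocabSize(5000)

// Concatenate both feature vectors into a single "features" column.
val assembler = new VectorAssembler()
  .setInputCols(Array("unigram_tf", "bigram_tf")).setOutputCol("features")

// Logistic Regression with elastic-net regularization, as suggested above.
val lr = new LogisticRegression()
  .setFeaturesCol("features").setLabelCol("label")
  .setRegParam(0.01).setElasticNetParam(0.8)

val pipeline = new Pipeline().setStages(
  Array(tokenizer, remover, unigramVectorizer, bigrams, bigramVectorizer, assembler, lr))

val model = pipeline.fit(cleaned) // `cleaned` comes from the unigram sketch above
model.transform(cleaned).select("probability", "prediction").show()
```

To experiment with the word-embedding variant instead, the `CountVectorizer` stages, the `NGram` stage, and the assembler could be replaced by a single `Word2Vec` stage that maps the token column directly to a `features` column.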
The stopping criterion for training the model is reaching either the maximum number of iterations or the convergence threshold (set with the `setTol` method).
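For reference, a minimal illustration of how these stopping criteria (together with the regularization discussed above) might be configured; the concrete values are assumptions and should be tuned, for example with cross-validation.

```scala
import org.apache.spark.ml.classification.LogisticRegression

// Training stops as soon as either condition below is satisfied.
val lr = new LogisticRegression()
  .setMaxIter(200)         // upper bound on the number of iterations
  .setTol(1e-6)            // convergence tolerance for early stopping
  .setRegParam(0.01)       // overall regularization strength (assumed value)
  .setElasticNetParam(0.8) // elastic-net mixing parameter, as suggested above
```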