
Assignment №2. Introduction to Big Data. Stream Processing with Spark

Due Date: 8 October 2019, 23:55
Teams: no fewer than three and no more than four students. The team representative should send the list of names to your TA.
Rule for new teams: no ex-teammates.
Assignment Details: The assignment should be implemented in Scala. The cluster address is the same as before. Using this link, you can check the cluster status. In general, this task is not computationally intensive (but we recommend performing hyperparameter search and cross-validation on the cluster).
Stream Address: 10.90.138.32:8989
Report: Read Non-Technical Guide to Writing a Report to understand how to present your work in the best way.
Submission Format: report, link to GitHub, compiled binaries, and an example of the output. Store your full outputs and your trained model in your group's folder in HDFS.
Grading policy: The individual grade is based on each member's role in the team (contribution). A description of the personal contribution should be provided in the report.

In this assignment, you are going to stream (sort of) tweets and analyze their sentiment using Spark Streaming.

Pipeline

[Image: the processing pipeline]

The processing pipeline is shown in the image above. Let's discuss each of the steps.

Streaming Tweets

Spark Streaming creates an abstraction over streams that lets you process stream data almost identically to RDDs or DataFrames. The most common (and easiest to work with) stream format is the Discretized Stream (DStream). Alternatively, you can convert your stream to a Spark DataFrame and process it using SQL operations. All the details of how to create a stream object, the list of transformations implemented for streams, and ways to convert a DStream to a DataFrame are provided in the official programming guide. Read it through before beginning your work.

One of the common stream formats is a DStream containing lines. This format is almost identical to RDD[String]. The list of available transformations for DStreams can be found here. Each entry in the stream is a line, e.g.

thrilled about being at work this morning

This line is a single tweet.

After receiving this line, you can apply a series of transformations. For our task, we are mostly interested in classifying tweets by sentiment.
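As a starting point, the stream connection and a per-line transformation can be sketched as follows. This is a minimal example, not a complete solution: the app name, batch interval, and the `toLowerCase` step are our own illustrative choices; only the stream address comes from the assignment.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TweetStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TweetSentiment")
    val ssc  = new StreamingContext(conf, Seconds(10)) // 10-second micro-batches

    // DStream[String]: each entry is one tweet line
    val tweets = ssc.socketTextStream("10.90.138.32", 8989)

    // Apply RDD-like transformations per batch
    val lowered = tweets.map(_.toLowerCase)
    lowered.print() // show a few entries of each batch

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that nothing happens until `ssc.start()` is called; transformations on a DStream are only declared beforehand, just as with RDDs.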

Preprocessing Steps

Familiarize yourself with the format of the stream. We recommend that you turn to the Spark documentation. The preprocessing is as simple as writing several map functions; its goal is usually data normalization.
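For example, normalization can be written as a pure function and passed to a DStream `map`. This is only a sketch; which elements to strip (URLs, mentions, punctuation) is a design choice for your team, not a requirement.

```scala
object Preprocess {
  def normalize(tweet: String): String =
    tweet
      .toLowerCase                       // case normalization
      .replaceAll("https?://\\S+", "")   // drop URLs
      .replaceAll("@\\w+", "")           // drop user mentions
      .replaceAll("[^a-z0-9#\\s]", " ")  // keep letters, digits, hashtags
      .replaceAll("\\s+", " ")           // collapse whitespace
      .trim
}
```

It would then be applied on the stream as `tweets.map(Preprocess.normalize)`.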

Building a Sentiment Classifier

The goal of the sentiment classification task is to distinguish positive tweets from negative (or neutral) ones.

The most widespread application is product review analysis. Sentiment analysis is one of the most common tools for assessing how customers perceive a company. In most cases, the problem of sentiment analysis can be reduced to classification into two or three classes. For more information about sentiment classification, read this note.

Selecting Datasets

To train a sentiment classifier, one needs to collect a labeled dataset. There are several free datasets that you can use.

  1. Large Movie Review Dataset
    This is a dataset for binary sentiment classification (positive or negative) containing a set of 25,000 highly polar movie reviews for training, and 25,000 for testing.
  2. Sentiment Tree Bank
    Contains sentences with continuous sentiment measure: 5 classes can be distinguished by mapping the positivity probability using the following cut-offs:
    [0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0]
    for very negative, negative, neutral, positive, very positive, respectively.
  3. UCI Sentiment Labelled Sentences
    Contains binary classified reviews from imdb.com, amazon.com, and yelp.com.
  4. Twitter Sentiment
    Contains messages from Twitter labeled with one of three classes: neutral, positive, negative.

You can use one of these datasets to train your model, or even merge them to improve the results. But remember that the stream comes from Twitter.
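If you use the Sentiment Tree Bank, the continuous positivity probability has to be mapped to discrete classes using the cut-offs listed above. A possible helper (the function name and label strings are our own):

```scala
object Cutoffs {
  // Map a positivity probability in [0, 1] to one of five classes,
  // using the cut-offs [0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0]
  def sentimentClass(p: Double): String =
    if (p <= 0.2) "very negative"
    else if (p <= 0.4) "negative"
    else if (p <= 0.6) "neutral"
    else if (p <= 0.8) "positive"
    else "very positive"
}
```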

Feature Extraction

To get good results in the classification task, you need to provide the algorithm with a suitable set of features. Read the documentation about the methods available in Spark for feature extraction and feature transformation. Some useful methods are TF-IDF, CountVectorizer, FeatureHasher, Word2Vec, Tokenizer, StopWordsRemover, n-gram, and others.
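One possible chain of these transformers, sketched with Spark ML (the column names and the feature dimension here are our own choices, and there are many other valid combinations):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, IDF, StopWordsRemover, Tokenizer}

// "text" -> tokens -> tokens without stop words -> hashed term counts -> TF-IDF
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val remover   = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
val tf        = new HashingTF().setInputCol("filtered").setOutputCol("tf").setNumFeatures(1 << 16)
val idf       = new IDF().setInputCol("tf").setOutputCol("features")

val featurePipeline = new Pipeline().setStages(Array(tokenizer, remover, tf, idf))
```

Fitting this pipeline on your training DataFrame produces a `PipelineModel` that can later be applied to batches coming from the stream.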

Selecting Classifier Model

There are several ways to approach the classification problem. In this homework, you can either use standard Spark methods or implement your own classifier. A multitude of algorithms is available in Spark; all you need to do is select the ones useful for your task. Alternatively, you can implement your own algorithm. You need to try at least two of the standard algorithms, or implement one of them on your own.
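For instance, two standard classifiers could be compared like this. The sketch assumes `train` and `test` are DataFrames with "features" and "label" columns (e.g. produced by a feature pipeline); the choice of LogisticRegression and NaiveBayes is only an example.

```scala
import org.apache.spark.ml.classification.{LogisticRegression, NaiveBayes}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("f1")

val lr = new LogisticRegression().setMaxIter(100)
val nb = new NaiveBayes()

// Fit each model on the training set and score it on the held-out set
val lrF1 = evaluator.evaluate(lr.fit(train).transform(test))
val nbF1 = evaluator.evaluate(nb.fit(train).transform(test))
println(s"LogisticRegression F1 = $lrF1, NaiveBayes F1 = $nbF1")
```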


You need to train and compare the models using their F1 scores. Then you need to run the models on the stream and write output files containing the predicted sentiment classes.

In the report, you should elaborate on the results: provide outputs, provide scores (for both training and stream data), and report which model achieves the best performance.

Remember that you can save a trained model and load it later.
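A sketch of saving and loading, assuming `model` is a fitted `LogisticRegressionModel`; the HDFS path is an example placeholder, not a mandated location.

```scala
import org.apache.spark.ml.classification.LogisticRegressionModel

// Persist the fitted model to your group's HDFS folder...
model.save("hdfs:///user/your_group/models/lr-sentiment")

// ...and restore it later, e.g. in the streaming job
val restored = LogisticRegressionModel.load("hdfs:///user/your_group/models/lr-sentiment")
```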

Note: Getting good scores on the classification task is nice, but it is not the goal of this assignment. Implementing your own algorithm that performs poorly on test data can have greater value than getting a good score by calling model.fit(data).

Preparing Results for Output

Monitor the Twitter stream throughout a day, collect tweets, classify them, and store the results. After that, evaluate your classifier's correctness manually: look through the collected data, take a subsample of the stream, and estimate the actual classification quality. Provide estimates of precision and recall for your classifier.
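Given the counts from the hand-checked subsample, the estimates reduce to the standard formulas. A hypothetical helper (names are our own) for a single class treated as "positive":

```scala
object ManualEval {
  // tp = correctly predicted positives, fp = wrongly predicted positives,
  // fn = positives the model missed
  def precision(tp: Int, fp: Int): Double = tp.toDouble / (tp + fp)
  def recall(tp: Int, fn: Int): Double    = tp.toDouble / (tp + fn)
}
```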

The output files should contain a log of the incoming data and the decisions made, in the following CSV format (one file for each model):

Time, Input String 1, class
Time, Input String 2, class

You need to store all output files in your group's folder on HDFS. Describe the data formats used in your log files in your report.
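One way to build such a log line, sketched as a pure helper; the timestamp format and CSV quoting convention shown here are our own choices and should be documented in your report whatever you pick.

```scala
object LogLine {
  def format(time: String, tweet: String, cls: String): String = {
    // Quote the tweet so commas inside it do not break the CSV columns
    val quoted = "\"" + tweet.replace("\"", "\"\"") + "\""
    s"$time,$quoted,$cls"
  }
}
```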

Writing the Report

The results of your teamwork should be compiled into a report. A helpful guide on how to write a report can be found here. The report should describe what you did, how you did it, the description and motivation for your original solution, and the results.

The minimal report should contain:

  • Description of the team and responsibilities of team members
  • Description of the problem (briefly, what you need to implement)
  • Solution for the problem, including your preprocessing architecture, classification architecture, and selection of hyperparameters (the better the description, the more points you get). Don't forget to describe the results for both selected models.
  • What can be improved in your implementation? Be critical of your work.
  • Description of the testing procedure for classification quality. How did you collect data for testing? What was the testing procedure?
  • An example of the output of your program, as well as a description of your output files

The report should be understandable even to a person who has not read this document.

Troubleshooting

Classpath

Some signed libraries cannot be included in the jar container. Try specifying the classpath for spark-submit.