Due Date: 8 October 2019, 23:55
Teams: no fewer than three and no more than four students. The team representative should send the list of names to your TA.
Rule for forming new teams: no ex-teammates.
Assignment Details: The assignment should be implemented in Scala. The cluster address is the same as before. Using this link, you can check the cluster status. In general, this task is not computationally intensive, but we recommend performing hyperparameter search and cross-validation on the cluster.
Stream Address: 10.90.138.32:8989
Report: Read Non-Technical Guide to Writing a Report to understand how to present your work in the best way.
Submission Format: report, link to GitHub, compiled binaries, and an example of the output. Store your full outputs and your trained model in your group's folder in HDFS.
Grading Policy: The individual grade is based on each member's role in the team (contribution). A description of the personal contribution should be provided in the report.
In this assignment, you are going to stream (sort of) tweets and analyze their sentiment using Spark Streaming.
The processing pipeline is shown in the image above. Let's discuss each of the steps.
Spark Streaming creates an abstraction over streams that lets you process stream data almost identically to RDDs or DataFrames. The most common (and easiest to work with) stream format is the Discretized Stream (DStream). Alternatively, you can convert your stream to a Spark DataFrame and process it using SQL operations. All the details of how to create a stream object, the list of transformations implemented for streams, and the ways to convert a DStream to a DataFrame are provided in the official programming guide. Read it through before beginning your work.
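As a starting point, a minimal sketch of connecting to the tweet stream with `StreamingContext.socketTextStream` might look as follows. The object name, app name, and batch interval are assumptions, not requirements, and running it requires a Spark installation with access to the stream address.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch (not the required solution): connect to the tweet stream
// at 10.90.138.32:8989 and print a few incoming lines per batch.
object TweetStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TweetSentiment")
    // 10-second batches; the interval is a tunable assumption.
    val ssc = new StreamingContext(conf, Seconds(10))

    // Each entry of this DStream[String] is one tweet line.
    val tweets = ssc.socketTextStream("10.90.138.32", 8989)
    tweets.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

From here, classification logic can be attached to `tweets` with the usual `map`/`foreachRDD` transformations described in the programming guide.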
One of the common stream formats is a DStream containing lines. This format is almost identical to RDD[String]. The list of available transformations for DStream can be found here. Each entry in the stream is a line, e.g.
This line is a single tweet.
After receiving this line, you can apply a series of transformations. For our task, we are mostly interested in classifying tweets by sentiment.
Familiarize yourself with the format of the stream; we recommend turning to the Spark documentation. The preprocessing is as simple as writing several map functions; its goal is usually data normalization.
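For example, normalization can be written as a plain string function and then applied with `map` on the stream. The exact cleaning steps below (lowercasing, stripping URLs and mentions) are an illustrative assumption, not a prescribed recipe:

```scala
// Simple tweet normalization helper; can be applied with
// tweets.map(Preprocess.normalize) on a DStream[String] or RDD[String].
object Preprocess {
  // Lowercase, strip URLs and @mentions, drop punctuation, collapse whitespace.
  def normalize(tweet: String): String =
    tweet.toLowerCase
      .replaceAll("https?://\\S+", " ")   // drop links
      .replaceAll("@\\w+", " ")           // drop @mentions
      .replaceAll("[^a-z0-9# ]", " ")     // keep letters, digits, hashtags
      .replaceAll("\\s+", " ")
      .trim
}
```

Keeping such functions pure (no Spark dependencies) makes them easy to unit-test and to reuse identically at training time and on the stream.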
The goal of the sentiment classification task is to distinguish positive from negative (or neutral) tweets.
The most widespread application is product review analysis: sentiment analysis is one of the most common tools for assessing how a company is perceived by its customers. In most cases, the problem of sentiment analysis can be reduced to classification into two or three classes. For more information about sentiment classification, read this note.
To train a sentiment classifier, one needs to collect a labeled dataset. There are several free datasets that you can use.
You can use one of these datasets to train your model, or even merge several of them to improve the results. But remember that the stream comes from Twitter.
To get good results in the classification task, you need to provide the algorithm with a suitable set of features. Read the documentation about the methods available in Spark for feature extraction and feature transformation. Some useful methods are TF-IDF, CountVectorizer, FeatureHasher, Word2Vec, Tokenizer, StopWordsRemover, NGram, and others.
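A TF-IDF feature pipeline built from these transformers might be sketched as below. The column names and the feature dimensionality are assumptions; this is one possible combination, not the required one. Running it requires the Spark ML libraries.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, IDF, StopWordsRemover, Tokenizer}

// Sketch of a TF-IDF feature pipeline over a DataFrame with a "text" column.
val tokenizer = new Tokenizer()
  .setInputCol("text").setOutputCol("words")
val remover = new StopWordsRemover()
  .setInputCol("words").setOutputCol("filtered")
val tf = new HashingTF()
  .setInputCol("filtered").setOutputCol("rawFeatures")
  .setNumFeatures(1 << 16)                 // dimensionality is a tunable assumption
val idf = new IDF()
  .setInputCol("rawFeatures").setOutputCol("features")

val featurePipeline = new Pipeline().setStages(Array(tokenizer, remover, tf, idf))
```

Swapping HashingTF/IDF for CountVectorizer or Word2Vec only changes the middle stages, which makes it cheap to compare feature sets during cross-validation.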
There are several ways to approach the classification problem. In this homework, you can either use standard Spark methods or implement your own classifier. A multitude of algorithms is available in Spark; all you need to do is select the ones useful for your task. You need to try at least two of the standard algorithms, or implement one of them on your own.
You need to train and compare the models using F1 scores. Then you need to run the models on the stream and write output files with the predicted sentiment classes.
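Comparing two standard classifiers by F1 might look like the following sketch. It assumes a DataFrame `labeled` with "features" and "label" columns has already been prepared (e.g. by a feature pipeline); the split ratio and seed are arbitrary choices.

```scala
import org.apache.spark.ml.classification.{LogisticRegression, NaiveBayes}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Sketch: `labeled` is an assumed DataFrame with "features" and "label" columns.
val Array(train, test) = labeled.randomSplit(Array(0.8, 0.2), seed = 42)

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("f1")

val lrModel = new LogisticRegression().fit(train)
val nbModel = new NaiveBayes().fit(train)

val lrF1 = evaluator.evaluate(lrModel.transform(test))
val nbF1 = evaluator.evaluate(nbModel.transform(test))
```

The same evaluator can be reused inside CrossValidator for the hyperparameter search recommended above.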
In the report, elaborate on the results: provide outputs, report scores (for both training and stream data), and state which model achieves the best performance.
Remember that you can save a trained model and load it later.
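Saving and reloading might be sketched as below; the HDFS path and the group name in it are placeholder examples.

```scala
import org.apache.spark.ml.PipelineModel

// Sketch: persist a fitted PipelineModel to HDFS and load it back later.
// The path (including the group folder) is a placeholder example.
fittedModel.write.overwrite().save("hdfs:///user/groupX/models/lr-tfidf")
val restored = PipelineModel.load("hdfs:///user/groupX/models/lr-tfidf")
```

This lets you train once on the cluster and then run the saved model against the stream in a separate job.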
Note: Getting good scores on the classification task is nice, but it is not the goal of this assignment. Implementing your own algorithm that performs poorly on test data can have greater value than getting a good score by calling model.fit(data).
Monitor the Twitter stream throughout a day, collect tweets, classify them, and store the results. After that, evaluate your classifier's correctness manually: look through the collected data and verify the classifier's decisions (take a subsample of the Twitter stream and estimate the actual classification quality). Provide estimates of precision and recall for your classifier.
The output files should contain a log of incoming data and the decisions made, in CSV format (one file for each model).
You need to store all output files in your group's folder on HDFS. Describe the data formats used in your log files in your report.
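The exact CSV columns are yours to choose and document. As an assumption for illustration, a row might hold a timestamp, the tweet text, and the predicted class; since tweets can contain commas and quotes, the fields should be escaped:

```scala
// Hypothetical log row layout: (timestamp, tweet text, predicted class).
// Fields are quoted so commas and quotes inside tweets do not break the CSV.
object CsvLog {
  def escape(field: String): String =
    "\"" + field.replace("\"", "\"\"") + "\""

  def row(timestamp: String, tweet: String, label: String): String =
    Seq(timestamp, tweet, label).map(escape).mkString(",")
}
```

Whatever layout you pick, keep it identical across both models' files so the outputs are directly comparable.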
The results of your teamwork should be compiled into a report. A helpful guide on how to write a report can be found here. The report should contain information about what you did, how you did it, a description of and motivation for your original solution, and the results.
The minimal report should contain:
The report should be understandable even to a person who has not read this document.
Some signed libraries cannot be included in the jar container; try specifying the classpath for spark-submit instead.
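One way to do this is to pass such jars separately via spark-submit's --jars and --driver-class-path options instead of shading them into the application jar. The class name, paths, and jar names below are placeholder examples:

```shell
# Sketch: supply a problematic (e.g. signed) library on the classpath
# rather than packing it into the fat jar. Paths and names are examples.
spark-submit \
  --class example.TweetSentiment \
  --master yarn \
  --jars /path/to/signed-lib.jar \
  --driver-class-path /path/to/signed-lib.jar \
  target/assignment.jar
```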