Spark, Recommender System

The goal of this task is to finalize the implementation of a movie recommendation system. In the process, you will get more experience with programming in Scala and working with RDDs.

Dataset: Modified MovieLens dataset (Mirror), also available in HDFS hdfs://namenode:9000/movielens-mod
Project Template: Download the project and try compiling it. Compilation should complete without errors. (Mirror)
Extra files: Download list of movies for collecting user preferences

Inspecting and Running The Template

Inspect the project template content.

spark-recommendation
├── build.sbt
├── for-grading.tsv
└── src
    ├── main
    │   └── scala
    │       ├── Grader.scala
    │       └── MovieLensALS.scala
    └── test
        └── scala

build.sbt contains build configuration and dependencies. Make sure that the correct spark and scala (read the documentation or run spark-submit --version) versions are specified here
for-grading.tsv this file is used for collecting user ratings
Grader.scala for reading user preferences
MovieLensALS.scala contains main class for training recommendation system

Try to compile this project with sbt or create an artifact with your IDE. When using IDE, make sure all dependencies are in MANIFEST. More information about this in the Appendix.

Try to run application with

spark-submit --master yarn spark-recommendation.jar hdfs:///movielens-mod -user false

Data Description

The dataset includes two files: movies2.csv and ratings2.csv. The first contains the list of movie titles in the format

movieId,title,genres

for example

72,Kicking and Screaming (1995),Comedy|Drama
73,"Misérables, Les (1995)",Drama|War

The ratings data are stored in the second file and have the format

userId,movieId,rating,timestamp

for example

1,2253,3.5,1112486122
1,2288,4.0,1094786077

The data is also available in HDFS at /movielens-mod.

General description

Your goal is to implement a movie recommendation system. The system takes data in the format Rating(userId, movieId, rating), and tries to learn a model of a user, based on graded movies. You need to format the data as needed using RDD transformations. For more information about class Rating read here.

In Spark, recommendation system can be built with the ALS (Alternating Least Squares) class that has simple train() and predict() interfaces.

The predictor takes the data in format (userId, movieId) and returns Rating(userId, movieId, rating). To make the results human-readable, we need an additional structure that stores the mapping of movieId to movieTitle

To evaluate the benefit of this system, we compare the result with a baseline. The baseline for this task is the average rating for a given movie. We need a data structure that stores the mapping of movieId to averageRating.

Read the code. Understand how the program works.

Complete the code

Implement parseTitle and rmse(test: RDD[Rating], prediction: scala.collection.Map[Int, Double])

spark-submit --master yarn spark-recommendation.jar hdfs:///movielens-mod -user false

Run with Your Movie Preferences

spark/bin/spark-submit --master yarn spark-recommendation.jar hdfs:///movielens-mod -user true

executing this command will allow you to specify your own movie preferences and get recommendations after the model has finished training.

The dataset is not very large. You can run the application locally if you have sufficient resources.

Post-processing of Recommendations

Currently, the list of recommendations can include the movies that the user has already graded. Modify the program such that these movies are filtered. You can use either set difference or RDD's filter methods.

Load Your Movie Preferences

Your movie preferences were saved into the file user_rating.tsv. Write a method or class that will load this data instead of surveying the user every time.

Change the Rank of the Model

Try different model ranks. A higher rank leads to a more complex model. Compare the difference between baseline and prediction for a model with a higher rank. Evaluate the quality of the proposed movies subjectively.

Extra Filtering

You might have noticed that the recommendations are not very great. The reason is that the model cannot compute confident representations for infrequent movies. Modify the program in such a way that low-frequency items (movies with less than 50-100 ratings) are excluded from the training and recommendation process.

Reference

Help with Scala
Scala Book
RDD partitioning 1
Stack overflow during execution 1
History Server 1

Self-Check Questions

What does sc.parallelize do?
What does collectAsMap do?
What is the difference between foreach and map?
What is pattern matching in Scala?

Troubleshooting

Sbt compilation error

There are issues compiling scala libraries with Java 9 and above. If you see an error during compilation, try downgrading to Java 8.

Missing Winutils

There is a known issue with Spark on Windows, that it requires winutils.exe. The solution can be found here.

Memory Issues

If you experience issues with memory, try to partition dataset into larger chunks

val data = data.repartition(number_of_partitions)

Creating more partitions is equivalent to reducing the size of input data per partition, and reducing memory requirements for map tasks.

Appendix

Including Dependencies

When you configure artifact with IDE, you need to specify the dependencies. There is absolutely no need to include all the dependencies in the jar container, and you should merely add dependencies to the manifest.