The goal of this task is to finalize the implementation of a movie recommendation system. In the process, you will get more experience with programming in Scala and working with RDDs.
Dataset: Modified MovieLens dataset (Mirror), also available in HDFS hdfs://namenode:9000/movielens-mod
Project Template: Download the project and try compiling it. Compilation should complete without errors. (Mirror)
Extra files: Download list of movies for collecting user preferences
Inspect the project template content.
build.sbt
contains build configuration and dependencies. Make sure that the correct spark and scala (read the documentation or run spark-submit --version
) versions are specified herefor-grading.tsv
this file is used for collecting user ratingsGrader.scala
for reading user preferencesMovieLensALS.scala
contains main class for training recommendation systemTry to compile this project with sbt or create an artifact with your IDE. When using IDE, make sure all dependencies are in MANIFEST. More information about this in the Appendix.
Try to run application with
The dataset includes two files: movies2.csv
and ratings2.csv
. The first contains the list of movie titles in the format
for example
The ratings data are stored in the second file and have the format
for example
The data is also available in HDFS at /movielens-mod
.
Your goal is to implement a movie recommendation system. The system takes data in the format Rating(userId, movieId, rating)
, and tries to learn a model of a user, based on graded movies. You need to format the data as needed using RDD transformations. For more information about class Rating
read here.
In Spark, recommendation system can be built with the ALS
(Alternating Least Squares) class that has simple train()
and predict()
interfaces.
The predictor takes the data in format (userId, movieId)
and returns Rating(userId, movieId, rating)
. To make the results human-readable, we need an additional structure that stores the mapping of movieId
to movieTitle
To evaluate the benefit of this system, we compare the result with a baseline. The baseline for this task is the average rating for a given movie. We need a data structure that stores the mapping of movieId
to averageRating
.
Read the code. Understand how the program works.
Implement parseTitle
and rmse(test: RDD[Rating], prediction: scala.collection.Map[Int, Double])
executing this command will allow you to specify your own movie preferences and get recommendations after the model has finished training.
The dataset is not very large. You can run the application locally if you have sufficient resources.
Currently, the list of recommendations can include the movies that the user has already graded. Modify the program such that these movies are filtered. You can use either set difference or RDD's filter methods.
Your movie preferences were saved into the file user_rating.tsv
. Write a method or class that will load this data instead of surveying the user every time.
Try different model ranks. A higher rank leads to a more complex model. Compare the difference between baseline and prediction for a model with a higher rank. Evaluate the quality of the proposed movies subjectively.
You might have noticed that the recommendations are not very great. The reason is that the model cannot compute confident representations for infrequent movies. Modify the program in such a way that low-frequency items (movies with less than 50-100 ratings) are excluded from the training and recommendation process.
sc.parallelize
do?collectAsMap
do?foreach
and map
?There are issues compiling scala libraries with Java 9 and above. If you see an error during compilation, try downgrading to Java 8.
There is a known issue with Spark on Windows, that it requires winutils.exe. The solution can be found here.
If you experience issues with memory, try to partition dataset into larger chunks
Creating more partitions is equivalent to reducing the size of input data per partition, and reducing memory requirements for map tasks.
When you configure artifact with IDE, you need to specify the dependencies. There is absolutely no need to include all the dependencies in the jar container, and you should merely add dependencies to the manifest.