# Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches
### For the course CS4240 Deep Learning at TU Delft, we try to reproduce and expand on the findings of the paper "Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches" by Maurizio Ferrari Dacrema et al. We present our findings in this blog.
### *Group 45* - Albert Solà Roca (5458676, A.solaroca@student.tudelft.nl), Theodoros Veneti (5527805, T.Veneti@student.tudelft.nl), Ronan Hochart (5598621, r.g.s.g.hochart@student.tudelft.nl)
## Introduction
Research in Computer Science, especially in the areas of Deep Learning and Recommender Systems, is clearly suffering from a reproducibility crisis. Several recent publications point out problems in today's research practice in applied machine learning. Given the increased interest in machine learning in general, the corresponding number of recent research publications, and the success of deep learning techniques in other fields like vision or language processing, one could expect that these works also brought substantial progress to the field of recommender systems. However, indications exist in other application areas of machine learning that the achieved progress, measured in terms of accuracy improvements over existing models, is not always as strong as expected.
The article "Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches" tries to answer two questions. First of all, to what extent is recent research in the area of recommender systems reproducible (within reasonable effort)? And, to what extent are recent algorithms actually lead to better performance results when compared to relatively simple, but well-tuned, baseline methods?
This project is, therefore, a reproducibility project about a reproducibility article. Our main aim is to evaluate the results obtained by the authors and to expand on some of the papers they were not able to reproduce. This blog post first explains the methodology of the original article, then our own methods to reproduce its results and, finally, the results we obtained.
## Article Methodology
The article selected 18 relevant long papers from different conferences such as KDD, SIGIR, TheWebConf (WWW) and RecSys, each presenting a deep learning-based technique and focusing on the top-n recommendation problem. Because of this interest in the top-n recommendation problem, they also only considered papers reporting ranking metrics such as Precision, Recall or MAP.
The article then proceeds to reproduce (and replicate) some of these 18 papers by obtaining the code and the data for all relevant papers from the authors. A paper is considered reproducible if a working version of the source code is available, or the code only has to be modified in minimal ways to work correctly, and at least one dataset used in the original paper is available. A further requirement is that either the originally used train-test splits are publicly available or that they can be reconstructed based on the information in the paper. Under these criteria, the article finds that 7 of the 18 evaluated papers (39%) are reproducible.
The other question the paper tries to answer is to what extent recent algorithms actually lead to better performance than simple baselines. For this, the article compares against the following baseline methods (a minimal sketch of two of them follows the list):
* TopPopular
* ItemKNN
* UserKNN
* ItemKNN-CBF
* ItemKNN-CFCBF
* P<sup>3</sup>$\alpha$
* RP<sup>3</sup>$\beta$
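To give an idea of how simple these baselines are, here is a minimal, illustrative sketch of TopPopular and ItemKNN scoring on a toy interaction matrix. This is our own reduction; the tuned implementations actually used are in the authors' repository:
```
import numpy as np

# Toy implicit-feedback matrix: rows = users, columns = items (1 = interaction).
urm = np.array([[1, 0, 1, 0],
                [1, 1, 0, 0],
                [0, 1, 1, 1]], dtype=float)

# TopPopular: rank items by how many users interacted with them.
top_popular_scores = urm.sum(axis=0)

# ItemKNN: cosine similarity between item columns; a user's score for an item
# is the summed similarity to the items the user already interacted with.
norms = np.linalg.norm(urm, axis=0, keepdims=True)
item_sim = (urm.T @ urm) / (norms.T @ norms + 1e-9)
np.fill_diagonal(item_sim, 0.0)   # an item should not recommend itself
user_scores = urm @ item_sim      # one score per (user, item) pair
user_scores[urm > 0] = -np.inf    # mask items the user has already seen

print("TopPopular ranking:", np.argsort(-top_popular_scores))
print("Recommendations for user 0:", np.argsort(-user_scores[0]))
```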
### Deep Learning Methods
Given the number of deep learning methods reproduced, we only give a short description of each. The article was able to reproduce the following deep learning methods:
#### Collaborative Memory Networks (CMN) [1]
The CMN method was presented at SIGIR ’18 and combines memory networks and neural attention mechanisms with latent factor and neighborhood models.
#### Metapath based Context for RECommendation (MCRec) [2]
MCRec, presented at KDD ’18, is a meta-path based model that leverages auxiliary information like movie genres for top-n recommendation.
#### Collaborative Variational Autoencoder (CVAE) [3]
The CVAE method, presented at KDD ’18, is a hybrid technique that considers both content as well as rating information.
#### Collaborative Deep Learning (CDL) [4]
CDL, presented at KDD '15, is a probabilistic feed-forward model for joint learning of a stacked denoising autoencoder (SDAE) and collaborative filtering.
#### Neural Collaborative Filtering (NCF) [5]
Neural network-based Collaborative Filtering, presented at WWW ’17, generalizes Matrix Factorization by replacing the inner product with a neural architecture that can learn an arbitrary function from the data.
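Since NeuMF, which we reproduce later, builds on this idea, a minimal Keras sketch may help to illustrate it. This is our own illustrative reduction with made-up sizes, not the authors' code: the element-wise product of the embeddings (the GMF branch) is complemented by an MLP over their concatenation, and both are fused into the final prediction.
```
import tensorflow as tf

N_USERS, N_ITEMS, DIM = 1000, 2000, 16   # made-up sizes for illustration

user_in = tf.keras.Input(shape=(1,), name="user_id")
item_in = tf.keras.Input(shape=(1,), name="item_id")

# GMF branch: element-wise product of embeddings (a generalized inner product).
u_gmf = tf.keras.layers.Flatten()(tf.keras.layers.Embedding(N_USERS, DIM)(user_in))
i_gmf = tf.keras.layers.Flatten()(tf.keras.layers.Embedding(N_ITEMS, DIM)(item_in))
gmf = tf.keras.layers.Multiply()([u_gmf, i_gmf])

# MLP branch: the neural architecture that replaces the fixed inner product.
u_mlp = tf.keras.layers.Flatten()(tf.keras.layers.Embedding(N_USERS, DIM)(user_in))
i_mlp = tf.keras.layers.Flatten()(tf.keras.layers.Embedding(N_ITEMS, DIM)(item_in))
mlp = tf.keras.layers.Concatenate()([u_mlp, i_mlp])
mlp = tf.keras.layers.Dense(32, activation="relu")(mlp)
mlp = tf.keras.layers.Dense(16, activation="relu")(mlp)

# NeuMF fuses both branches before predicting the interaction probability.
out = tf.keras.layers.Dense(1, activation="sigmoid")(
    tf.keras.layers.Concatenate()([gmf, mlp]))

model = tf.keras.Model([user_in, item_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```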
#### Spectral Collaborative Filtering (SpectralCF) [6]
SpectralCF, presented at RecSys ’18, was designed to specifically address the cold-start problem and is based on concepts of Spectral Graph Theory.
#### Variational Autoencoders for Collaborative Filtering (Mult-VAE) [7]
Mult-VAE, presented at WWW '18, is a collaborative filtering method for implicit feedback based on variational autoencoders.
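Concretely, for a user's bag-of-items vector $x_u$, Mult-VAE maximizes an ELBO with a multinomial likelihood, where the KL term is weighted by an annealing factor $\beta$ (notation as in [7]):
$$
\mathcal{L}_\beta(x_u; \theta, \phi) = \mathbb{E}_{q_\phi(z_u \mid x_u)}\left[\log p_\theta(x_u \mid z_u)\right] - \beta \cdot \mathrm{KL}\left(q_\phi(z_u \mid x_u) \,\|\, p(z_u)\right)
$$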
## Our Methodology
### Reproducibility
Ferrari Dacrema et al. provide the following [Github](https://github.com/MaurizioFD/RecSys2019_DeepLearning_Evaluation) repository with all the code for the methods in the article. Installing the dependencies for GPU use by following the steps in the repository already proved somewhat problematic, as some libraries, such as tensorflow and keras, require specific versions to be compatible with each other. To add to this problem, our Nvidia GeForce RTX 3060 and 3080 cards turned out to be incompatible with these somewhat deprecated tensorflow and keras versions, which meant we could only train and execute the models on an Nvidia GeForce GTX 1080. That being said, we could execute most of the algorithms by running the following command:
```
python run_A_X_B.py -a True -b True -p True
```
Here A is the conference name (KDD, SIGIR, WWW, RecSys), X is the year of the conference and B is the algorithm name. The flags -a, -b and -p respectively train the deep learning algorithm with the original hyperparameters, run the baseline hyperparameter search, and generate the LaTeX tables for the experiment. For example, `python run_WWW_18_Mult_VAE.py -a True -b True -p True` runs the Mult-VAE experiment; this particular script additionally required us to manually download the Netflix Prize dataset before executing it.
### Extension
For the second part of our project, we wanted to add new results to the article. We inspected the papers evaluated in Table 1 of the article and went through the non-reproducible ones one by one to find candidates for which some code is available. We found an interesting paper that we believe deserves a further reproduction attempt: "Convolutional Matrix Factorization for Document Context-Aware Recommendation" by Donghyun Kim et al., which presents two algorithms that integrate a convolutional neural network (CNN) into probabilistic matrix factorization (PMF). In ConvMF, the CNN encodes each item's description document, and the item latent factors of PMF are regularized towards the resulting document representations.
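To make this coupling concrete, here is a minimal sketch of the ConvMF-style objective, with a fixed random matrix standing in for the CNN outputs. In the actual method the CNN is trained jointly with the latent factors, so this only illustrates the shape of the objective:
```
import numpy as np

rng = np.random.default_rng(0)
N_USERS, N_ITEMS, DIM = 50, 40, 8
lambda_u, lambda_v = 0.1, 10.0                     # PMF-style regularization weights

R = (rng.random((N_USERS, N_ITEMS)) * 5).round()   # toy rating matrix
mask = rng.random(R.shape) < 0.2                   # observed entries only
cnn_out = rng.standard_normal((N_ITEMS, DIM))      # stand-in for CNN(document_j)

U = rng.standard_normal((N_USERS, DIM)) * 0.1
V = cnn_out.copy()                                 # item factors start at the CNN output

lr = 0.005
for _ in range(300):
    err = mask * (R - U @ V.T)                     # error on observed ratings only
    # Gradient steps on: squared error + lambda_u*||U||^2
    #                                  + lambda_v*||V - CNN(doc)||^2
    U += lr * (err @ V - lambda_u * U)
    V += lr * (err.T @ U - lambda_v * (V - cnn_out))

print("train RMSE:", round(float(np.sqrt((err ** 2).sum() / mask.sum())), 4))
```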

The paper also extends the algorithm by using GloVe as a pre-trained word embedding model; this variant is called ConvMF+. We obtained the code from their Github [repository](https://github.com/cartopy/ConvMF) and the data they use from their [website](http://dm.postech.ac.kr/~cartopy/ConvMF/). Following the [installation instructions](https://github.com/cartopy/keras-0.3.3), we were able to run the code, although we needed some extra [preprocessed data](http://dm.postech.ac.kr/~cartopy/ConvMF/data/movielens.tar) from them.
## Results
### Reproducibility
We were able to reproduce most of the work shown in the paper. Tables 3, 4, 6 and 7 from the original article (MCRec, CVAE, NeuMF and SpectralCF) all gave us the expected results, with deviations of up to 0.05 in the metrics on which we evaluated them. On the other hand, we were not able to reproduce the following methods:
* **CMN**: For CMN we were able to obtain the same baseline values as the article, but not the values of the main algorithm, due to library compatibility issues.
* **CDL**: For CDL we were able to obtain neither the baselines nor the deep learning values, again due to compatibility issues with the MATLAB engine library and the tensorflow/keras libraries used.
* **Mult-VAE**: For Mult-VAE, the deep learning method was reproduced, but we were not able to reproduce the baselines. The baselines use SLIM, a sparse linear method for top-n recommender systems, which took too long to train (more than 10 days) because it runs on the CPU. Future versions of the code could improve this by using libraries that train these kinds of methods on the GPU.
The results found can be summarized in the following table:
| Algorithm | Method | Baselines |
| -------- | -------- | -------- |
| CMN | Not Reproduced| Reproduced|
| MCRec | Reproduced|Reproduced|
| CVAE | Reproduced|Reproduced|
| CDL | Not Reproduced|Not Reproduced|
| NeuMF | Reproduced|Reproduced|
| SpectralCF | Reproduced|Reproduced|
| Mult-VAE | Reproduced|Not Reproduced|
For all the tables of the reproduced algorithms and how they compare to the values obtained in the original article, we refer you to our [Github repository](https://github.com/venetheo/Reproducibility-Project) or this article's [Appendix](https://hackmd.io/@asolaroca/HJ5IVcHEq#Appendix).
### Extension
We were also able to reproduce some of the results from the article "Convolutional Matrix Factorization for Document Context-Aware Recommendation". That being said, it is important to note that we do not answer the question "To what extent do recent algorithms actually lead to better performance than relatively simple, but well-tuned, baseline methods?" for this article, as we believe that comparing the results to baselines was secondary to reproducing the paper.
The methods we were able to reproduce are the two main methods presented by the paper, ConvMF and ConvMF+, as well as one baseline, PMF. Our results for the ConvMF methods were very close to the values reported by Donghyun Kim et al. and we therefore deemed them appropriate. This was expected, as the original paper provides the code in a Github repository, together with the data used to obtain the results. For the PMF baseline, however, very little information is given: only a link to the original paper proposing the PMF method, with no actual code or implementation. This meant we had to sample and test different implementations of PMF until we [found one](https://github.com/fuhailin/Probabilistic-Matrix-Factorization) that seemed appropriate. The key aspect that made us settle on this implementation is that the regularization weights $\lambda_U$ and $\lambda_V$ are configurable and actually used, which allowed us to set them as the paper describes in order to get comparable results.
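For reference, $\lambda_U$ and $\lambda_V$ are the regularization weights of the standard PMF (MAP) objective over the set $\Omega$ of observed ratings,
$$
\mathcal{L}(U, V) = \frac{1}{2}\sum_{(i,j)\in\Omega}\left(r_{ij} - u_i^\top v_j\right)^2 + \frac{\lambda_U}{2}\sum_i \lVert u_i \rVert^2 + \frac{\lambda_V}{2}\sum_j \lVert v_j \rVert^2,
$$
and ConvMF replaces the $\lambda_V$ term with a penalty on the distance between $v_j$ and the CNN output for item $j$'s document, so matching these weights matters for a fair comparison.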
All the methods we reproduced for this paper were only run on the MovieLens 1M and 10M datasets. The paper also uses a third dataset, AIV (Amazon Instant Video), but as the original article already used MovieLens, we stuck with those two datasets as a good representation of what the original researchers would have aimed to reproduce.
While we were not able to fully reproduce this paper's table, missing two of its three baseline methods, we were still able to reproduce the two new methods the paper focuses on, as well as one baseline to give them context. As the original researchers never explicitly said why they could not reproduce this specific paper, our only speculation is that they wanted more details on exactly which implementations of which methods were used for the baselines, as the paper does not go into specifics.
The tables reproduced in this extension can also be found in our [Github repository](https://github.com/venetheo/Reproducibility-Project).
## Conclusion and Discussion
The main objective of this project was to reproduce a reproducibility project, namely the article "Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches". We found most of the methods to be reproducible and the code provided by the authors to be of good quality. That being said, many compatibility issues with tensorflow and keras complicated the task of reproducing the paper.
We were able to reproduce 4 of the 7 deep learning methods reproduced in the original paper, and we were even able to partially reproduce a method which that paper labeled as not reproducible. We also believe that the deep learning methods we could not reproduce due to compatibility issues are in fact reproducible, provided the authors supply better requirements.txt files.
## Individual Contributions
We initially divided the work by assigning each table to one person, but in the end Theo reproduced most of the results because his machine's hardware was compatible with the code. This means Ronan only reproduced the NeuMF table and Albert reproduced some baselines. Ronan also did the extension reproduction. Finally, Theo wrote the tables in the Github repository and Albert did the literature survey and the blog.
Overall, we feel the work was distributed fairly; we split tasks based more on what each machine could run than on who could do the work.
## References
[1] Travis Ebesu, Bin Shen, and Yi Fang. 2018. Collaborative Memory Network for Recommendation Systems. In Proceedings SIGIR ’18. 515–524.
[2] Binbin Hu, Chuan Shi, Wayne Xin Zhao, and Philip S Yu. 2018. Leveraging meta-path based context for top-n recommendation with a neural co-attention model. In Proceedings KDD ’18. 1531–1540.
[3] Xiaopeng Li and James She. 2017. Collaborative variational autoencoder for recommender systems. In Proceedings KDD ’17. 305–314.
[4] Hao Wang, Naiyan Wang, and Dit-Yan Yeung. 2015. Collaborative deep learning for recommender systems. In Proceedings KDD ’15. 1235–1244.
[5] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings WWW ’17. 173–182.
[6] Lei Zheng, Chun-Ta Lu, Fei Jiang, Jiawei Zhang, and Philip S. Yu. 2018. Spectral Collaborative Filtering. In Proceedings RecSys ’18. 311–319.
[7] Dawen Liang, Rahul G. Krishnan, Matthew D. Hoffman, and Tony Jebara. 2018. Variational Autoencoders for Collaborative Filtering. In Proceedings WWW ’18. 689–698.
## Appendix
*MCRec* : Results from the MCRec method presented at KDD '18. In each section below, the first table shows the values reported in the original article and the table under *Our results* shows our reproduction.
*Dataset:MovieLens-100k*
| Method | PREC@10 | REC@10 | NDCG@10 |
|------------|---------|--------|---------|
| TopPopular | 0.1907 | 0.1180 | 0.1361 |
| UserKNN | 0.2913 | 0.1802 | 0.2055 |
| ItemKNN | 0.3327 | 0.2199 | 0.2603 |
| P<sup>3</sup>$\alpha$| 0.2137 | 0.1585 | 0.1838 |
| RP<sup>3</sup>$\beta$| 0.2357 | 0.1684 | 0.1923 |
| MCRec | 0.3077 | 0.2061 | 0.2363 |
*Our results*
| Method | PREC@10 | REC@10 | NDCG@10 |
|------------|---------|--------|---------|
| TopPopular | 0.1907 | 0.1180 | 0.2184 |
| UserKNN | 0.3370 | 0.2097 | 0.3917 |
| ItemKNN | 0.3373 | 0.2235 | 0.4065 |
| P<sup>3</sup>$\alpha$| 0.3361 | 0.2200 | 0.4073 |
| RP<sup>3</sup>$\beta$| 0.3454 | 0.2230 | 0.4118 |
| MCRec | 0.3047 | 0.2046 | 0.3617 |
*CVAE* : Results from the CVAE method presented at KDD '18.
*Dataset:CiteULike-a*
| Method | REC@50 | REC@100 | REC@300 |
|---------------|--------|---------|---------|
| TopPopular | 0.0044 | 0.0081 | 0.0258 |
| UserKNN | 0.0683 | 0.1016 | 0.1685 |
| ItemKNN | 0.0788 | 0.1153 | 0.1823 |
| P<sup>3</sup>$\alpha$| 0.0788 | 0.1151 | 0.1784 |
| RP<sup>3</sup>$\beta$| 0.0811 | 0.1184 | 0.1799 |
| ItemKNN-CFCBF | 0.1837 | 0.2777 | 0.4486 |
| CVAE | 0.0772 | 0.1548 | 0.3602 |
*Our results*
| Method | REC@50 | REC@100 | REC@300 |
|---------------|--------|---------|---------|
| TopPopular | 0.0253 | 0.0389 | 0.0704 |
| UserKNN | 0.0026 | 0.0053 | 0.0154 |
| ItemKNN | 0.0026 | 0.0053 | 0.0154 |
| P<sup>3</sup>$\alpha$| 0.0026 | 0.0053 | 0.0154 |
| RP<sup>3</sup>$\beta$| 0.0026 | 0.0053 | 0.0154 |
| ItemKNN-CFCBF | 0.0607 | 0.0634 | 0.0730 |
| CVAE | 0.0779 | 0.1192 | 0.2202 |
*NEUMF* : Results from the NeuMF method presented at WWW '17.
*Dataset:Pinterest*
| Method | HR@5 | NDCG@5 | HR@10 | NDCG@10 |
|---------------|--------|--------|--------|---------|
| TopPopular | 0.1663 | 0.1065 | 0.2744 | 0.1412 |
| UserKNN | 0.7001 | 0.5033 | 0.8610 | 0.5557 |
| ItemKNN | 0.7100 | 0.5092 | 0.8744 | 0.5629 |
| P<sup>3</sup>$\alpha$| 0.7008 | 0.5018 | 0.8667 | 0.5559 |
| RP<sup>3</sup>$\beta$| 0.7105 | 0.5116 | 0.8740 | 0.5650 |
| NeuMF | 0.7024 | 0.4983 | 0.8719 | 0.5536 |
*Our results*
| Method | HR@5 | NDCG@5 | HR@10 | NDCG@10 |
|---------------|--------|--------|--------|---------|
| TopPopular | 0.1665 | 0.1064 | 0.2740 | 0.1409 |
| UserKNN | 0.7042 | 0.5051 | 0.8656 | 0.5577 |
| ItemKNN | 0.7126 | 0.5118 | 0.8778 | 0.5656 |
| P<sup>3</sup>$\alpha$| 0.7016 | 0.5018 | 0.8687 | 0.5563 |
| RP<sup>3</sup>$\beta$| 0.7139 | 0.5141 | 0.8775 | 0.5674 |
| NeuMF | 0.7046 | 0.4994 | 0.8766 | 0.5556 |
*Dataset:MovieLens-1m*
| Method | HR@5 | NDCG@5 | HR@10 | NDCG@10 |
|------------|--------|--------|--------|---------|
| TopPopular | 0.3043 | 0.2062 | 0.4531 | 0.2542 |
| UserKNN | 0.4916 | 0.3328 | 0.6705 | 0.3908 |
| ItemKNN | 0.4829 | 0.3328 | 0.6596 | 0.3900 |
| P<sup>3</sup>$\alpha$| 0.4811 | 0.3331 | 0.6464 | 0.3867 |
| RP<sup>3</sup>$\beta$| 0.4922 | 0.3409 | 0.6715 | 0.3867 |
| NeuMF | 0.5486 | 0.3840 | 0.7120 | 0.4369 |
| SLIM | 0.5589 | 0.3961 | 0.7161 | 0.4470 |
*Our results*
| Method | HR@5 | NDCG@5 | HR@10 | NDCG@10 |
|------------|--------|--------|--------|---------|
| TopPopular | 0.3048 | 0.2064 | 0.4533 | 0.2542 |
| UserKNN | 0.4921 | 0.3319 | 0.6714 | 0.3899 |
| ItemKNN | 0.4914 | 0.3377 | 0.6624 | 0.3931 |
| P<sup>3</sup>$\alpha$| 0.4687 | 0.3232 | 0.6430 | 0.3796 |
| RP<sup>3</sup>$\beta$| 0.4947 | 0.3418 | 0.6694 | 0.3985 |
| NeuMF | 0.5406 | 0.3807 | 0.7098 | 0.4356 |
| SLIM | 0.5560 | 0.3939 | 0.7136 | 0.4450 |
*SPECTRALCF*: Results from SpectralCF method presented at RecSys '18.
*Dataset:MovieLens-1m*
| Method | REC@20 | MAP@20 | REC@60 | MAP@60 | REC@100 | MAP@100 |
|------------|--------|--------|--------|--------|---------|---------|
| TopPopular | 0.1853 | 0.0576 | 0.3335 | 0.0659 | 0.4244 | 0.0696 |
| UserKNN CF | 0.2881 | 0.1106 | 0.4780 | 0.1238 | 0.5790 | 0.1290 |
| ItemKNN CF | 0.2819 | 0.1059 | 0.4712 | 0.1190 | 0.5737 | 0.1243 |
| P<sup>3</sup>$\alpha$ | 0.2853 | 0.1051 | 0.4808 | 0.1195 | 0.5760 | 0.1248 |
| RP<sup>3</sup>$\beta$ | 0.2910 | 0.1088 | 0.4882 | 0.1233 | 0.5884 | 0.1288 |
| SpectralCF | 0.1843 | 0.0539 | 0.3274 | 0.0618 | 0.4254 | 0.0656 |
*Our results*
| Method | REC@20 | MAP@20 | REC@60 | MAP@60 | REC@100 | MAP@100 |
|------------|--------|--------|--------|--------|---------|---------|
| TopPopular | 0.1941 | 0.0611 | 0.3399 | 0.0693 | 0.4299 | 0.0729 |
| UserKNN CF | 0.3013 | 0.1219 | 0.4870 | 0.1349 | 0.5851 | 0.1400 |
| ItemKNN CF | 0.2972 | 0.1180 | 0.4864 | 0.1309 | 0.5847 | 0.1361 |
| P<sup>3</sup>$\alpha$ | 0.3025 | 0.1174 | 0.4981 | 0.1316 | 0.5935 | 0.1371 |
| RP<sup>3</sup>$\beta$ | 0.3101 | 0.1225 | 0.5071 | 0.1369 | 0.6058 | 0.1424 |
| SpectralCF | 0.1619 | 0.0488 | 0.3132 | 0.0567 | 0.4037 | 0.0598 |
*Mult-VAE* : Results from the Mult-VAE method presented at WWW '18.
*Dataset:Netflix*
| Method | REC@20 | NDCG@20 | REC@50 | NDCG@50 | REC@100 | NDCG@100 |
|------------|--------|---------|--------|---------|---------|----------|
| TopPopular | 0.0782 | | 0.1643 | | | 0.1570 |
| ItemKNN CF | 0.2088 | | 0.3386 | | | 0.3086 |
| P<sup>3</sup>$\alpha$ | 0.1977 | | 0.3346 | | | 0.2967 |
| RP<sup>3</sup>$\beta$ | 0.2196 | | 0.3560 | | | 0.3246 |
| SLIM | 0.2551 | 0.2473 | 0.3995 | 0.3196 | 0.5289 | 0.3745 |
| Mult-VAE | 0.2626 | 0.2448 | 0.4138 | 0.3192 | 0.5476 | 0.3756 |
*Our results*
| Method | REC@20 | NDCG@20 | REC@50 | NDCG@50 | REC@100 | NDCG@100 |
|------------|--------|---------|--------|---------|---------|----------|
| TopPopular | 0.0782 | | 0.1643 | | | 0.1570 |
| ItemKNN CF | 0.2088 | | 0.3386 | | | 0.3086 |
| P<sup>3</sup>$\alpha$ | 0.1977 | | 0.3346 | | | 0.2967 |
| RP<sup>3</sup>$\beta$ | 0.2196 | | 0.3560 | | | 0.3246 |
| SLIM | 0.2551 | 0.2473 | 0.3995 | 0.3196 | 0.5289 | 0.3745 |
| Mult-VAE | 0.2641 | 0.3163 | 0.4174 | 0.3417 | 0.5508 | 0.3829 |
*CDL* : Results from the CDL method presented at KDD '15. Not reproduced, so no table is shown.
*CMN* : Results from the CMN method presented at SIGIR '18.
*Dataset:CiteULike-a*
| Method | HR@5 | NDCG@5 | HR@10 | NDCG@10 |
|------------|--------|--------|--------|---------|
| TopPopular | 0.1803 | 0.1220 | 0.2783 | 0.1535 |
| UserKNN | 0.8213 | 0.7033 | 0.8935 | 0.7268 |
| ItemKNN | 0.8116 | 0.6939 | 0.8878 | 0.7187 |
| P<sup>3</sup>$\alpha$| 0.8202 | 0.7061 | 0.8901 | 0.7289 |
| RP<sup>3</sup>$\beta$| 0.8226 | 0.7114 | 0.8941 | 0.7347 |
| CMN | 0.8069 | 0.6666 | 0.8910 | 0.6942 |
*Dataset:Pinterest*
| Method | HR@5 | NDCG@5 | HR@10 | NDCG@10 |
|------------|--------|--------|--------|---------|
| TopPopular | 0.1668 | 0.1066 | 0.2745 | 0.1411 |
| UserKNN | 0.6886 | 0.4936 | 0.8527 | 0.5470 |
| ItemKNN | 0.6966 | 0.4994 | 0.8647 | 0.5542 |
| P<sup>3</sup>$\alpha$| 0.6871 | 0.4935 | 0.8449 | 0.5450 |
| RP<sup>3</sup>$\beta$| 0.7018 | 0.5041 | 0.8644 | 0.5571 |
| CMN | 0.6872 | 0.4883 | 0.8549 | 0.5430 |
*Dataset:Epinions*
| Method | HR@5 | NDCG@5 | HR@10 | NDCG@10 |
|------------|--------|--------|--------|---------|
| TopPopular | 0.5429 | 0.4153 | 0.6644 | 0.4547 |
| UserKNN | 0.3506 | 0.2983 | 0.3922 | 0.3117 |
| ItemKNN | 0.3821 | 0.3165 | 0.4372 | 0.3343 |
| P<sup>3</sup>$\alpha$| 0.3510 | 0.2989 | 0.3891 | 0.3112 |
| RP<sup>3</sup>$\beta$| 0.3511 | 0.2980 | 0.3892 | 0.3103 |
| CMN | 0.4195 | 0.3346 | 0.4953 | 0.3592 |
*Extension ConvMF* : Results from our attempt to reproduce the paper by Donghyun Kim et al. (one of the papers that Ferrari Dacrema et al. could not reproduce).
*Dataset:MovieLens1M*
| Method | RMSE |
|------------|--------|
| PMF | 0.8971 |
| ConvMF | 0.8531 |
| ConvMF+ | 0.8549 |
*Our results*
| Method | RMSE |
|------------|--------|
| PMF | 0.8731 |
| ConvMF | 0.8569 |
| ConvMF+ | 0.8570 |
*Dataset:MovieLens10M*
| Method | RMSE |
|------------|--------|
| PMF | 0.8311 |
| ConvMF | 0.7958 |
| ConvMF+ | 0.7958 |
*Our results*
| Method | RMSE |
|------------|--------|
| PMF | 0.8668 |
| ConvMF | 0.7889 |
| ConvMF+ | 0.7863 |