# 1 Response to All Reviewers and Additional Experiment Results
**All data figures and tables are available at this [anonymous link](https://anonymous.4open.science/r/KGRecRole-CCC0/rebuttal_figure/).**
We sincerely appreciate the valuable feedback provided by the reviewers, which helps us improve the quality of our paper. Based on the comments, we conducted the following new experiments to improve the paper: experiments with new datasets, new data preprocessing, a new RS model, and a check of consistency with published results.
Before presenting the results of the additional experiments, we first want to clarify that although it would be ideal to comprehensively evaluate as many of the existing KG-based RS models and datasets (under various performance metrics) as possible, this paper does not aim at that goal, which could hardly fit within the page limit of a paper (our appendix already takes 20 pages). Rather, we aim to benefit the field mainly with the following contributions. First, we provide a novel perspective for systematically investigating whether KGs are really as useful in existing KG-based RSs as one would expect, and how to quantify their role. To the best of our knowledge, there is barely any similar work in the literature systematically researching this question. Second, the experiments with the relatively more influential RS models deliver multiple counter-intuitive results, calling on designers of KG-based RSs to reflect on how to exploit KGs more effectively. These may also trigger thinking about whether other types of side information (e.g., social network information) are used as effectively as expected in RSs. Moreover, the entire systematic evaluation framework and the metric KGER are not confined to analyzing KG-based RSs; they readily scale to other settings, e.g., consistently evaluating emerging RS models. As we pointed out in Section 5, they naturally generalize to analyzing the role of other types of side information in supporting various kinds of system performance (not necessarily recommendation accuracy).
## 1.1 New Datasets
First, as suggested by Reviewer DN8B, we experimented with two new, smaller datasets, Last.FM and Book-Crossing, released with RippleNet [1] and KGCN [2]. For Book-Crossing, we treat interactions with ratings of at least 4 as positive feedback. For Last.FM, following the instructions of [2], we perform no further processing other than converting it to the RecBole dataset format. Statistics of the new datasets are presented in Table 1.
All four settings (RQ1-4) are used; the results are shown in Figures 1-3 and Table 2.
With Last.FM, in both the false knowledge and decreasing knowledge experiments, the MRR of all models does not necessarily decrease as more knowledge is randomly distorted (Figure 1a) or removed (Figure 2a); rather, it tends to fluctuate within 0.05. With Book-Crossing, the MRR of all models except KGIN, KGCN, and MCCLK continues to fluctuate (with no monotonic trend) in both settings. The MRR of CFKG even increases by 3.2\% when the KG of Book-Crossing is fully randomized. The MRR of KGIN, KGCN, and MCCLK tends to decrease as more knowledge is randomly distorted or removed. This implies that the KG of Book-Crossing does help the accuracy of these three models, but not of the other five. Overall, the results on the two new datasets remain consistent with our previous conclusion: removing, randomly distorting, or reducing knowledge does not necessarily decrease recommendation accuracy.
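For clarity on how the two perturbation settings can be implemented, below is a minimal Python sketch (illustrative only, not our actual code), assuming the KG is stored as a list of (head, relation, tail) triples:

```python
import random

def perturb_kg(triples, entities, ratio, mode="distort", seed=0):
    """Perturb a `ratio` fraction of KG facts.

    mode="distort": replace the tail entity of each selected fact with a
                    randomly sampled entity (false knowledge setting).
    mode="remove":  drop the selected facts (decreasing knowledge setting).
    `triples` is a list of (head, relation, tail) tuples; `entities` is the
    pool of entity ids from which replacement tails are sampled.
    """
    rng = random.Random(seed)
    idx = set(rng.sample(range(len(triples)), int(len(triples) * ratio)))
    if mode == "remove":
        return [t for i, t in enumerate(triples) if i not in idx]
    out = list(triples)
    for i in idx:
        h, r, _ = out[i]
        out[i] = (h, r, rng.choice(entities))
    return out
```

For instance, `perturb_kg(kg_triples, entity_ids, 0.5, mode="distort")` would correspond to the 50\% point on the false knowledge curves.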
[Table 1 Statistics of new datasets](https://anonymous.4open.science/r/KGRecRole-CCC0/rebuttal_figure/all_table1.png)
### 1.1.1 No knowledge experiment
[Table 2 No knowledge experiment of new datasets](https://anonymous.4open.science/r/KGRecRole-CCC0/rebuttal_figure/all_table2.png)
### 1.1.2 False knowledge experiment
[Figure 1 False knowledge experiment of new datasets](https://anonymous.4open.science/r/KGRecRole-CCC0/rebuttal_figure/all_figure1.png)
### 1.1.3 Decreasing knowledge experiment
[Figure 2 Decreasing knowledge experiment of new datasets](https://anonymous.4open.science/r/KGRecRole-CCC0/rebuttal_figure/all_figure2.png)
### 1.1.4 Cold-start experiment
[Figure 3 Cold-start experiment of new datasets](https://anonymous.4open.science/r/KGRecRole-CCC0/rebuttal_figure/all_figure3.png)
## 1.2 New Data Preprocessing
Second, we reprocessed the two originally used datasets (MovieLens-100K and Amazon-Books) to make them less heavily filtered, as suggested by both Reviewers 7rbw and DN8B. For MovieLens-100K, we treat interactions with ratings of at least 4 as positive feedback. For Amazon-Books, we do the same and recursively filter out users and items with fewer than 30 interactions instead of 300 (as done in the submission).
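For concreteness, this binarization plus recursive filtering can be sketched as follows (a minimal pandas sketch; the function and column names are illustrative, not our actual pipeline):

```python
import pandas as pd

def binarize_and_filter(df: pd.DataFrame, threshold: int = 4, k: int = 30) -> pd.DataFrame:
    """Keep ratings >= threshold as positive feedback, then recursively
    drop users and items with fewer than k interactions until stable."""
    df = df[df["rating"] >= threshold].copy()
    while True:
        user_counts = df["user_id"].value_counts()
        item_counts = df["item_id"].value_counts()
        keep = (df["user_id"].map(user_counts) >= k) & (
            df["item_id"].map(item_counts) >= k
        )
        if keep.all():
            return df
        df = df[keep]
```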
The results are presented in Figures 4-6 and Table 4. The overall conclusions under the new preprocessing remain consistent with those under the previous preprocessing; that is, removing, randomly distorting, or reducing knowledge does not necessarily decrease MRR, which tends to fluctuate across all the cases in Table 4 and Figures 4-6.
[Table 3 Statistics of datasets with new data preprocessing](https://anonymous.4open.science/r/KGRecRole-CCC0/rebuttal_figure/all_table3.png)
### 1.2.1 No knowledge experiment
[Table 4 No knowledge experiment of datasets with new data preprocessing](https://anonymous.4open.science/r/KGRecRole-CCC0/rebuttal_figure/all_table4.png)
### 1.2.2 False knowledge experiment
[Figure 4 False knowledge experiment of datasets with new data preprocessing](https://anonymous.4open.science/r/KGRecRole-CCC0/rebuttal_figure/all_figure4.png)
### 1.2.3 Decreasing knowledge experiment
[Figure 5 Decreasing knowledge experiment of datasets with new data preprocessing](https://anonymous.4open.science/r/KGRecRole-CCC0/rebuttal_figure/all_figure5.png)
### 1.2.4 Cold-start Experiment
[Figure 6 Cold-start Experiment of datasets with new data preprocessing](https://anonymous.4open.science/r/KGRecRole-CCC0/rebuttal_figure/all_figure6.png)
## 1.3 New Recommendation Model
Third, w.r.t. the concern of both Reviewer AN27 and Reviewer 7rbw, we introduce a new RS model, MCCLK [3], published at SIGIR 2022. The results are presented in Figures 1-6, Table 2, and Table 4 in the New Datasets and New Data Preprocessing sections. It can be observed that MCCLK behaves somewhat similarly to KGIN in our experiments. On the Book-Crossing dataset, its MRR decreases as the authenticity and amount of facts decrease, indicating that it can effectively utilize the KG of Book-Crossing. However, on the other datasets (MovieLens-100K, Amazon-Books, and Last.FM), its MRR correlates little with either the authenticity or the amount of facts in the KGs. For example, the MRR of MCCLK even increases by 6.5\% on Amazon-Books when the facts are fully randomized. The new results generally reinforce our conclusion that the role of a KG is highly dependent on both the dataset and the RS.
## 1.4 Consistency with Published Results
Fourth, w.r.t. the concern of Reviewer 7rbw, we conducted experiments on the MovieLens-100K and MovieLens-1M datasets using the hyperparameter settings outlined in the mentioned KGCN paper to evaluate the performance of KGCN, RippleNet, and CKE with MRR as the metric. The results are in Table 5 and remain consistent with the published paper, showing the same performance order KGCN \> RippleNet \> CKE on both datasets. Due to time and computational resource constraints, we did not extend the experiments to the MovieLens-20M dataset.
[Table 5 Comparative results of RS performance (MRR) under hyperparameter settings in the KGCN paper](https://anonymous.4open.science/r/KGRecRole-CCC0/rebuttal_figure/all_table5.png)
The discrepancy observed in the submission can thus be attributed to differences in hyperparameter settings between our study and the KGCN paper. Notably, our experiments indicate that CKE performs better under our hyperparameter configurations than under those of the KGCN paper, suggesting that our settings are more conducive to its performance under our experimental conditions.
## Reference
[1] Wang, Hongwei, et al. "RippleNet: Propagating user preferences on the knowledge graph for recommender systems." Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 2018.
[2] Wang, Hongwei, et al. "Knowledge graph convolutional networks for recommender systems." The World Wide Web Conference. 2019.
[3] Zou, Ding, et al. "Multi-level cross-view contrastive learning for knowledge-aware recommender system." Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2022.
# 2 To reviewer 7rbw
Thank you for your insightful and valuable feedback. We appreciate your acknowledgement of the novelty of our work, and we have improved it with additional experiments following your suggestions.
**1. Did you attempt to run at least a subset of the experiments on less filtered datasets? Were the results consistent with Table 2?**
We want to first clarify that the LFM-1b [1] dataset used in our paper is distinct from Last.FM. LFM-1b consists of over one billion music listening events, which necessitates heavier preprocessing.
Regarding your concern about dataset preprocessing, we conducted additional experiments to make the results more convincing. We employed a more reasonable preprocessing for the MovieLens-100K and Amazon-Books datasets, and we also experimented with two additional datasets, Last.FM and Book-Crossing. The details of these new experiments are presented in the overall response (Sections New Datasets and New Data Preprocessing). The results obtained are generally consistent with our original conclusion.
**2. Evaluation metric MRR**
As outlined in Section 4.1, we utilized various metrics, including NDCG, Recall, and others. Due to the space limit of the paper, we present the detailed results for these additional metrics in Appendices D-G; they remain consistent with the results obtained with MRR.
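For reference, the MRR used throughout follows the standard definition of mean reciprocal rank:

```latex
\mathrm{MRR} = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{1}{\mathrm{rank}_u}
```

where $\mathcal{U}$ is the test user set and $\mathrm{rank}_u$ is the position of the first relevant item in user $u$'s ranked recommendation list.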
**3. Validate implementations and experimental results against numbers in the published literature? For instance, the KGCN paper finds that KGCN \> RippleNet \> CKE on MovieLens (20M). On the filtered version of MovieLens 100k, this submission finds CKE \> KGCN \> RippleNet.**
Regarding your concern about consistency with published results, we conducted additional experiments on the MovieLens-100K and MovieLens-1M datasets using the hyperparameter settings outlined in the mentioned KGCN paper. The details of the new experiments are presented in the overall response (Table 5 in the Consistency with Published Results section). The results remain consistent with the KGCN paper, showing the same performance order KGCN \> RippleNet \> CKE on both datasets. The discrepancy observed in the submission can be attributed to differences in hyperparameter settings between our study and the KGCN paper.
**4. Are there no appropriate baselines published since 2020 other than KGIN?**
We experimented with a new model, MCCLK [2], published at SIGIR 2022. The detailed experiments and results are provided in the Additional Experiment Results of the overall response. They show that the KG of Book-Crossing seems to help the MRR of MCCLK, but for all the other datasets the KGs barely influence performance (details in the overall response, Section New Recommendation Model, Figures 1-6, Table 2, and Table 4). The KG-based RSs in the submission were chosen mainly because they are relatively influential, representative models recognized by the field.
Table 1 lists their citation counts, taken from Google Scholar. Due to the space limit, we will consider other emerging KG-based RS models (some open-source but not yet published) in future work.
[Table 1 Citations of RSs used in our submission](https://anonymous.4open.science/r/KGRecRole-CCC0/rebuttal_figure/citation.png)
## Reference
[1] Schedl, Markus. "The LFM-1b dataset for music retrieval and recommendation." Proceedings of the 2016 ACM International Conference on Multimedia Retrieval. 2016.
[2] Zou, Ding, et al. "Multi-level cross-view contrastive learning for knowledge-aware recommender system." Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2022.
# 3 To reviewer AN27
Thank you for the valuable feedback and recognition of our work. Below we clarify your concerns and answer the questions.
**1. Limited scope in KG exploration:**
While it is interesting to explore the role of KGs in other aspects of recommendation, such as explainability or diversity, we focus on accuracy in this paper since it is usually the primary property considered when designing a RS. Even for accuracy alone, systematic analysis and extensive experiments are required, such that not all results fit within the page limit (our appendix takes around 20 pages). We are open to studying properties other than accuracy in future work. It is also worth noting that, to our knowledge, little research has studied how KGs influence the performance of KG-based RSs, even when only accuracy is considered.
**2. Outdated KG models:**
The KG-based recommendation models we chose are relatively influential, representative ones recognized in the RS domain, which also ensures a reliable benchmark for our analysis. Table 1 lists their citation counts, taken from Google Scholar.
Nevertheless, we experimented with a new RS model, MCCLK [1], published at SIGIR 2022, and its performance does not necessarily decrease with less knowledge in the KGs (detailed results are in the overall response, Section New Recommendation Model, Figures 1-6, Table 2, and Table 4). Our proposed evaluation framework and metric scale naturally to more emerging KG-based RS models, which we would like to explore in the future.
[Table 1 Citations of RSs used in our submission](https://anonymous.4open.science/r/KGRecRole-CCC0/rebuttal_figure/citation.png)
**3. Potential overgeneralization:**
We agree it would be ideal to evaluate all KG-based RS models and datasets. However, due to limitations in space and computing resources, we include a total of 8 models (9 with the new model MCCLK) across three different types of RSs. Our results (together with the additional experiments in the overall response) are representative to a certain extent, and we think they are sufficient to call for attention to reinspecting the efficient use of KGs, which is our main purpose (rather than pursuing comprehensiveness, as highlighted in the overall response).
**4. The detailed statistics of the used KGs:**
The number of relations and entities can be found in Table 1 in Section 4.1 of the submission.
**5. Does the used KG cover enough external/unfound knowledge?**
Consider the MovieLens-100K dataset as an example. Its KG encompasses 24 different types of relations, covering information about actors, countries, movie genres, directors, languages, and more. It comprises 34,628 entities and 91,631 facts; notably, the average number of facts associated with each item is 57. These figures indicate the richness and diversity of the KG.
## Reference
[1] Zou, Ding, et al. "Multi-level cross-view contrastive learning for knowledge-aware recommender system." Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2022.
# 4 To reviewer ntTN
Thank you very much for the valuable and encouraging feedback. Below, we answer the posed questions.
**1. The authors provide a few discussion points of why they get these surprising results, e.g., missing alignment between KG And user preferences. Can authors comment on this in more detail?**
For example, in the context of the MovieLens dataset, the KG contains information about "costume designer", which may be less relevant to users' movie preferences. Such information may sometimes introduce noise, intensifying the overfitting problem of neural networks and accordingly causing recommendations to deviate from actual user preferences.
**2. Is there a discrepancy between a general (static) knowledge and a more specific (dynamic) signal from the user-item interactions? Does this discrepancy make KG obsolete for recommender systems?**
We appreciate your valuable insight. Existing KGs provide facts that are supposed to be always true. If a KG remains fixed, or its updates cannot keep up with the continually evolving users/items in a RS and their preferences, then the facts it includes may be insufficient (or lag behind) for supporting recommendation.
**3. What is the connection between content-based recommendations and KG-based recommendations? Would KG be more useful for content-based recommendations?**
This is an interesting problem. The KG-based recommender systems we study belong to collaborative filtering based recommender systems. They typically take two types of input, namely the user-item interaction matrix and a KG as side information, where existing KGs usually consist of item attribute information. The input of existing content-based RSs mainly includes item attribute information related to each user [1, 2]. A knowledge graph could serve as a more comprehensive source of information in content-based RSs, replacing plain item attributes, which may increase recommendation accuracy.
## Reference
[1] Adomavicius, Gediminas, and Alexander Tuzhilin. "Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions." IEEE Transactions on Knowledge and Data Engineering 17.6 (2005): 734-749.
[2] Javed, Umair, et al. "A review of content-based and context-based recommendation systems." International Journal of Emerging Technologies in Learning (iJET) 16.3 (2021): 274-306.
# 5 To reviewer DN8B
Thank you for your careful and constructive review. Below, we answer the posed questions and clarify the concerns with the additional experiment results.
**1. Concern with dataset pre-process**
We appreciate your concern; to make our conclusions more convincing, we conducted new experiments, the details of which are in the overall response.
First, we experimented with two additional smaller datasets, Last.FM and Book-Crossing (across all RQs), as you suggested. The results deviate little from the main conclusion in the submission, namely that removing, randomly distorting, or reducing knowledge does not necessarily decrease MRR (see Figures 1-3 and Table 2).
Second, we updated the preprocessing of the originally used MovieLens-100K and Amazon-Books datasets to make it more reasonable (LFM-1b has not been updated yet due to the time limit).
For MovieLens-100K, we treat interactions with ratings of at least 4 as positive feedback. For Amazon-Books, we do the same and recursively filter out users and items with fewer than 30 interactions instead of deleting those with fewer than 300. The results show that our conclusion generally remains unchanged: removing, randomly distorting, or reducing knowledge does not necessarily decrease recommendation accuracy (see Figures 4-6 and Table 4).
**2. MovieLens-1M more appropriate and why not to use it**
We are currently experimenting with MovieLens-1M. Our previous choice of MovieLens-100K was to ensure that the experiments could be completed within a reasonable timeframe: the dataset strikes a balance between being relatively small and yet representative, allowing us to conduct the experiments efficiently without compromising the quality and integrity of our research.
**3. No source code**
We have already provided the source code (the [anonymous link](https://anonymous.4open.science/r/KGRecRole-CCC0/) in Appendix A of the paper).
**4. Why not replace KG facts (RQ1) with self-facts as done in RQ3?**
Following your suggestion, we ran a new experiment re-investigating RQ1 by deleting 100\% of the KG facts and keeping only self-facts; a minimal sketch of this construction is given after the results below.
[Table 1 Comparison of the No knowledge setting with the Self-fact setting](https://anonymous.4open.science/r/KGRecRole-CCC0/rebuttal_figure/DN8B_table1.png)
The results demonstrate that replacing KG facts with self-facts has an effect similar to replacing the original KGs with the interaction graph (the straightforward approach used previously) in most cases. This further reinforces our original conclusion for RQ1: the recommendation accuracy of a KG-based RS does not necessarily decrease when the KG is replaced by the user-item interaction graph.
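For concreteness, the self-fact setting can be sketched as follows (a minimal sketch assuming a triple-list representation; the function name and placeholder relation id are illustrative, not our actual code):

```python
def build_self_fact_kg(item_entities, self_relation_id=0):
    """Replace every KG fact with a self-fact: each item entity is linked
    only to itself via a single placeholder relation, so the KG carries no
    external knowledge while the model's KG interface stays intact."""
    return [(e, self_relation_id, e) for e in item_entities]
```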
**5. Is the metric KGER the only "research" contribution of this paper?**
We respectfully disagree with the view that only new definitions in a paper can be treated as its "research" contribution, ignoring other components such as the evaluation methods and the experimental findings.
Our research contribution extends well beyond the introduction of KGER. A detailed clarification of our contributions is presented at the beginning of the overall response. Briefly, the novel perspective of inspecting the actual role of KGs in RSs, the scalable evaluation framework and metric (applicable to more emerging models, performance metrics, and datasets), and, more importantly, the counter-intuitive findings are all contributions that will benefit the field, calling for attention to reinspecting the efficient use of KGs. Moreover, our research emphasizes the need for evaluation and benchmarking of KG-based RSs, which aligns with the critical discourse on reproducibility and rigorous assessment in the RS field [1, 2].
## Reference
[1] Ferrari Dacrema, Maurizio, Paolo Cremonesi, and Dietmar Jannach. "Are we really making much progress? A worrying analysis of recent neural recommendation approaches." Proceedings of the 13th ACM conference on recommender systems. 2019.
[2] Sun, Zhu, et al. "Are we evaluating rigorously? Benchmarking recommendation for reproducible evaluation and fair comparison." Proceedings of the 14th ACM Conference on Recommender Systems. 2020.