## Appeal Letter
Dear Program Committee,
We are writing to formally appeal the handling of our submission by the assigned Senior Area Chair (SAC). It has come to our attention that our work was subject to highly biased comments after the rebuttal phase, specifically during the SAC agreement period.
Specifically, new comments were added on April 19 and 23, which were after the rebuttal phase had officially closed. These comments included critiques that suggest a lack of understanding of the core concepts of our work:
1. One reviewer asserted that our work lacks value to the ICML community solely on the grounds that they had 'never heard of neuro-symbolic learning,' which is shockingly irresponsible, to say the least.
2. Another reviewer mistakenly concluded that we only addressed the i.i.d. case, having read only a portion of our work. They overlooked the fact that we begin the discussion and proofs with the i.i.d. case, while the remainder of the paper focuses on the non-i.i.d. scenario.
We believe that these critiques are not valid grounds for rejection, as they reflect a misunderstanding of the material rather than issues with the work itself. Given that the SAC has access to the identities of both authors and reviewers during the agreement period, we are concerned about the fairness and impartiality of the review process.
We respectfully request that the Program Committee consider assigning a different SAC to reevaluate our submission, focusing on the substantive review comments and scores received initially.
Please let us know if further information is required from our side. We appreciate your attention to this matter and look forward to your response.
Sincerely,
Authors
## For AC/PC (new reviewer)
Dear AC/PC,
We are writing to express our deep disappointment with the feedback provided by Reviewer 6iRP. We regret that we received this review after the discussion period. We believe that if we had been able to discuss with this reviewer, the misunderstandings on their part could have been cleared up. We therefore think that this situation needs to be taken into account in the final decision.
**In short, the emergency reviewer's comments lack responsibility: they fixate on a privacy deployment issue without recognizing that our method aligns with standard model parameter transmission. In addition, the reviewer asserts that our work lacks value to the ICML community solely because they have 'never heard of neuro-symbolic learning,' which is shockingly irresponsible, to say the least. Moreover, the distribution-shift-based dataset construction commonly adopted in other works has been overlooked, leading the reviewer to dismiss our evaluation setup as unfair. Lastly, the research setting of the similar work listed by the reviewer holds less value than ours, which demonstrates a lack of thorough assessment. These points highlight the need for a fair evaluation of our paper. The details can be found in the following responses:**
**Weakness 1 (Privacy):**
The paper mentions privacy a number of times. However, FL does not guarantee privacy, as shown by a long line of work on privacy attacks on FL systems. We say that FL is privacy compatible or even privacy amenable because FL allows us to achieve Differential Privacy (DP) guarantees through the use of DP-FL algorithms. For example, all Google Keyboard models that are currently deployed were trained with DP under FL. Without any formal privacy framework in place, the proposed algorithm cannot be said to preserve privacy. The motivation and contribution of the paper are therefore thin. The FL framework in place does not seem to provide any benefits over just aggregating the user data and creating a NSL model with the proposed V-EM method.
**Response to Weakness 1:**
We utilize a probabilistic model, so transferring probability estimates essentially amounts to transferring model parameters, in line with other federated learning (FL) methodologies that share model weights. Additionally, it is crucial to recognize that the incorporation of Differential Privacy (DP) techniques is an implementation issue, which is not the core research scope of our paper. The novelty and value of our work stem from the introduction of a new paradigm and theoretical framework rather than the specifics of DP implementation. Consequently, the reviewer's assertion that "the motivation and contribution of the paper are therefore thin" due to this implementation detail is unfounded. Contrary to the reviewer's subsequent assertion, our methodology does not aggregate user data but rather aggregates probabilistic model weights, consistent with other FL works.
**Weakness 2 (Relevance):**
I have never heard of neuro symbolic learning in my life before I was requested to serve as an emergency reviewer for this paper. FL has attracted a large research community because models trained with FL are deployed to serve billions of queries every day. However, I've never heard of anyone using FL with NSL, and I've been working on FL since 2018. I have low confidence that this paper will be of interest to the community at ICML, or of any practical value for entities interested in implementing FL with NSL.
**Response to Weakness 2:**
The claim of having 'never heard of neuro-symbolic learning' cannot serve as a basis for doubting the potential value and interest of federated NSL for the ICML community. In fact, it underscores the novelty of our approach, which represents a new paradigm. Moreover, there is a dedicated NeSy conference (https://sites.google.com/view/nesy2024) and workshop (https://nuclear-workshop.github.io/). The number of such papers presented at ICML is consistently increasing year by year (https://arxiv.org/pdf/2105.05330.pdf), and the research direction is being pursued by various companies, such as IBM (https://research.ibm.com/topics/neuro-symbolic-ai).
**Weakness 3 (Evaluation):**
I don't think the evaluation setup is fair, specifically 4.1.2. This setup does not seem to reflect a real-world setting and from what I gather, it's chosen to favor the specific design of the author's proposed method. The crux of the evaluation is that the other methods do not learn weights for hidden variables when they do personalized FL. I can certainly believe this. However, it's the author's job to convince me that this shortcoming is actually related to some real-world problem. Instead, the authors have set up the evaluation such that the hidden variables are of the utmost importance, because each unseen client can only be learned from the neighboring seen clients. This just shows me that the proposed method is able to handle this setting. But why do we care about this setting?
**Response to Weakness 3:**
Our evaluation setting is consistent with baseline works such as Per-FedAvg, pFedMe and pFedBayes, all of which similarly impose distribution shifts on class labels for datasets such as MNIST or CIFAR; our setup is the same in nature, imposing distribution shifts on the NER category. Regarding the claim that our setup is unfair, we would like to emphasize that other methods can also learn this hidden-variable information, such as Bayesian-based methods, and that using explicit variable information in other methods such as LR-XFL will probably outperform our method in logical rule learning on a small range of data, but those methods struggle to handle large-scale rule discovery as well as ours does. This illustrates the fairness of our setup.
**Weakness 4 (Formatting):**
The running title still reads "Submission and Formatting Instructions for ICML 2024". Equation (8) is surrounded by an excess of whitespace. The use of inline rules makes some of the text difficult to parse. I believe the paper could do with a more detailed background section given that the average reader may not be familiar with NSL.
**Response to Weakness 4:**
The whitespace around Equation (8) is a result of the LaTeX template's float layout; we have not added any artificial whitespace, and it reflects a trade-off between the readability of all the formulas and adherence to the overall ICML layout. We already provide a two-paragraph introduction to the neuro-symbolic background, and even include a detailed graphical illustration in Figure 1. These minor edits, including the running title, will be addressed in the final version.
**Questions:**
How does the proposed method compare to other federated EM methods, e.g., https://arxiv.org/pdf/2310.15330.pdf?
**Response to Questions:**
Firstly, the related work mentioned by the reviewer also focuses on transmitting distribution probability estimates. If the reviewer acknowledges the privacy-preserving nature of the transmission medium in that work, why is our work, which transmits the same kind of distribution probability estimates, completely rejected? Secondly, that work differs significantly from ours, as can be seen from Section 2.3 and Figure 1 of that paper. As illustrated in Table 1 of our paper, it adheres to the traditional paradigm in which the upper and lower layers of a server-client model share the same problem dimensions, instead of the innovative paradigm of our work, which can handle inconsistencies between the types of upper and lower models. Because its E- and M-steps are performed locally, the server's role is to use the L2-norm for personalized constraints; this essentially aligns with the regularization-based PFL category in Table 1 and is thus not as innovative as our approach. This related work further underscores the importance and novelty of our contribution. Finally, we want to emphasize that this is concurrent work; we will cite this paper in our final version.
In summary, we wish to emphasize that it is unreasonable for this emergency reviewer to provide negative feedback simply because they have not heard about neuro-symbolic learning. We are confident that this will not go unnoticed by any experienced researcher. Thank you for giving our paper its due consideration and for all the effort put in to ensure a productive review process for ICML.
From Authors of paper \#6855.
## For AC/PC
Dear AC/PC,
We are delighted to see that all the reviewers gave positive feedback on this work. Comments from all the reviewers were really helpful, which we believe have been fully addressed in detail in our rebuttal.
Also, we have further provided numerical experiments across different levels of data heterogeneity. These experiments show that the higher the degree of heterogeneity, the more pronounced our method's advantage becomes, as FedNeSy leverages this information exchange more effectively.
We thank you for your time in processing our paper and look forward to engaging discussions if there are questions raised by the reviewers.
Authors of paper \#6855.
## For Reviewer LsdL
Thank you for your helpful comments and questions.
**Weakness 1:**
The paper claims that the proposed method is privacy preserving. However, the only argument is that only the rule prior and posterior probability values are shared. Why does sharing only the rule distribution preserve privacy? Distributional information may also leak private information---which is why differential privacy is needed. Is the proposed method really privacy preserving without relying on other privacy preserving mechanisms?
*Response to Weakness 1:*
Under FedNeSy, the level of privacy preservation is aligned with existing federated learning approaches such as FedAvg. Rather than exposing the sensitive local data, we follow the established approach of exchanging model parameters. In this paper, we focus on designing the neural-symbolic learning technique under federated settings. Enhancements to the privacy preservation aspect will be further explored in future work. We will clarify this point in the revised manuscript. Thank you.
**Questions 1:**
How does the proposed method handle different levels of data heterogeneity? I would expect this kind of experiments, as different methods may excel at different levels of data heterogeneity.
*Response to Questions 1:*
We further tested the performance of the proposed method in a comparison with other baselines across different levels of data heterogeneity. It can be seen in the following figures that FedNeSy outperforms the baselines for different ratios of training-testing data heterogeneity. The settings are as follows:
> '0% (homo)' refers to the results where training and testing data share the same distribution.
> '33% (hetero)' denotes testing data of which 33% follows a distribution different from the training data (unseen testing data), with the remainder being homogeneous testing data.
> '50% (hetero)' denotes testing data of which 50% follows a distribution different from the training data (unseen testing data), with the remainder being homogeneous testing data.
These scenarios characterize the heterogeneity of distribution shifts in federated learning. The results show that the higher the degree of heterogeneity, the more pronounced our method's advantage is. Therefore, compared with other methods, FedNeSy addresses the issue of complementary training-testing data heterogeneity across clients more effectively.
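For concreteness, splits of this kind can be constructed as in the following minimal sketch (the helper and variable names are illustrative, not the paper's actual data pipeline):

```python
import random

def make_test_split(seen_pool, unseen_pool, n, hetero_ratio, seed=0):
    """Build a test set in which a `hetero_ratio` fraction of the samples
    comes from a distribution unseen during training (illustrative helper)."""
    rng = random.Random(seed)
    n_unseen = int(n * hetero_ratio)
    # Mix unseen-distribution samples with homogeneous samples.
    test = rng.sample(unseen_pool, n_unseen) + rng.sample(seen_pool, n - n_unseen)
    rng.shuffle(test)
    return test

# Stand-ins for homogeneous vs. shifted-distribution samples.
seen = list(range(1000))
unseen = list(range(1000, 2000))

# The 0% (homo), 33% (hetero), and 50% (hetero) settings.
for ratio in (0.0, 0.33, 0.5):
    split = make_test_split(seen, unseen, 100, ratio)
    print(ratio, sum(1 for x in split if x >= 1000))
```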

**Questions 2:**
How does the running time of the proposed method compare to the baselines?
*Response to Questions 2:*
The following table shows the running time corresponding to Figure 2 (a). FedNeSy is the second fastest among all comparison approaches.
| Algorithm | FedAvg | pBayes | PerFedAvg | pFedMe | LR-XFL | FedNeSy |
|------------|--------|--------|-----------|--------|--------|---------|
| Time (s) | 150.6 | 376.0 | 151.2 | 188.4 | 136.1 | 149.3 |
## For Reviewer JXw8
Thank you for your helpful comments and questions.
**Question 1:**
What is the need for a federated learning approach for neuro-symbolic learning?
*Response to Questions 1:*
Neuro-symbolic learning tasks, like question-answering systems, often adopt a server-client architecture, as exemplified in Figure 1 (a). The server side typically hosts a general neuro-symbolic model. When multiple users interact with the shared model, it needs to gather diverse contexts from these users to update itself and provide personalized logical responses. The shared model must address rule conflicts arising from rule heterogeneity while handling sensitive user data. For such one-to-many server-client systems involving sensitive data, a federated version of neuro-symbolic learning will be useful.
**Question 2:**
Is the proposed method applicable to other federated learning settings?
*Response to Questions 2:*
Yes, our learning structure is generic, allowing for the modeling of coupling between upper and lower level problems with diverse data distributions, thereby covering a broad range of federated probabilistic inference problems. The proposed federated variational EM algorithm, to the best of our knowledge, is the first of its kind to address this class of problems by decoupling the optimization process dependencies between the server and clients.
**Question 3:**
What are the limitations of the proposed method?
*Response to Questions 3:*
The current version of FedNeSy is based on first-order logic, but this limitation is not unique to our work. It is also present in general neuro-symbolic learning techniques. In subsequent work, the incorporation of default logic, temporal logic and fuzzy logic might hold promise in enhancing the expressiveness and modelling capabilities of the federated neuro-symbolic learning area.
## For Reviewer CBTT
Thank you for your helpful comments and questions.
**Question 1:**
In Figure 2, (a1) demonstrates that FedNSL significantly outperforms other methods, while (a2) shows that FedNSL is inferior to other methods. What could explain this phenomenon?
*Response to Questions 1:*
Subgraphs (a1) and (a2) only illustrate the performance on the training sets of individual clients. Our goal is to emphasize that examining them together, i.e., the results of subgraph (a3), is more meaningful.
Specifically, this experiment's setup is distributionally heterogeneous. For example, client 1 has data from sub-distributions 1 and 2 of a GMM, lacking data features from sub-distribution 3, while client 2 has data from sub-distributions 2 and 3, lacking data features from sub-distribution 1. Due to the presence of unbalanced classes, handling this kind of heterogeneity inappropriately can lead to high accuracy on certain classes while hurting the accuracy on others. Since the missing classes of the two classifiers are complementary, this leads to a competing-accuracy scenario between client 1 and client 2: if one client's accuracy is high, the other's tends to be low. This explains why the compared methods, although superior on (a2), perform much worse on (a1).
>The significant advantage of our method is the ability to balance this complementary heterogeneity-induced competing accuracy problem. Although it might not perform the best on (a2), its overall average result is superior to other methods.
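As an illustration of this complementary setup, a minimal NumPy sketch (the component means and sample sizes are illustrative, not the paper's actual experiment parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

# Three Gaussian sub-distributions (components of a GMM); means are illustrative.
means = [-4.0, 0.0, 4.0]

def draw(component, n):
    """Sample n points from one GMM component."""
    return rng.normal(means[component], 1.0, size=n)

# Complementary heterogeneity: client 1 never sees component 3 (index 2),
# client 2 never sees component 1 (index 0).
client1 = {c: draw(c, 200) for c in (0, 1)}  # sub-distributions 1 and 2
client2 = {c: draw(c, 200) for c in (1, 2)}  # sub-distributions 2 and 3

missing1 = set(range(3)) - set(client1)
missing2 = set(range(3)) - set(client2)
# Each client's missing class is exactly the one held by the other client.
print(missing1, missing2)
```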
**Question 2:**
The experimental setup includes extremely heterogeneous datasets, where the training and testing distributions of clients differ significantly. This is uncommon in real-world scenarios. I would like to understand the superiority of FedNSL compared to existing methods under different levels of heterogeneity, and even homogeneous distributions.
*Response to Questions 2:*
We further tested the performance of the proposed method in a comparison with other baselines across different levels of data heterogeneity. It can be seen in the following figures that FedNeSy outperforms the baselines for different ratios of training-testing data heterogeneity. The settings are as follows:
> '0% (homo)' refers to the results where training and testing data share the same distribution.
> '33% (hetero)' denotes testing data of which 33% follows a distribution different from the training data (unseen testing data), with the remainder being homogeneous testing data.
> '50% (hetero)' denotes testing data of which 50% follows a distribution different from the training data (unseen testing data), with the remainder being homogeneous testing data.
These scenarios characterize the heterogeneity of distribution shifts in federated learning. The results show that the higher the degree of heterogeneity, the more pronounced our method's advantage is. Therefore, compared with other methods, FedNeSy addresses the issue of complementary training-testing data heterogeneity across clients more effectively.

**Question 3:**
Figure 2 lacks a meaningful caption to help readers understand the figure.
*Response to Questions 3:*
The revised caption is as follows:
"Group (a) presents the numerical experiment results. The first row contains (a1), (a2) and (a3), which respectively show the training accuracy of the classifiers for client 1, client 2 and the average results. The second row, featuring (a4), (a5) and (a6), displays the unseen testing accuracy for the classifiers of client 1, client 2 and the average results. Group (b) shows the real-data experiment results, including F1-scores in (b1), logic accuracy in (b2) on both the unseen and seen testing data with and without KL-divergence rule distribution constraints (denoted by “W/O. KL” and “W. KL”), and (b3) illustrates how different coefficients of KL-divergence constraint affect the personalization performance."