## General response ##
We thank all reviewers for their time and valuable feedback, and appreciate that reviewers found our work novel and relevant.
We have now conducted additional experiments to address the main concerns raised by reviewers, and hope to have cleared up some misunderstandings.
### We would like to highlight the main findings from additional experiments: ###
**Applicability to larger groups of agents and larger coalition sizes:**
In an additional Tragedy of the Commons experiment, we increased the observed coalition size from 3 to 10 and the number of demonstrator agents from 12 to 120, yielding roughly $10^9$ possible coalitions among the 120 agents. We employed the sampling-based estimator for EVs from section 4.3 and used the methodology described in section 5.2. In this evaluation, Behavior Cloning (BC), which imitates all agents in the dataset, achieves only 4% of the performance of our method EV-BC, which imitates only agents with high Exchange Value. Compared to the smaller-scale experiment in the main paper, where BC reached 12.3% of EV-BC's performance, these evaluations validate our method's applicability to larger group and coalition sizes.
EV-BC (Ours) outperforms BC by a large margin across different DVFs (metrics):
| Metric | BC | EV-BC (Ours)|
|-----------------------------------------------------------|-----------|-----------|
| Total final Resources ($v_{final}$) | 388.3 | 11348.5 |
| Total consumed Resources ($v_{total}$) | 43.6 | 381.09 |
| Minimum consumed resources by any agent ($v_{min}$) | 0.05 | 0.37 |
**Large performance gap to baseline methods also for non-fully-anonymized datasets:**
In our main paper, we evaluated fully-anonymized datasets in Overcooked (see results in Table 2), where each agent appears in only one coalition, mirroring many real-world situations and presenting the most challenging credit assignment problem. We have now further assessed our method with multiple data points per agent. When the dataset contains observations for 30% of all possible coalitions, our approach surpasses the key baseline (reward-BC) by 32.21%, a larger margin than the 20.35% on the main paper's fully-anonymized datasets, further underlining the importance of not imitating agents whose behavior is misaligned with the DVF.
Results for 30% of coalitions observed:
| Imitation method | Cramped Room $D^{\lambda}$| Coordination Ring $D^{\lambda}$ | Cramped Room $D^{adv}$| Coordination Ring $D^{adv}$|
|-|-|-|-|-|
| vanilla BC | 12.6 ± 3.34 | 18.13 ± 6.21 | 31.7 ± 8.96 | 14.21 ± 3.78 |
| reward-BC | 64.33 ± 6.1 | 29.4 ± 7.01 | 75.7 ± 13.98 | 16.8 ± 5.66 |
| EV-BC (ours) | **101.33** ± 14.37| **38.3** ± 6.81 | **138.8** ± 18.6 | **22.0** ± 6.1 |
### Additional explanations of the most frequent misunderstandings: ###
It appeared unclear to some reviewers why (1) a sampling-based estimate and (2) clustering of agents are needed to compute Exchange Values for real-world datasets.
**Sampling-based estimate for EVs**: Not every possible combination of agents will have observations in real-world datasets. For instance, two athletes might never team up in a sport. Therefore, instead of having a full set of observations, the dataset might have just a subset of them. From this (sampled) subset, Exchange Values can still be estimated without any bias as outlined in section 4.3.
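As a minimal illustration (this is *not* the paper's exact Eq. (2); all names are our own), the estimator can be sketched as a Monte-Carlo average of DVF differences between observed coalitions that differ by exchanging a single agent:

```python
import random
from collections import defaultdict

def estimate_exchange_values(dataset, num_samples=10_000, seed=0):
    """Monte-Carlo sketch of Exchange Value estimation from a partial dataset.

    `dataset` maps a frozenset of agent ids (an observed coalition) to its
    DVF score. Illustrative only: the exact exchange contribution follows
    Eq. (2) in the paper; here we simply average score differences between
    coalition pairs that differ by exchanging exactly one agent.
    """
    rng = random.Random(seed)
    totals, counts = defaultdict(float), defaultdict(int)
    coalitions = list(dataset)
    for _ in range(num_samples):
        c1, c2 = rng.sample(coalitions, 2)
        swapped_in, swapped_out = c1 - c2, c2 - c1
        if len(swapped_in) == 1 and len(swapped_out) == 1:  # one-agent exchange
            (agent,) = swapped_in
            totals[agent] += dataset[c1] - dataset[c2]
            counts[agent] += 1
    return {agent: totals[agent] / counts[agent] for agent in totals}
```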
**Clustering of agents**: In cases where only a few coalitions are observed, the Exchange Values derived from the dataset might be inaccurate. This is because some agents might be observed only in very few instances. To reduce the variance in the estimate of Exchange Values, agents that behave similarly are grouped together. This group (or cluster) is then considered as one "meta agent" for the purpose of computing Exchange Values. By doing so, the estimate becomes more stable as it is derived from a bigger pool of observations. Figure 4 demonstrates that as the size of the observation set decreases, the estimation error in EVs rises. However, through clustering, this estimation error is significantly reduced, as highlighted by the rightmost data points in Figure 4.
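A sketch of the meta-agent pooling (the helper `cluster_of`, mapping each agent id to its cluster id, is hypothetical) could look as follows; Exchange Values estimated on the pooled dataset then apply to every member of the cluster:

```python
def pool_by_cluster(dataset, cluster_of):
    """Treat each cluster as one 'meta agent' by relabeling coalitions.

    Coalitions that collapse onto the same meta-coalition pool their
    observations; here we average their DVF scores. Illustrative sketch.
    """
    pooled = {}
    for coalition, score in dataset.items():
        meta = frozenset(cluster_of[agent] for agent in coalition)
        pooled.setdefault(meta, []).append(score)
    return {meta: sum(scores) / len(scores) for meta, scores in pooled.items()}
```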
We are more than happy to answer any follow-up questions.
Thank you for your time.
Best wishes,
The Authors
## Reviewer 2uWd (Score 7) ##
**Weaknesses / Questions**
> When training with BC, do you train a policy which controls an individual agent over all considered agents' trajectories or a joint policy across the entire team?
R1: We train an individual policy, as we assume that the trained agent should be able to collaborate with other agents (which may be controlled by other entities). Our training procedure for BC follows that of [A], using the given codebase.
>In case the trained policy controls a single agent, do all agents in the environment use this same policy during evaluation?
R2: Yes, all agents use the same policy during evaluation. Note that the considered environments are symmetric with respect to agents, so this is possible. If this were not the case, one policy per agent would be trained.
>Given the metric of exchange values will focus the imitation learning on agents which would positively contribute to *any* team, how trained agents perform in random teams of other policies? Would the agents trained with EV-BC be more robust?
R3: This is an interesting consideration. We agree that the Ad-Hoc teamplay [B] capabilities of EV-BC agents may well be higher (i.e., they may be more robust to unseen policies). We are currently running this analysis and hope to add results before the end of the discussion period.
>What does the relative achieved alignment percentage in Table 1 correspond to?
R4: This metric is the relative performance with respect to the maximum DVF score in each row (in this case the maximum score is always achieved by EV-BC). Specifically, BC achieves (23.5%, 12.3% and 1.3%) of the score achieved by EV-BC. For clarity, we have now also added a table with the raw scores to Appendix 5.
>An indication of the distribution of performance within each dataset for Table 2 would be helpful to get a sense of the relative performance of trained agents with respect to the dataset. E.g. one could provide the minimum, mean, and maximum performance of any trajectory in each respective dataset.
R5: Thank you for this suggestion. We have now added this to the main paper and also updated Table 2, where we now report performance relative to the 95th percentile performance in each dataset. The minimum, mean, and maximum of each dataset are as follows:
Dataset statistics:
| Statistic | Cramped Room $D^{\lambda}$| Coordination Ring $D^{\lambda}$ | Cramped Room $D^{adv}$| Coordination Ring $D^{adv}$|
|-|-|-|-|-|
| minimum |0 | 0 | 0 | 0 |
| mean |20.6 ± 33.58 | 12 ± 19.39 | 16.91 ± 40.64 | 3 ± 11.15 |
| maximum |150 | 80 | 160 | 80 |
>It would be helpful to give an indication of the computational cost of the proposed clustering approach.
R6: Please refer to Appendix 6.5 for details on the computational demand. Generally, the variance-maximisation step consumed the most computation time, taking up to two hours per experiment on a CPU.
**Citations:**
[A] Carroll et al., "On the Utility of Learning About Humans for Human-AI Coordination", NeurIPS 2019
[B] Cui et al., "K-level Reasoning for Zero-Shot Coordination in Hanabi", NeurIPS 2021
## Reviewer Knch ##
**Weaknesses**
>It is challenging to estimate the Shapley value as discussed in [A]. It is not clear how Eqn (2) is derived between line 151-160.
R1: We recognize that this derivation is non-trivial, but it underpins a core contribution of our paper. The Exchange Value is defined using the exchange contribution (introduced in line 146), which is analogous to the marginal contribution used to define Shapley Values. In lines 153 to 160 we establish a connection between Exchange Values and Shapley Values, based on the Efficiency property of Shapley Values (see e.g. [B] for properties of Shapley Values; we found that Wikipedia has an adequate summary).
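For reference (these are the standard definitions, not our paper's Eq. (2)), the Shapley Value of agent $i$ and the Efficiency property we invoke read:

$$\phi_i(v) \;=\; \sum_{S \subseteq \mathcal{N} \setminus \{i\}} \frac{|S|!\,(|\mathcal{N}|-|S|-1)!}{|\mathcal{N}|!} \bigl( v(S \cup \{i\}) - v(S) \bigr), \qquad \sum_{i \in \mathcal{N}} \phi_i(v) \;=\; v(\mathcal{N}).$$

Here $v(S \cup \{i\}) - v(S)$ is the marginal contribution; the exchange contribution plays the analogous role in the definition of Exchange Values.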
>Clustering by maximizing the variance can be a good measure. However, in alignment problem, some trajectories in the offline dataset are not safe and it has very small friction. The key drawback is no method proposed to process outliers.
R2: We are not sure whether we understand this question correctly and kindly ask for clarification. If a few agents in the dataset demonstrate outlier (misaligned) behavior, our method assigns them a low Exchange Value (as they have a negative contribution to the DVF), and avoids imitating their behavior. Therefore there is no drawback from outliers.
>In the experiment, the number of the agent is small. In overcooked scenario, there are only 2 agents, which is trivial.
R3: We politely disagree that the 2-agent case is trivial. In fact, credit assignment is a hard problem for any coalition size if agents are never observed acting individually (as is the case in the Overcooked experiments). Generally, interaction effects among agents are unknown a priori, even for group sizes of two.
>It is not clear why clustering is needed? The motivation is not clear.
R4: In cases where only a few coalitions are observed, such as in fully-anonymised real-world datasets, the Exchange Values derived from the dataset might be inaccurate. This is because some agents might be observed only in very few instances. To reduce the variance in the estimate of Exchange Values, agents that behave similarly are grouped together. This group (or cluster) is then considered as one "meta agent" for the purpose of computing Exchange Values. By doing so, the estimate becomes more stable as it is derived from a bigger pool of observations. Figure 4 demonstrates that as the size of the observation set decreases, the estimation error in EVs rises. However, through clustering, this estimation error is significantly reduced, as highlighted by the rightmost data points in Figure 4.
>The writing needs improvement
R5: We have updated the manuscript and included additional explanations especially in the methods section (making use of the additional page for the camera-ready). If there is more specific feedback on what needs improvement, we will be happy to change it.
**Questions:**
>Can you provide an upper-bound performance by using the human-feedback, i.e., score is given by human?
R1: We are not sure we understand this question correctly and kindly ask for clarification. We assume the reviewer is asking whether we could derive an upper bound by using human annotations for the behavior of individual agents, thereby using human feedback to perform credit assignment instead of computing Exchange Values for each agent. Unfortunately, we are unaware of a dataset that provides such human annotations, and we were unable to source such annotations ourselves. Instead, we work with a human-chosen proxy annotation in Overcooked derived by the authors of [C], i.e. the keystrokes per second of a participant. Further, we would like to point out that while sourcing human annotations at a trajectory level may be feasible (in our setting given by the desired value function DVF), sourcing per-agent behavior annotations may often be infeasible at scale or impossible in complex domains.
>When the number of the agent is increasing, is the method still good?
R2: In an additional Tragedy of the Commons experiment, we increased the observed coalition size from 3 to 10 and the number of demonstrator agents from 12 to 120, yielding roughly $10^9$ possible coalitions among the 120 agents. We employed the sampling-based estimator for EVs from section 4.3 and used the methodology described in section 5.2. In this evaluation, Behavior Cloning (BC), which imitates all agents in the dataset, achieves only 4% of the performance of our method EV-BC, which imitates only agents with high Exchange Value. Compared to the smaller-scale experiment in the main paper, where BC reached 12.3% of EV-BC's performance, these evaluations validate our method's applicability to larger group and coalition sizes.
EV-BC (Ours) outperforms BC by a large margin across different DVFs (metrics):
| Metric | BC | EV-BC (Ours)|
|-----------------------------------------------------------|-----------|-----------|
| Total final Resources ($v_{final}$) | 388.3 | 11348.5 |
| Total consumed Resources ($v_{total}$) | 43.6 | 381.09 |
| Minimum consumed resources by any agent ($v_{min}$) | 0.05 | 0.37 |
**Conclusion**
We hope that the additional experiments verifying the applicability of our method to large groups of agents, together with the clarifications above, are satisfactory. If so, we kindly ask the reviewer to consider updating the score.
**Citations**
[A] "Shapley Q-value: A Local Reward Approach to Solve Global Reward Games", Wang et al., AAAI 2020
[B] "Computational aspects of cooperative game theory", Chalkiadakis et al., Springer Nature 2022
[C] "On the Utility of Learning About Humans for Human-AI Coordination", Carroll et al, NeurIPS 2019
## Reviewer 7ion
**Weaknesses:**
>The theoretical derivation of the method of maximizing variance in the paper is based on the assumptions of agent behavior similarity and inessential game. The environments in the experiments cannot guarantee this assumption, and the relevance between theory and experiments still needs to be further explained.
R1: Thank you for this remark. We would like to point out that the inessential game assumption is only made for the *theoretical analysis of the clustering objective*, and in most scenarios this assumption is dropped. For example, in Fig. 4 only the clustering results depend on it; all others are free from such assumptions. Furthermore, this assumption is frequently made in credit assignment problems, e.g. for computing Shapley Values in the field of machine learning interpretability [A,B].
>The calculation of Exchange Value requires computing the exchange contribution over all permutations of and all non-empty slices with known team coalitions, which is cumbersome. The application in complex environments may be limited.
R2: It is true that to directly calculate the *exact* Exchange Value (EV) we must consider all permutations. To mitigate this problem, we introduce a sampling-based estimator of EVs (which converges to the true EV in expectation) in section 4.3, and show in section 5.2 that it yields strong results.
We further conducted additional experiments in the Tragedy of the Commons domain that verify the applicability of the EV estimator to a problem setting with roughly $10^9$ possible coalitions; please refer to the general response for details.
> .. nice to see more details on the quantitative experiments related to Exchange Value.
R3: Figure 4 shows a *quantitative* analysis of the estimation error in Exchange Values as a function of the size of the set of observed coalitions. If any quantitative experiment is deemed missing, we will be happy to add it.
**Questions:**
>What are the metrics of the values in Table 1? The results in the table show that imitating the agents that have positive contributions to DVF can achieve better performance.
R4: This metric is the relative performance with respect to the maximum DVF score in each row (in this case the maximum score is always achieved by EV-BC). Specifically, BC achieves (23.5%, 12.3% and 1.3%) of the score achieved by EV-BC. For clarity, we have now also added a table with the raw scores to Appendix 5.
>Can positive contributions to DVF only be measured by EV? What are the advantages of EV over other possible measurement of contributions to DVF methods?
R5: The most frequently used method to evaluate the contributions of individual agents to group outcomes is Shapley Values. As outlined in section 4.1, however, Shapley Values are not applicable in most real-world scenarios, as these often only permit specific coalition sizes. The permitted group sizes generally depend on the application (environment); e.g., a football game typically permits exactly 11 players, or an autonomous driving dataset might only contain observations for groups of three cars and more. The introduced Exchange Values overcome this restriction on group sizes.
>In Figure 2, how is the latent parameter λ defined and what is its physical meaning? Is EV the only one function related to the latent parameter λ? What is the necessity of using EV instead of other forms?
R6: Thank you for this remark. For λ = 1, agents act (near-)optimally; for λ = −1, agents act adversarially. This is implemented by sampling subgoals for agents according to λ: for λ = 1 only optimal subgoals are sampled, for λ = −1 only adversarial subgoals are sampled, and for intermediate values the probability changes proportionally to λ (see lines 230-241).
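As a minimal sketch of this sampling scheme (names are illustrative; the exact subgoal sets are environment-specific):

```python
import random

def sample_subgoal(optimal_subgoals, adversarial_subgoals, lam, rng=random):
    """Sample a subgoal for a demonstrator parameterised by lambda in [-1, 1].

    lam = 1 always yields an optimal subgoal, lam = -1 always an adversarial
    one; in between, the optimal-subgoal probability scales linearly with lam.
    Illustrative sketch of the scheme described in lines 230-241.
    """
    p_optimal = (lam + 1.0) / 2.0  # linear map from [-1, 1] to [0, 1]
    pool = optimal_subgoals if rng.random() < p_optimal else adversarial_subgoals
    return rng.choice(pool)
```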
>What is the rationale and basis for using keystrokes per second as a latent feature of human behavior trajectories?
R7: Previous work [C] found keystrokes per second to be the best available metric in Overcooked. To avoid any bias in our evaluation, we follow this previous work.
>.. on non-fully anonymous datasets, does the EV-BC method have an advantage?
R8: Yes, the advantage is independent of whether the dataset is fully-anonymized or not. The focus on fully-anonymized datasets was chosen as these are most similar to real-world datasets (such as the human dataset used in the evaluation), and simultaneously represent the hardest possible credit assignment problem.
We now conducted additional experiments on the $D^\lambda$ and $D^{adv}$ datasets with 30% of coalitions observed (instead of fully-anonymised).
We found that the performance gap is similar to that of the fully-anonymised setting presented in the main paper, with reward-BC scoring 20.35% below EV-BC in the fully-anonymised setting and 32.21% below in the 30%-observed setting (see table below). Note that this experiment cannot be run for the fully-anonymised human datasets.
Results for 30% of coalitions observed:
| Imitation method | Cramped Room $D^{\lambda}$| Coordination Ring $D^{\lambda}$ | Cramped Room $D^{adv}$| Coordination Ring $D^{adv}$|
|-|-|-|-|-|
| vanilla BC | 12.6 ± 3.34 | 18.13 ± 6.21 | 31.7 ± 8.96 | 14.21 ± 3.78 |
| reward-BC | 64.33 ± 6.1 | 29.4 ± 7.01 | 75.7 ± 13.98 | 16.8 ± 5.66 |
| EV-BC (ours) | **101.33** ± 14.37| **38.3** ± 6.81 | **138.8** ± 18.6 | **22.0** ± 6.1 |
>Does EV bring a large computational cost? Is there any experimental analysis and explanation for the computational cost of EV?
R9: We would like to refer to Appendix 6.5 for an overview of the computational demand. Generally, EV computation does not incur significant computational overhead; the variance-maximisation clustering can take up to two hours on a regular CPU.
**Citations**
[A] "A Unified Approach to Interpreting Model Predictions", Lundberg et al., NeurIPS 2017
[B] "Improving KernelSHAP: Practical Shapley Value Estimation via Linear Regression", Cover et al., AISTATS 2021
[C] "On the Utility of Learning About Humans for \-AI Coordination", Carroll et al, NeurIPS 2019
## Reviewer L5XT ##
**Weaknesses / Questions**
>The writing could have been improved. The content (sections), especially in section 4 could have improved by connecting sub-sections.
R1: Thank you for the feedback. We have now updated the manuscript and added additional clarifications between the different parts of section 4 (note that this is possible as the camera-ready version allows for an extra page). We would be thankful for any additional pointers in case any transitions seem unclear.
>The main objective is to learn imitation policies for an AI agent. Where is it? How did you justified the desired value function delivers this. It is unclear for me.
R2: Our problem considers imitating diverse datasets while ensuring alignment with a given DVF, which we formulate as a weighted Behavior Cloning objective with weights $\beta$. In our implementation, the weights are given as $\beta^n$, where $n \in \mathcal{N}$ refers to the identity of the agent that generated action $a^n_i$ in state $s_i$, and $\mathcal{N}$ is the set of all agents.
The objective then reads as
$$\min_{\theta} \sum_{n \in \mathcal{N}} \; \sum_{(s_i, a^n_i) \in \mathcal{D}} \beta^n \cdot \lVert \pi^\theta(s_i) - a^n_i \rVert^2_2,$$
where $\pi^\theta$ is the learned imitation policy and $\mathcal{D}$ is the set of demonstrations.
In our experiments for Table 2, we set $\beta^n$ to 1 for agents that have a positive Exchange Value and to 0 otherwise.
Note that the exact objective depends on the environment, especially depending on whether the ordering of agents in the environment is of relevance, which is not the case for the environments presented in this work.
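A minimal PyTorch sketch of this weighted objective (illustrative only; the actual policy and action parameterisation are environment-specific):

```python
import torch

def ev_bc_loss(policy, states, actions, agent_ids, beta):
    """Weighted behavior-cloning loss matching the objective above.

    `beta` is a 1-D tensor of per-agent weights, e.g. 1.0 for agents with a
    positive Exchange Value and 0.0 otherwise; `agent_ids` assigns each
    (state, action) sample to the agent that generated it.
    """
    per_sample = ((policy(states) - actions) ** 2).sum(dim=-1)  # squared L2
    return (beta[agent_ids] * per_sample).sum()
```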
>Could you please justify the use of sampling coalitions for estimating exchange values in section 4.3
R3: Not every possible combination of agents will have observations in real-world datasets. For instance, two athletes might never team up in a sport. Therefore, instead of having a full set of observations, the dataset might have just a subset of them. From this (sampled) subset, Exchange Values can still be estimated without any bias as outlined in section 4.3.
>What is the necessity for clustering? what types of clustering algorithms are explored (K-means, spectral clustering, etc.)
R4: In cases where only a few coalitions are observed, the Exchange Values derived from the dataset might be inaccurate. This is because some agents might be observed only in very few instances. To reduce the variance in the estimate of Exchange Values, agents that behave similarly are grouped together. This group (or cluster) is then considered as one "meta agent" for the purpose of computing Exchange Values. By doing so, the estimate becomes more stable as it is derived from a bigger pool of observations. Figure 4 demonstrates that as the size of the observation set decreases, the estimation error in EVs rises. However, through clustering, this estimation error is significantly reduced, as highlighted by the rightmost data points in Figure 4.
To arrive at the clustered groups of agents, as described in the experiments in section 5.2, we first pre-cluster agents using PCA and k-means, before applying the variance-maximisation approach introduced in section 4.4 (see the sketch below).
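A sketch of this pre-clustering step (assuming one fixed-length behavior-feature vector per agent; the hyperparameters shown are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def precluster_agents(agent_features, n_components=8, n_clusters=10, seed=0):
    """Pre-cluster agents by behavior features before the variance
    maximisation of section 4.4. Returns one cluster id per agent."""
    features = np.asarray(agent_features)
    reduced = PCA(n_components=n_components, random_state=seed).fit_transform(features)
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(reduced)
```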
>"we consider only permit certain coalition sizes" what are these sizes and why these constraints. It is unclear.
R5: The permitted group sizes generally depend on the application (environment); e.g., a football game permits exactly 11 players, or an autonomous driving dataset might only contain observations for groups of three cars and more.
Such restrictions on group size make the application of Shapley Values impossible (see beginning of section 4.1), which is why we propose the concept of Exchange Values.
>There is no section describing limitation and potential societal impact. This should have been considered as the target AI agent's role to imitate behaviour which could have negative societal impact.
R6: We discuss both limitations and potential negative societal impacts in the conclusion. Please refer to the conclusion in the PDF, as this rebuttal response has a space limit.
**Conclusion**
We hope that our answers address the questions in the review and resolve the concerns behind the negatively-leaning score. If not, we will be happy to clarify further.