# General response
We sincerely thank the reviewers for their time and their insightful feedback. We're encouraged by the many positive comments, which highlight the main features of our submission.
- (*topic*) We "address the reward misspecification problem [...] in current RLHF frameworks" (R.QPfR), a problem "that frequently arise in the emerging and important field of aligning generative models with human preferences" (R.TSwH). This "topic [is] of increasing interest and relevance to the community" (R.bXWy).
- (*methodology*) We propose rewarded soup (RS) which "involves individually training multiple networks, each assigned to a different proxy reward, and then linearly combining these networks" (R.ntSF). "The proposed idea is effective yet efficient as it does not require additional training" (R.DNE5) contrary "to the more costly baselines" (R.bJvT) such as MORL.
- (*experiments*) Empirically, we "did a lot of experiments on different task which shows that this interpolating strategy is universal under different application scenarios, while with good performance" (R.stHc). "The paper presents interesting results for many practically relevant and useful benchmarks" (R.bXWy).
- (*theory*) "The approach is well-motivated theoretically" (R.TSwH) and "the theory part connects with experiments very well" (R.TSwH).
We have taken note of the questions and suggested weaknesses, which we directly answer in response to each reviewer.
Most of our answers are based on quotes from the main paper or the Appendix (in particular the theoretical Appendix B.2), which the reviewers might have overlooked.
In contrast, a few questions required new plots to be answered, which we gather in the one-page rebuttal pdf. Specifically:
- Table 1 shows summaries generated by our method. This qualitative inspection is enriched by quantitative evaluations in Figures 1 and 2, with general-purpose quality metrics such as perplexity for text and FID for images. We validate that the generated samples from interpolated models do not suffer from reduced quality ([R.stHc.Q3](https://openreview.net/forum?id=lSbbC2VyCu&noteId=GlQnTpp2gl)).
- Figure 3 quantifies the average efficiency gain of RS over the MORL baseline ([R.bJvT.Q3](https://openreview.net/forum?id=lSbbC2VyCu&noteId=Hztsf2jtoo) and [R.QPfR.Q3](https://openreview.net/forum?id=lSbbC2VyCu&noteId=58jAIy2tXU)).
- Figures 4.a and 4.b plot RS's fronts over the course of fine-tuning, and validate that the LMC holds even for longer trainings ([R.TSwH.Q3](https://openreview.net/forum?id=lSbbC2VyCu&noteId=np0m5ACZtx)).
- Figure 4.c illustrates the empirical difference and the complementarity of rewarded soups and model soups ([R.DNE5.Q1](https://openreview.net/forum?id=lSbbC2VyCu&noteId=6LMJAJD6vx)).
We hope our responses clarify the expressed concerns. If there is anything else we can do to further improve our work, please let us know.
# stHc
We thank R.stHc for the positive feedback on the clarity of our idea and the experiments. We would like to respond to R.stHc's remarks as follows.
---
### Q1. Novelty
Our approach is novel from two perspectives.
The first **conceptual** novelty is arguing for a **multi-objective paradigm** to reduce **reward misspecification** when aligning deep generative models. This first novelty is critical to "handle the diversity of human preferences" (l.50), and as further detailed in Appendix A.1, to "support decision-making" (l.872), "interpretability and explainability" (l.878).
The second **empirical** novelty is proposing rewarded soups, based on new setups/conditions where the **linear mode connectivity** holds (in reinforcement learning, with diverse rewards, even in the multimodal case) and thus where weight interpolation can be used. This second novelty is critical to reduce "the computational, memory, and engineering costs involved" (l.106) in traditional MORL approaches, and as further detailed in Appendix A.2, to be "compatible with the inherent iterative engineering process of alignment" (l.890).
Moreover, we want to point out that in Appendix B.2 "we provide **theoretical** guarantees for the **near-optimality of RS** when considering quadratic rewards" (l.908), as referenced l.141-143 and l.146 in Remark 1. Specifically, in Lemma 3, we bound the reward difference between the optimal policy and our interpolated policy. We give more theoretical details in our response to [R.bXWy.Q2](https://openreview.net/forum?id=lSbbC2VyCu&noteId=rSiwrlT8Be).
---
### Q2. Limitations for the design of networks for LMC?
In our experiments, we consider different network architectures (transformers, CNNs, and MLPs), with various activation functions.
We also investigate different training procedures: with low-rank adapters, and with partial or end-to-end fine-tuning.
We do so for many different tasks and modalities: text generation, image captioning, text-to-image generation, visual grounding, visual question answering, etc.
Our empirical observation is that, across those setups, the **LMC is architecture-agnostic, procedure-agnostic, task-agnostic and modality-agnostic**.
The main condition we require is the shared pre-trained initialization [Neyshabur2020], as emphasized in Remark 1; this "prevents the weights from diverging" (l.145) and forces them "to remain close" (l.146).
The other condition, suggested by the literature [Li2022,Ilharco2023] and as also discussed in [R.TSwH.Q4](https://openreview.net/forum?id=lSbbC2VyCu&noteId=np0m5ACZtx), is that the architecture has enough trainable parameters. Indeed, larger networks may facilitate the orthogonality of the fine-tuned updates; then [Ilharco2023] "speculate that this [orthogonality] enables the combination of task vectors via addition with minimal interference". In conclusion, our experiments and the literature suggest **that the network design is not critical for the LMC, as long as the network is pre-trained and sufficiently parameterized**. Those constraints are arguably minimal given the predominance of the foundation model paradigm and the scaling trend in deep learning.
[Neyshabur2020] What is being transferred in transfer learning? NeurIPS.\
[Li2022] Branch-Train-Merge: Embarrassingly parallel training of expert language models.\
[Ilharco2023] Editing models with task arithmetic. ICLR.
---
### Q3. Does the method harm the absolute quality of the produced samples? Show the generated samples and provide more evaluation
Qualitatively, samples generated by weight interpolated models do not suffer from reduced quality. This was visible for text-to-image generation with diffusion models in Figure 12 from Appendix E.3, where we state: "we can see that all interpolated models produce images of similar quality compared to fine-tuned models". Moreover, our anonymous website (referenced l.856, l.1065, and l.1084), also includes generated samples for the locomotion task and for the text-to-text summarization task. For the sake of completeness, we now include **examples of generated summaries** in the Table 1 from the one-page rebuttal pdf; qualitatively, the summaries generated by interpolated models remain grammatically coherent.
To **quantitatively** validate this insight, the one-page rebuttal pdf includes new plots evaluating the samples generated by RS.
- Figure 1 evaluates the generated summaries when $\lambda$-interpolating between two LLMs fine-tuned on two summary rewards. We leverage two text metrics; the first is (i) **perplexity** (exponentiated average NLL of the generated summaries) according to MLMS [Salazar2020] and GPT2 (following [Lee2021] and this [blog](https://huggingface.co/docs/transformers/perplexity)); the second is (ii) **quality**, as estimated by this [newspaper quality model](https://huggingface.co/valurank/distilbert-quality).
- Figure 2 evaluates the generated images when $\lambda$-interpolating between two diffusion models fine-tuned on two aesthetic rewards. We leverage two standard image metrics; the first is (i) **FID** [Heusel2018] measuring image realism; the second is (ii) **CLIPScore** [Hessel2021] measuring image-text alignment.
In conclusion, we confirm quantitatively that **RS does not deteriorate quality**. More precisely, by interpolating the weights, we also interpolate the metrics; intermediate values of $\lambda$ sometimes even increase quality. We will detail this analysis in the revised paper, and would be pleased to include any other suggested quality metrics.
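As a concrete illustration of metric (i), here is a minimal sketch of the perplexity computation, following the HuggingFace guide linked above (the `gpt2` checkpoint and the example summary are only illustrative):
```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(summary: str) -> float:
    # Perplexity = exponentiated average negative log-likelihood of the tokens.
    ids = tokenizer(summary, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model directly returns the average token NLL.
        nll = model(ids, labels=ids).loss
    return torch.exp(nll).item()

print(perplexity("The senate passed the bill after a lengthy debate."))
```
Lower perplexity indicates more fluent summaries; we report this metric along the $\lambda$-front in Figure 1.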
[Salazar2020] Masked Language Model Scoring. ACL.\
[Lee2021] Towards Few-Shot Fact-Checking via Perplexity. ACL.\
[Heusel2018] GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. NeurIPS.\
[Hessel2021] CLIPScore: A Reference-free Evaluation Metric for Image Captioning. ACL.
# DNE5
We thank R.DNE5 for highlighting the organization of the paper and the diversity of our experiments.
---
### Q1. Similarity and differences with model soups
R.DNE5's main concern relates to the similarity between rewarded soups (RS) and model soups (MS).
First, we fully acknowledge the similarity with MS; actually, as stated l.65, "the name rewarded soups follows the terminology of model soups".
Indeed, RS and MS both average the weights of models fine-tuned from a shared pre-trained initialization.
Yet, we want to clarify that **RS and MS tackle different problems, have different goals, leading to different methods and implementations**.
- RS challenges single-policy approaches to improve alignment in reinforcement learning, and aims at reducing reward misspecification by revealing a Pareto front of solutions across the entire space of preferences: thus RS considers different training objectives for fixed hyperparameters across runs, and non-uniform interpolating coefficients $\lambda$ set a posteriori.
- In contrast, MS challenges the standard model selection after a grid search to improve generalization in supervised learning, and aims at reducing model underspecification and reducing variance by combining all fine-tuned models: thus MS considers different hyperparameters for a fixed training objective across runs, and (usually) uniform interpolating coefficients $\lambda=\frac{1}{M}$.
Overall, these differences mean that **MS cannot be applied to reduce reward misspecification**.
We refer R.DNE5 to Figure 10.b from Appendix D.2 (reproduced and enriched in Figure 4.c from the one-page rebuttal), where we empirically validate this insight for the captioning task, when considering BLEU1 and ROUGE as rewards; specifically, "it presents the fronts described when we interpolate weights fine-tuned on a shared reward, as in model soups. This also only reveals a small portion of the spectrum of preferences, validating the need of diverse rewards to satisfy all users’ preferences" (this quote is from the legend in Figure 10.b).
In summary, regarding the exact statements from R.DNE5:
- "Interpolating weights for better performance is not a new concept": indeed, but we are the first to use weight interpolation for alignment, for models RL fine-tuned with different rewards, in particular for generative and multimodal tasks.
- "the authors did not provide any comparison with model soups": actually we already did, in Figure 10.b from Appendix D.2.
- "When we have $N$ fine-tuned models, rewarded soup can perform better than model soup?": RS will be better in terms of Pareto optimality. Yet, *if the true reward is available before training* and thus there is no reward misspecification, fine-tuning $N$ models on this exact reward (as in MS) will certainly provide better results.
- "For example, in the case of an image captioning task, the experimental setup assumes only two rewarded models, differently fined-tuned models on AVA and cafe datasets". In the image captioning task (Section 3.2), we consider multiple metrics such as BLEU1, BLEU4, ROUGE, and METEOR. In the image generation task (Section 3.3), we consider two models fine-tuned on reward models trained on AVA and cafe datasets. In the latter case, fine-tuning multiple times on the cafe reward would fail to improve the AVA reward, as "the model $\theta_{\text{cafe}}$ performs poorly in terms of AVA" (l.245).
- "how the proposed method is significantly better than other weight interpolation methods like model soup": RS is the only weight-interpolation method seeking Pareto optimality across diverse rewards, thus the other methods will only optimize a metric given a priori, without tackling reward misspecification.
As a final note, in Figure 4.c from the one-page rebuttal, we combine RS with MS, and plot for the captioning task
$\lambda \to \frac{1-\lambda}{2} \cdot (\theta_{BLEU1}^{v1} + \theta_{BLEU1}^{v2}) + \frac{\lambda}{2} \cdot (\theta_{ROUGE}^{v1} + \theta_{ROUGE}^{v2}),$ where $\theta_{BLEU1}^{v1}$ and $\theta_{BLEU1}^{v2}$ are obtained from two independent RL fine-tunings on BLEU1 (and similarly for ROUGE). We confirm that MS mostly reduces variance, while interpolating weights fine-tuned on **different rewards** (as proposed in RS) is key to reveal the front across the entire space of preferences.
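For completeness, here is a minimal sketch of this combination (assuming each fine-tuned model is available as a PyTorch state dict of tensors; the helper names are ours, not from the paper):
```python
def interpolate(state_dicts, coeffs):
    # Weighted average of the weights, key by key.
    return {k: sum(c * sd[k] for c, sd in zip(coeffs, state_dicts))
            for k in state_dicts[0]}

def rs_plus_ms(theta_bleu1_v1, theta_bleu1_v2, theta_rouge_v1, theta_rouge_v2, lam):
    # Model soups within each reward (uniform 1/2 coefficients),
    # rewarded soups across rewards (lambda set a posteriori).
    coeffs = [(1 - lam) / 2, (1 - lam) / 2, lam / 2, lam / 2]
    return interpolate(
        [theta_bleu1_v1, theta_bleu1_v2, theta_rouge_v1, theta_rouge_v2], coeffs)
```
The resulting state dict can then be loaded into the shared architecture and evaluated at any preference $\lambda$.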
---
We hope this answer clarifies the difference between rewarded soups and model soups; we remain available for any further discussion.
# ntSF
We thank R.ntSF for the deep understanding of the paper, for highlighting its strengths, and for asking two intriguing questions, which we try to answer below. Please let us know if there is anything else we can do to further strengthen our submission.
---
### Q1. How does the difference/gap of rewards affect the effectiveness of the MORL baseline and the rewarded soup?
Our experiments in captioning and image generation provide empirical evidence that **the more similar the rewards, the higher the gains of RS versus MORL**.
First, in the **captioning** experiment from Figure 3.c, by analyzing the transfer abilities across the 4 main metrics (BLEU1, BLEU4, ROUGE, and METEOR), we can deduce that:
- BLEU4 and ROUGE are very similar.
- BLEU1 and BLEU4 are more similar than BLEU1 and ROUGE.
- METEOR is an outlier, quite different from other metrics, in particular from BLEU1.
Having established these similarities, we observe that the gains of RS versus MORL are consistent with them.
Specifically,
- with $R_1=\text{BLEU4}$ and $R_2=\text{ROUGE}$, we observe large performance gains for RS versus MORL (in Figure 8.a), where the green front is highly convex far above the solution provided by the MORL objective.
- with $R_1=\text{BLEU1}$, we observe larger gains (and cleaner convexity) for RS versus MORL with $R_2=\text{BLEU4}$ (in Figure 3.b) than with $R_2=\text{ROUGE}$ (in Figure 3.a).
- with $R_1=\text{BLEU1}$ and $R_2=\text{METEOR}$, we observe better performances for MORL than for RS (in Figure 8.b).
Overall all captioning rewards remain sufficiently similar to favor RS over MORL when combining all rewards in Figure 3.c.
Similarly, in the **image generation** experiment, when we consider two (arguably similar) aesthetic rewards in Figure 5.a to fine-tune a diffusion model, RS's front is to the right of and above MORL's front. In contrast, in Appendix E.2, where we also include an *nsfw* reward "inversely correlated with image quality" (l.1058), "MORL has higher scores than RS" (l.1056). This result "shows some limitations of weight interpolation when combining antagonist rewards" (l.1059).
In conclusion, this validates that "when the models are quite different (based on their objectives), the linear combination is likely to produce less favorable results", as anticipated by R.ntSF. This empirical limitation can be explained in two different ways:
- intuitively, from a **loss landscape perspective**, weights fine-tuned on diverse rewards will be more distant, thus potentially breaking the linear mode connectivity.
- theoretically, thanks to **Lemma 3 in Appendix B.2.2**, where we bound the difference between the optimal reward and RS's reward by an RHS term growing with "the maximum of eigenvalues ratio" (l.942) of the rewards' Hessians. This RHS term is illustrated in Figure 7. Then, if the rewards are more diverse, their Hessians have more different eigenvalues, thus the maximum of eigenvalues ratio grows, the RHS term in Lemma 3 grows, and our guarantees for the optimality of RS become loose.
These insights were already briefly mentioned in the main paper (we state l.140 that "we report a few limitations in Appendix and research directions to fix them"), and will be clarified in the revised version.
---
### Q2. How does the number of networks affect the results?
Though most of our experiments are with $N=2$ networks for visualization clarity, "RS can scale and trade-off between more rewards" (l.201).
We validate this empirically in the spider maps from Figure 2.f (for text generation), from Figure 3.c (for image captioning), and from Figure 5.c (for visual grounding), where we combine respectively $M=4$, $M=5$ and $M=3$ networks fine-tuned on $M$ rewards, one reward each.
Another possibility to scale the number of networks, at a fixed number of rewards, is to learn multiple networks on the same reward as in model soups (MS) [Wortsman2022].
We consider this in Figure 4.c from the one-page rebuttal (enriching the previous Figure 10.b from Appendix D.2) for the captioning task.
Specifically, we successfully combine RS and MS and plot $\lambda \to \frac{1-\lambda}{2} \cdot (\theta_{BLEU1}^{v1} + \theta_{BLEU1}^{v2}) + \frac{\lambda}{2} \cdot (\theta_{ROUGE}^{v1} + \theta_{ROUGE}^{v2})$, where $\theta_{BLEU1}^{v1}$ and $\theta_{BLEU1}^{v2}$ are obtained from two independent RL fine-tunings on BLEU1 (and similarly for ROUGE).
In conclusion, in all our experiments, **performances consistently increase with more networks**; when they are trained on different rewards, this reduces reward misspecification; when they are fine-tuned on the same reward, this reduces variance. This confirms the findings from previous works, e.g., Figure B.1 from model soups [Wortsman2022], which showed that increasing the number of averaged models consistently helps.
[Wortsman2022] Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. ICML.
# bJvT
We thank R.bJvT for this review; we try to address the expressed concerns below.
---
### Q1. Novelty and difference with model soups (extended in [R.DNE5.Q1](https://openreview.net/forum?id=lSbbC2VyCu&noteId=6LMJAJD6vx))
Our approach is novel from two perspectives.
The first novelty is arguing for a **multi-objective paradigm** to align deep generative models with human preferences and reduce reward misspecification.
The second novelty is observing new setups/conditions where the **linear mode connectivity** (LMC) holds, and thus where weight interpolation can be used; for example, in reinforcement learning or for multimodal tasks.
This weight interpolation strategy was indeed used in model soups (MS). Yet, we want to clarify that **RS and MS tackle different problems, have different goals, leading to different methods and implementations**.
- RS challenges single-policy approaches to improve alignment in reinforcement learning, and aims at reducing reward misspecification by revealing a Pareto front of solutions across the entire space of preferences: thus RS considers different training objectives for fixed hyperparameters across runs, and non-uniform interpolating coefficients $\lambda$ set a posteriori.
- In contrast, MS challenges the standard model selection after a grid search to improve generalization in supervised learning, and aims at reducing model underspecification and reducing variance by combining all fine-tuned models: thus MS considers different hyperparameters for a fixed training objective across runs, and (usually) uniform interpolating coefficients $\lambda=\frac{1}{M}$.
Overall, these differences mean that **MS cannot be applied to reduce reward misspecification**. We will clarify this difference between RS and MS in the revised version of the paper.
---
### Q2. Diverse rewards
We respectfully disagree with R.bJvT, and argue that **we already use diverse and heterogeneous rewards that are in tension**. For example:
- for the summarization tasks (in Figure 1.b, 2.a and 2.b): $R_1$ evaluates completeness, while $R_2$ evaluates faithfulness.
- for the captioning experiments (in Figure 3.a and 8.b): BLEU1 measures precision while ROUGE evaluates recall, and METEOR captures synonyms.
- for the visual grounding experiments (in Figure 5.b and 14), the different rewards consider objects of different sizes.
The dissimilarities between these rewards are quantitatively validated by our experiments; when fine-tuning on one reward, performance on the others usually worsens.
For example, for captioning "tuning solely BLEU1 sacrifices some points on ROUGE" (l.213); for visual grounding "optimizing for small objects degrades performance on large ones" (l.259).
These examples are arguably representative of "different reward functions learned from users with conflicting interests" (R.bJvT).
Yet, we acknowledge (in Appendix E.2 and in our response to [R.ntSF.Q1](https://openreview.net/forum?id=lSbbC2VyCu&noteId=UEM7DRMGYu)) some "limitations of weight interpolation when combining antagonist rewards" (l.1059). This was suggested by the results for text-to-image generation in Figure 10, where RS underperforms MORL when considering a *nsfw* reward "very different from aesthetic preferences" (l.1057); we argue this is because "this *nsfw* is inversely correlated with image quality" (l.1058). However, we want to emphasize that, in this kind of situation with fully antagonist rewards, the **complementarity of MORL and RS is a promising research direction**, as previously discussed l.1060: "an improved strategy would first learn the MORL [...], and then optimize each reward independently from this improved [MORL] initialization, before applying RS". As another research direction, we suggest (in the legend from Figure 10.a) that: "adding the MORL solutions as intermediate weights may help interpolate between two weights too distant".
---
### Q3. Quantitative efficiency gain (close duplicate of [R.QPfR.Q3](https://openreview.net/forum?id=lSbbC2VyCu&noteId=58jAIy2tXU))
Indeed, "the main strength of the proposed method is that it is more efficient in terms of training (fine-tuning) cost" (R.bJvT) than the MORL baseline. For example, as stated in the legend from Figure 1.b, "with only two trainings [RS] reveals the green front of Pareto-optimal solutions [...] and matches the costly yellow front of MORL requiring [11] trainings on different linear weightings".
As a side note, truly revealing the full MORL front would actually require an infinite number of trainings.
Therefore, we argue that this **efficiency gain is by design**; when considering $N$ rewards, RS only requires $M=N$ fine-tunings, while MORL "requires explicitly maintaining a large set $M \gg N$ networks, practically one for each possible preference" (l.105).
Indeed, as stated l.106, a critical issue in MORL is that “minor [preference] variations may result in significant changes in the solution. Thus, a high level of granularity in the mesh is necessary".
To quantify the efficiency gain of RS, we now provide an analysis in Figure 3 from the one-page rebuttal pdf, where we define a new measure of success: the expected reward $E_{\hat{\mu}\sim Unif\left(0,1\right)} \hat{R}_{\hat{\mu}}$ where $\hat{R}_{\hat{\mu}} = (1-\hat{\mu})\times R_1 + \hat{\mu} \times R_2$ and the expectation is over all possible user preferences $\hat{\mu}$. Then we compute the difference between (i) the expected reward for RS (always with $2$ training runs), and (ii) the expected reward for MORL with $M$ training runs. **Plotting this expected reward advantage for different values of $M$ confirms that MORL needs $M \gg 2$ to match RS**. Moreover, because of the curse of dimensionality, we expect the number of MORL trainings required to match RS to grow exponentially with the number of rewards $N$. In conclusion, these new experiments quantitatively validate that RS is more efficient than MORL, and will be included in the revised paper.
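For reference, below is a minimal sketch of this measure (the function name and sampling scheme are ours; a `front` is the array of $(R_1, R_2)$ scores of the models made available to the user):
```python
import numpy as np

def expected_reward(front, num_samples=10_000, seed=0):
    # front: shape (num_models, 2), the (R1, R2) scores of the available models
    # (lambda-interpolated weights for RS, or M independently trained networks for MORL).
    front = np.asarray(front, dtype=float)
    mus = np.random.default_rng(seed).uniform(0.0, 1.0, size=num_samples)
    # For each preference mu, the user picks the model maximizing (1-mu)*R1 + mu*R2.
    weighted = (1 - mus[:, None]) * front[:, 0] + mus[:, None] * front[:, 1]
    return weighted.max(axis=1).mean()

# The advantage plotted in Figure 3 is then, schematically:
# expected_reward(rs_front) - expected_reward(morl_front_with_M_runs)
```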
# QPfR
We appreciate the time R.QPfR took to review our paper. We hope that the responses below effectively address the expressed concerns.
---
### Q1. Empirical validation of Hypothesis 2
Our introduction to the Section 3 and the Remark 2 explain why "the [RS] front passing through the point obtained by MORL fine-tuning on the average of the two rewards support Hypothesis 2" (R.QPfR). Specifically, from l.176 to l.179, we state that: "as the true Pareto front is unknown in real-world applications, we present empirical support for Hypothesis 2 by comparing the front defined by RS (sliding $\lambda$ between $0$ and $1$) to the MORL's solutions optimizing the $\mu$-weighted rewards for $0\leq\mu\leq 1$ (sometimes only $\mu=0.5$ for computational reasons)". We provide more details l.151 to l.156: "Pareto-optimality in Hypothesis 2 is defined wrt a set of possible weights $\Theta$. Yet, in full generality, improvements in initialization, RL algorithms, data, or specific hyperparameters could enhance performances. In other words, for real-world applications, the true Pareto front is unknown and needs to be defined wrt a training procedure. In this case, $\Theta$ represents the set of weights attainable by fine-tuning within a shared procedure. As such, in Section 3 [and Figure 2] we analyze Hypothesis 2 by comparing the fronts obtained by RS and scalarized MORL".
Then in our text-to-image experiments from Section 3.3, we actually observe that "RS gives a better front than MORL", as it is above and to the right. In our locomotion experiments from Section 3.5, MORL for $\mu=0.5$ actually works slightly better. Overall, MORL and RS usually perform similarly, providing empirical support for the Pareto-optimality of RS.
---
### Q2. Theoretical analysis (extended in R.bXWy.Q2)
Indeed, "Hypothesis 1 and 2 are based on empirical results" (R.QPfR), in the sense that their *full validation* is empirical. That's why we state l.322: "RS relies on an empirical finding: the LMC, which currently lacks full theoretical guarantees, even in the simplest case of moving averages" in supervised learning.
Yet, we respectfully disagree with R.QPfR as **our work already gives theoretical analysis**, in particular in Appendix B.2 where "we provide theoretical guarantees for the near-optimality of RS when considering quadratic rewards" (l.908). Specifically, in Lemma 3, we bound the reward difference between the optimal policy and our interpolated policy. This is referenced in the main paper at two different places, where we state: (i) l.141-143 "we theoretically prove in Appendix B.2 [that our Hypotheses 1 and 2] approximately hold when rewards are replaced by their second-order Taylor expansion with co-diagonalizable Hessians"; and (ii) l.146 "when the weights remain close, we can theoretically justify Hypotheses 1 and 2 (see Appendix B.2) and, more broadly, demonstrate that WI approximates ensembling (see Lemma 4)".
---
### Q3. Computational costs (close duplicate of R.bJvT.Q3)
Indeed, "RS can reduce computational costs" (R.QPfR); as stated in Figure 1.b, "with only two trainings [RS] reveals the green front of Pareto-optimal solutions [...] and matches the costly yellow front of MORL requiring [11] trainings on different linear weightings over the rewards". As a side note, truly revealing the full MORL front would actually require an infinite number of trainings. Therefore, we argue that this **efficiency gain is by design**; when considering $N$ rewards, RS only requires $M=N$ fine-tunings, while MORL "requires explicitly maintaining a large set $M\gg N$ networks, practically one for each possible preference" (l.105).
To quantify the efficiency gain of RS, we provide an analysis in Figure 3 from the one-page rebuttal pdf, where we define a new measure of success: the expected reward $E_{\hat{\mu}\sim Unif\left(0,1\right)} \hat{R}_{\hat{\mu}}$ where $\hat{R}_{\hat{\mu}} = (1-\hat{\mu})\times R_1 + \hat{\mu} \times R_2$ and the expectation is over all possible user linear preferences $\hat{\mu}$ over the $N=2$ rewards. Then we compute the difference between (i) the expected reward for RS (always with $2$ training runs), and (ii) the expected reward for MORL with $M$ training runs. **Plotting this expected reward advantage for different values of $M$ confirms that MORL needs $M \gg 2$ to match RS**. Moreover, because of the curse of dimensionality, we expect the number of MORL trainings required to match RS to grow exponentially with the number of rewards $N$. In conclusion, these new experiments quantitatively validate that RS is more efficient than MORL, and will be included in the revised paper.
---
### Q4. Number of rewards (extended in R.ntSF.Q2)
For visualization clarity, the Pareto fronts were shown for $N=2$ rewards, one on the $x$-axis, the other on the $y$-axis. Yet, "RS can scale and trade-off between more rewards" (l.201). We validate this empirically in the spider maps from Figure 2.f (for text generation), from Figure 3.c (for image captioning), and from Figure 5.c (for visual grounding), where we respectively consider $N=4$, $N=5$ and $N=3$ different rewards.
---
### Q5. Formulas clarity
Yes, both formulas refer to $\\{ \lambda_i \\}_{i=1}^N$. The bounds will be made explicit in the revision.
---
### Q6. Selecting the $\lambda$
As detailed l.163, and later l.223, we already **consider two practical strategies to select the values** of the $\lambda$ coefficients:
1. if the user defines a linear preference $\hat{\mu}$, we can select $\lambda=\hat{\mu}$ .
2. if the user provides some labelled validation samples, we can cross-validate $\lambda$.
We validate in Figure 4.a that both strategies perform well. If the user only provides preference comparisons (as suggested by R.QPfR), we could indeed select $\lambda$ similarly to reward modeling in RLHF.
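For illustration, here is a minimal sketch of the second strategy (the user-provided `validate_fn` scorer and the helper names are hypothetical; weights are assumed to be state dicts of tensors):
```python
import numpy as np

def select_lambda(theta_1, theta_2, validate_fn, grid=np.linspace(0.0, 1.0, 11)):
    # Cross-validate lambda: score each interpolated model on the user's
    # labelled validation samples and keep the best coefficient.
    def interpolate(lam):
        return {k: (1 - lam) * theta_1[k] + lam * theta_2[k] for k in theta_1}
    scores = [validate_fn(interpolate(lam)) for lam in grid]
    return grid[int(np.argmax(scores))]
```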
---
We would greatly appreciate it if you took those clarifications into account during discussions.
# bXWy
We thank R.bXWy for reviewing our work. Yet, with all due respect, there is an inaccuracy in the summary by R.bXWy: "at test time, [we do **not**] infer a reward as a linear combination of these proxy rewards".
---
### Q1. Comparisons with other MORL strategies (see [R.QPfR.Q1](https://openreview.net/forum?id=lSbbC2VyCu&noteId=58jAIy2tXU))
As stated l.292, "when dealing with multiple objectives in deep learning, the common strategy is to combine them into a single reward"; in particular, the linear MORL is now standard to train LLMs with RLHF (see [Glaese2022] or the recent [Wu2023]). That's why, and "as the true Pareto front is unknown in real-world applications" (l.176 and Remark 2), we consider this linear MORL as the reference to evaluate the Pareto-optimality of our approach. Our conclusion was that rewarded soups (RS) is an empirical solution **towards** Pareto-optimality, with indeed a limitation highlighted in the paper's name.
Now, regarding the other MORL strategies, please note that **they are not practical** for large-scale experiments, as acknowledged by R.stHc who stated: "compared with previous work, [our approach is] much more applicable and flexible to complex application scenarios".
For example, (i) "these works are mostly for **academic benchmarks**" (l.299) or "games such as ATARI" (l.890), and (ii) none have been used for RLHF, for fine-tuning foundation models, or for deep networks with billions of parameters.
Critically, their implementations are complex, as most introduce **specific hyperparameters** or even "**modify the training procedure**" (l.300); for example, the reference 130 [Yang2019] requires a change in Bellman equations. In contrast, RS requires zero modification to the optimization algorithm (such as PPO), and thus can be used on top of any RLHF system (such as trl).
If R.bXWy is aware of any open-source implementation of a multi-objective RL method working with PPO for RLHF of LLMs, we would be pleased to run the experiments and include them in the revision.
As a side note, **performance and simplicity are not the only advantages of RS over other MORL strategies, as discussed at length in Appendix A.2**.
In brief, RS "is compatible with the inherent iterative engineering process of alignment" (l.890): "RS can continually include adjusted opinions while preventing forgetting of the old behaviours" (l.891). For example, if a new reward is defined, RS simply requires one additional training; in contrast, the other MORL strategies would require starting again from scratch.
[Glaese2022] Improving alignment of dialogue agents via targeted human judgements.\
[Wu2023] Fine-Grained Human Feedback Gives Better Rewards for Language Model Training.\
[Yang2019] A generalized algorithm for multi-objective reinforcement learning and policy adaptation. NeurIPS.
---
### Q2. LMC and optimal policy
Our Hypothesis 1 tries to properly define the LMC when considering multiple metrics: R.TSwH "found [it] to be very helpful in understanding how RS works". Its empirical validation was arguably far from obvious; yet, we consistently obtain positive results in Section 3, for various setups and scenarios, even for generation tasks involving long-term dependencies such as text-to-text with LLaMA, or image generation with diffusion models. Then, we state l.322: "RS relies on an empirical finding: the LMC, which currently lacks full theoretical guarantees [in our complex RL setup with multiple rewards, but actually] even in the simplest case of moving averages [Izmailov2018]" in supervised learning with one single set of labels.
However, we'd like to respectfully emphasize that **we do provide a theoretical and novel "argument in this paper"**; in Appendix B.2 "we provide theoretical guarantees for the near-optimality of RS when considering quadratic rewards" (l.908). This is referenced in the main paper l.146 in Remark 1 and also l.141-143, where we state: "we theoretically prove in Appendix B.2 [that our Hypotheses 1 and 2] approximately hold when rewards are replaced by their second-order Taylor expansion with co-diagonalizable Hessians, a simplified setup justifiable when weights remain close", and a common assumption in deep learning (as argued in Remark 4). Specifically, considering a linear preference $\hat{\mu}$ over two rewards $R_1$ and $R_2$, Lemma 3 bounds the difference $\Delta R_{\hat{\mu}}$ between the reward obtained by (i) the optimal policy and (ii) our interpolated solution by:
$$ \Delta R_{\hat{\mu}} \leq \frac{\hat{\mu}^2(1-\hat{\mu})^2(M \Delta_1 - \Delta_2)(M \Delta_2 - \Delta_1)}{\left(\hat{\mu}(1-\hat{\mu})(M-1)^2 + M\right)\left(\left( 1 - \hat{\mu}\right) \Delta_1 + \hat{\mu} \Delta_2\right)},$$
where $M$ is the maximum of eigenvalues ratio for rewards' Hessians, $\Delta_1 = R_1(\theta_1) - R_1(\theta_2)$ and $\Delta_2 = R_2(\theta_2) - R_2(\theta_1)$. This bound is illustrated in Figure 7.
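For convenience, here is a direct numerical transcription of this bound (the function name and example values are only illustrative), which can be used to reproduce curves such as those in Figure 7:
```python
import numpy as np

def lemma3_bound(mu, M, delta_1, delta_2):
    # Right-hand side of the Lemma 3 bound, as written above.
    num = mu**2 * (1 - mu)**2 * (M * delta_1 - delta_2) * (M * delta_2 - delta_1)
    den = (mu * (1 - mu) * (M - 1)**2 + M) * ((1 - mu) * delta_1 + mu * delta_2)
    return num / den

mus = np.linspace(0.01, 0.99, 99)
print(lemma3_bound(mus, M=2.0, delta_1=1.0, delta_2=1.0).max())  # worst case over preferences
```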
In conclusion, we **provide guarantees with assumptions on the rewards** being quadratic and co-diagonalizable, thus with indirect "assumptions on the structure of the optimal policy" (R.bXWy). This is acknowledged by R.TSwH who stated: "the approach is well-motivated theoretically" and that "the theory part connects with experiments very well". This theoretical analysis will be put forward in the revision.
[Izmailov2018] Averaging weights leads to wider optima and better generalization. UAI.
---
If this clarifies our empirical and theoretical analyses, we would be extremely grateful if R.bXWy could update their review accordingly.
# TSwH
We would like to thank R.TSwH for this positive review and the great understanding of the empirical and theoretical components of our work.
---
### Q1. Reward misspecification
In the revised version of the paper, we will clarify the discussion l.75 and l.161 on reward misspecification being for linear rewards. Yet, please note that in Figure 4b (and in Figure 9 from Appendix D.2) we actually observe that "despite the lack of theoretical guarantees, weight interpolation improves results **even for non-linear reward**" (l.167). We speculate we actually maximize the projection of the user's reward on the linear subspaces defined by the different proxy rewards.
---
### Q2. Smoothing functions in plots
The curves fit the points with a **Savitzky-Golay smoothing** (inspired from this [blog](https://www.datatechnotes.com/2022/05/smoothing-example-with-savitzky-golay.html)) and a **quadratic interpolation** (inspired from this [stack overflow](https://stackoverflow.com/questions/52014197/how-to-interpolate-a-2d-curve-in-python)). The code is detailed below.
```python
import numpy as np
from scipy.interpolate import interp1d
from scipy.signal import savgol_filter
def smoothing(x, y):
    # Smooth the raw (x, y) points with a Savitzky-Golay filter (window 3, degree 1),
    # while keeping the two endpoints fixed.
    x_smooth = savgol_filter(x, 3, 1)
    x_smooth[0], x_smooth[-1] = x[0], x[-1]
    y_smooth = savgol_filter(y, 3, 1, mode="nearest")
    y_smooth[0], y_smooth[-1] = y[0], y[-1]
    points = np.array([x_smooth, y_smooth]).T
    # Parametrize the curve by its normalized cumulative arc length.
    distance = np.cumsum(np.sqrt(np.sum(np.diff(points, axis=0)**2, axis=1)))
    distance = np.insert(distance, 0, 0) / distance[-1]
    # Resample 75 evenly spaced points along the curve with a quadratic interpolation.
    alpha = np.linspace(0, 1, 75)
    interpolator = interp1d(distance, points, kind="quadratic", axis=0)
    curve = interpolator(alpha)
    return curve.T
```
---
### Q3. How much does the performance of RS depend on policies being close to their shared base models? How does the RS front evolve over the course of finetuning; does it degenerate after some number of gradient updates? (see [R.ntSF.Q1](https://openreview.net/forum?id=lSbbC2VyCu&noteId=UEM7DRMGYu))
As detailed in Remark 1, "when the weights remain close, we can theoretically justify Hypotheses 1 and 2 (see Appendix B.2), and, more broadly, demonstrate that WI approximates ensembling (see Lemma 4 [in Appendix B.3])" (l.146). In other words, good performances are guaranteed when weights are close; thus longer trainings may be worrisome, as the models may potentially diverge in the weight space.
We investigate this question in the one-page rebuttal pdf, for the news summarization task (in Figure 4.a) and for the captioning task (in Figure 4.b); we double the number of training steps, and report multiple RS fronts over the course of fine-tuning.
Fortunately, we **consistently observe good performances for RS along fine-tuning**, confirming that the pre-trained initialization is sufficient to enforce the LMC, validating the insights from previous works [Neyshabur2020].
More precisely, Figure 4.b suggests that the convexity of the interpolated lines is maximal after a few epochs, and then reduces progressively.
This suggests that there exists an ideal number of steps, where the models remain close enough in the weight space, but are already sufficiently diverse to be specialized on the different rewards.
[Neyshabur2020] What is being transferred in transfer learning? NeurIPS.
---
### Q4. Does RS work without LoRA?
Actually, **most of our experiments are without LoRA**.
- for the captioning task, we usually fine-tune the text decoder with the convolutional visual encoder frozen, but we show in Figure 10.d that RS convexity is actually even better when training end-to-end.
- for the diffusion task, we fine-tune 10% of the weights, "corresponding to the cross-attention layers and the bias/scaling parameters" (l.1047).
- for the visual grounding task, we fine-tune the transformer end-to-end.
- for the locomotion task, we fine-tune the MLP end-to-end.
Therefore, we argue that RS is agnostic to the training procedure.
As a final note, [Li2022] have observed in NLP that weight interpolation works even better in larger architectures. Indeed, a large number of parameters may facilitate the orthogonality of the fine-tuned updates observed in [Ilharco2023], which "speculate that this [orthogonality] enables the combination of task vectors via addition with minimal interference". This fact may explain why end-to-end fine-tuning in captioning provides better convexity in Figure 10.d than when keeping the visual encoder frozen in Figure 3.a. Moreover, this insight suggests that, as LoRA reduces the number of trainable weights, **performances might actually get better with full end-to-end fine-tuning than with LoRA** (as currently done in our text-to-text experiments). This is a promising research direction for future work.
[Li2022] Branch-Train-Merge: Embarrassingly parallel training of expert language models.\
[Ilharco2023] Editing models with task arithmetic. ICLR.
---
Thank you once more for your feedback. We remain open to further suggestions and discussions.
# Keep in mind
Finally, regarding the inference cost, please note that we are not more efficient than MORL: actually, both methods use a single network at inference. The only more expensive approach at inference is the ensembling of the predictions, introduced as an oracle in Figure 4.c; we show in Lemma 4 from Appendix B.3 that weight interpolation is actually a cheap approximation of prediction interpolation.
In a more general case, guarantees can be obtained thanks to the similarity between weight averaging and the (more costly) averaging of predictions, detailed in Lemma 4 from Appendix B.3, where we "demonstrate that [weight interpolation] approximates ensembling" (l.147). Then, considering an input $x$, weights $\theta_1$ defining an optimal policy for reward $R_1$ and $\theta_2$ for $R_2$, the success of our weight-interpolated policy $f(x, (1-\lambda) \cdot \theta_1 + \lambda \cdot \theta_2)$ comes down to the success of the interpolation of policies $(1-\lambda) \cdot f(x, \theta_1) + \lambda \cdot f(x, \theta_2)$ in approximating the optimal policy for the interpolated rewards $f(x, \theta_{(1-\lambda)\cdot R_1+ \lambda \cdot R_2})$.
Said differently, the required assumptions for RS come down to those required for prediction ensembling, which is fortunately a classical strategy in reinforcement learning [Kurutach2018,Chua2018,Nagabandi2019] and in multitask learning (as shown in the recent [Dimitriadis2023]).
[Chua2018] Deep reinforcement learning in a handful of trials using probabilistic dynamics models. NeurIPS.\
[Nagabandi2019] Deep dynamics models for learning dexterous manipulation. NeurIPS.\
[Hua2023] Simple emergent action representations from multi-task policy training. ICLR.\
[Dimitriadis2023] Pareto Manifold Learning: Tackling multiple tasks via ensembles of single-task models. ICML.
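To make the comparison between weight interpolation and prediction ensembling concrete, here is a minimal sketch (assuming `f` is a `torch.nn.Module` policy and each $\theta$ a state dict of tensors; the helper names are ours):
```python
import torch

def weight_interpolation(f: torch.nn.Module, theta_1, theta_2, lam, x):
    # Rewarded soups: a single network with lambda-interpolated weights.
    theta = {k: (1 - lam) * theta_1[k] + lam * theta_2[k] for k in theta_1}
    f.load_state_dict(theta)
    return f(x)  # one forward pass at inference

def prediction_ensembling(f: torch.nn.Module, theta_1, theta_2, lam, x):
    # Functional ensembling: interpolate the predictions instead of the weights.
    f.load_state_dict(theta_1)
    y_1 = f(x)
    f.load_state_dict(theta_2)
    y_2 = f(x)
    return (1 - lam) * y_1 + lam * y_2  # two forward passes, thus more costly
```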
The "best existing explanation [from the literature] relies on the similarities between weight interpolation and functional ensembling" (l.324) of predictions, as recalled in Lemma 4 from Appendix B.3; indeed, averaging the predictions is a successful strategy both in supervised learning [Lakshminarayanan2017] and RL [Rajeswaran2017,Kurutach2018,Lee2021].
[Izmailov2018] Averaging weights leads to wider optima and better generalization. UAI.\
[Lakshminarayanan2017] Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. NeurIPS.\
[Rajeswaran2017] EPOpt: learning robust neural network policies using model ensembles. ICLR.\
[Kurutach2018] Model-ensemble trust-region policy optimization. ICLR.\
[Lee2021] SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning. ICML.