# Reviewer ZFmc
We thank the reviewer for their valuable comments. We are grateful that the reviewer found that MPROG has superior performance while requiring fewer resources and without modifying the weights of the edited LLM.
We respond to the questions and concerns below:
**References to relevant works are missing, such as those targeting MLP layers of transformers (ROME/MEMIT). Similarly, the experimental setting is missing comparisons to the methods described above....**
- MEMIT and ROME underperform SERAC on copy-edit datasets such as ZSRE.
*Not sure how to respond well*
**In addition, adding the edits as a prefix prompt during decoding has similarities to the FiD method which is not referenced.**
FiD, like other RAG methods, uses a retrieval module to fetch relevant passages that augment the decoder, which is related to our idea of retrieving relevant edits via the edit selection module. We will add a reference to FiD in the related work where we discuss RAG.
**MPROG does not consistently outperform memory-based baselines, losing to SERAC on "copy edit" datasets.**
MPROG was designed to handle both kinds of edits, those found in copy-edit datasets and those found in entail-edit datasets, which appear in most applications. MPROG only slightly underperforms SEPROG in the copy-edit domain, while the other baselines perform significantly worse; conversely, SEPROG performs poorly on entail-edit datasets. MPROG is therefore the first single model that provides strong performance in both domains while remaining computationally efficient.
**Is the split to "copy-edit"/"entail-edit" datasets based on previous work? Why are datasets that are based on KGs (zsRE/Wikidata5m) in different sets? Is it only because of the more challenging out-scope examples? If so, what explains for the difference in the ES score between SERAC and MPROG?**
To the best of our knowledge, we are the first to make this explicit distinction among editing benchmarks, based on whether the model must capture complex entailment relationships between the edit descriptors, the input, and background knowledge from pre-training. Because of these harder requirements, it is more difficult to distinguish in-scope from out-of-scope inputs related to the edits. Since memory-based models do not leverage the LLM's background knowledge to make this distinction and respond appropriately, they show a lower ES score, being unable to provide accurate predictions on the harder in-scope examples.
**Was the Wikipedia Text Generation task based on previous work? If not, will it be released with the paper?**
Wikipedia Text Generation was introduced in the MEND paper [Mitchell et al., ICLR 2022]. We thank the reviewer and will add the reference.
**Why are different backbone models used for different tasks (lines 236-243)? How are encoder only models (BERT) used if the focus is on encoder-decoder models?**
We use the same base models as prior work for the baselines to ensure a fair comparison (line 236). However, our method makes no assumption about the underlying model architecture, and the same encoder can be used across tasks. For FEVER, which uses BERT as the base model, we use BERT to encode the input $b(x)$ and insert the prompt from the edit selection module into the first layer of the BERT model that produces the predictions. We will clarify this to avoid confusion.
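To make this concrete, below is a minimal sketch (not our exact implementation; the prompt embeddings are assumed to be given by the edit selection module, and the shapes are illustrative) of how a generated soft prompt can be prepended to the token embeddings fed into the first layer of a BERT encoder:

```python
# Minimal sketch: injecting a generated soft prompt into the first layer of BERT.
# `prompt_embeds` (batch, prompt_len, hidden) would come from the edit selection
# module; here it is assumed to be provided.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def encode_with_prompt(text, prompt_embeds):
    enc = tokenizer(text, return_tensors="pt")
    # Token (word) embeddings only; BERT adds position/type embeddings internally.
    word_embeds = model.get_input_embeddings()(enc["input_ids"])
    # Prepend the generated prompt to the token embeddings.
    inputs_embeds = torch.cat([prompt_embeds, word_embeds], dim=1)
    prompt_mask = torch.ones(prompt_embeds.shape[:2], dtype=torch.long)
    attention_mask = torch.cat([prompt_mask, enc["attention_mask"]], dim=1)
    return model(inputs_embeds=inputs_embeds, attention_mask=attention_mask)

# Example usage with a random placeholder prompt of length 8:
# out = encode_with_prompt("Input claim to verify",
#                          torch.randn(1, 8, model.config.hidden_size))
```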
**Are all results across one seed? How many examples are in LoT/Wikidata5m? Are results statistically significant or reproduced between seeds?**
The reported results are averaged over 5 independent runs, and we did not observe any significant variance. On all benchmarks, the advantage of the top-performing model is statistically significant (t-test, p < 0.05).
**Do any of the tasks require editing counterfactuals? Did you check if the models know the edited facts, pre-edit?**
The pre-trained models are typically trained on large corpora drawn from various real-world sources.
The edited facts are typically selected to be negations or simple modifications of facts from popular sources such as Wikipedia, rendering them false. While we expect the base models to hold correct beliefs about the unedited facts in most, if not all, cases, we did not explicitly test this.
**Other limitations, such as the potential impact of editing failures or an analysis of the method's errors are not discussed.**
Our framework can be applied to many domains involving text generation and therefore spans a wide range of applications. We briefly discussed sanity-checking the model and the edit datasets to avoid problems such as the spread of misinformation, as well as ensuring fairness and equity in the model's overall impact.
While this discussion is not exhaustive, we are happy to add other important limitations the reviewer can point to.
# Reviewer XTKo
We thank the reviewer for their valuable comments. We are happy the reviewer appreciates the novelty of leveraging prompt generation as an effective, state-of-the-art method for different kinds of edit datasets.
**Another simple baseline to add is to simply retrieve k-most semantically similar examples from the edit database and add those examples into prompts. This baseline can be unsupervised, which would be good to understand how much benefits the supervised training brings.**
We thank the reviewer for the suggestion. We would like to point out that the SEPROG paper considered a similar baseline, in which the most relevant example from the edit database is selected and added as a prompt to the *base decoder*. This baseline vastly underperformed SEPROG, even on the copy-edit benchmarks, despite the model choosing the relevant prompt being trained in a supervised manner. We would expect an unsupervised version to perform even worse. We would be happy to add this baseline to the paper as well.
# Reviewer BuDA
**What would happen if the number of edits is extremely large, for example, on a scale of 5k. Would the cross attention then be very inefficient and slow? Also, when the number of edits increases, would the cross attention fail to encode the most relevant edits, and hurt the performance**
The runtime complexity of MPROG scales *linearly* with the number of edit examples: the edit encoder produces an embedding for each edit example independently, and the attention layer of the edit selection module attends to each edit embedding with the input embedding $b(x)$ as the only query. This is on par with SERAC, which independently scores each edit example for relevance to the input and selects the most relevant one. Therefore, while a larger edit set requires more compute, the cost grows only linearly.
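To illustrate the single-query structure, here is a sketch with assumed shapes (not our exact implementation): the edit selection attention computes one score per edit, so the cost is $O(N \cdot d)$ for $N$ edits of embedding dimension $d$.

```python
# Illustrative sketch: single-query cross-attention over N edit embeddings.
# One query (b(x)) and N keys/values => N attention scores => cost linear in N.
import torch
import torch.nn.functional as F

d, N = 768, 128                      # embedding dim and number of edits (assumed)
b_x = torch.randn(1, d)              # input embedding b(x): the single query
H_e = torch.randn(N, d)              # one embedding per edit example

W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))

q = b_x @ W_q                        # (1, d)
k, v = H_e @ W_k, H_e @ W_v          # (N, d) each
scores = q @ k.T / d ** 0.5          # (1, N): one relevance score per edit
h_x = F.softmax(scores, dim=-1) @ v  # (1, d): aggregated edit context
```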
Adding an extremely large number of edits makes it harder for the edit selection module to choose the most relevant edit, leading to an average decrease in performance. However, this phenomenon is observed for the other baselines as well, and is especially pronounced for gradient-based methods. We appreciate the reviewer's comment and believe that a thorough study of the data limits of editing methods is a useful research question.
**What's the success rate of the scope classifier across the set of tasks you consider?**
| Model/Dataset | ZSRE | FEVER | Wikipedia | LoT | Wikidata5m |
|---------------|------|-------|-----------|------|------------|
| SERAC | 0.98 | 0.91 | 0.95 | 0.62 | 0.58 |
| MPROG | 0.95 | 0.97 | 0.93 | 0.75 | 0.84 |
We observe that the success rate (accuracy) of scope classification is similar for SERAC and MPROG on the copy-edit datasets. On the entail-edit datasets, we observe higher accuracy for MPROG. This can be attributed both to a more systematic approach of attending over all edit datapoints and to the end-to-end training of all components during generation. In contrast, SERAC trains the individual components for scope classification and generation separately.
# Reviewer uJyY
We thank the reviewer for the valuable feedback and are grateful that they appreciate the effectiveness and efficiency of our prompt-generation based approach. We respond to their comments below:
**Some missing reference..**
We will add the missing references pointed out by the reviewer and discuss how they relate to our work.
**While cross attention is less costly than self-attention there doesn't seem to be an exploration of how this method scales with more edits. Both in terms of computation (more things to attend to) and in terms of performance (as more relevant edits are added can all the relevant information be squeeze into the attention vector bottleneck).**
The runtime complexity of MPROG scales *linearly* with the number of edit examples: the edit encoder produces an embedding for each edit example independently, and the attention layer of the edit selection module attends to each edit embedding with the input embedding $b(x)$ as the only query. This is on par with SERAC, which independently scores each edit example for relevance to the input and selects the most relevant one. Therefore, while a larger edit set requires more compute, the cost grows only linearly.
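As a rough illustration (with assumed notation: $N$ edits, embedding dimension $d$), the per-input cost of the single-query edit selection attention compares to full self-attention over the edits as

$$
\underbrace{\mathcal{O}(N\,d)}_{\text{single-query cross-attention}}
\quad\text{vs.}\quad
\underbrace{\mathcal{O}(N^{2}\,d)}_{\text{self-attention over all edits}},
$$

so the additional cost of a larger edit set grows linearly rather than quadratically.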
Adding an extremely large number of edits makes it harder for the edit selection module to choose the most relevant edit, leading to an average decrease in performance. However, this phenomenon is observed for the other baselines as well, and is especially pronounced for gradient-based methods. We appreciate the reviewer's comment and believe that a thorough study of the data limits of editing methods is a useful research question.
**Was there any work at looking at the interpretability of the generated parameters? For example, can the edits used be inferred from the generated parameters?**
We assume the reviewer meant prompts. Since the generated prompts are real-valued embeddings, it is not straightforward to map them to interpretable tokens. We did not observe any simple relation between the generated prompts and the chosen edits. However, we agree that a systematic study of the interpretability of prompt-generation methods is an interesting research direction.
# Reviewer jYjK
We thank the reviewer for their valuable comments and for appreciating our experimental results and ablations. We would like to address comments in the review.
**The model is similar to the work "Structured Prompting: Scaling In-Context Learning to 1,000 Examples", although the target tasks differ. The backbone of this work has better based on the real large models with at least over 10B parameters. And it would be better to test the proposed method in more popular settings like Structured Prompting work.**
The referenced work is a prompt-generation method for few-shot learning, in which the input attends over subsets of demonstrations from the downstream task to generate the prompt; it is used to adapt the LLM to a single downstream task whose demonstrations are fixed.
In contrast, we train MPROG to adapt to potentially different edit examples for each inference at test time, and we additionally include a classification module to ensure that the input is in-scope.
We thank the reviewer for pointing out the connection between Structured Prompting and MPROG, and we will add it to the related work.
**The proposed method still cannot beat the baseline on 3 out of 6 tasks.**
MPROG was designed to handle both kinds of edits, those found in copy-edit datasets and those found in entail-edit datasets, which appear in most applications. MPROG only slightly underperforms SEPROG in the copy-edit domain, while the other baselines perform significantly worse; conversely, SEPROG performs poorly on entail-edit datasets. MPROG is therefore the first single model that provides strong performance in both domains while remaining computationally efficient and scalable to large edit batch sizes.
**Crossattention in Equation 1 is confusing**
We meant to indicate that $b(x)$ attends over each of the edit embeddings $H_e$ and aggregates them into a single embedding $h(x)$. Therefore, the keys and values are derived from $H_e$ and the query is $b(x)$. We will make this clearer in the notation.
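As a sketch of the intended computation (our notation here; the projection matrices $W_Q$, $W_K$, $W_V$ and the scaling factor are assumed and may differ slightly from Equation 1):

$$
h(x) = \mathrm{softmax}\!\left(\frac{\big(b(x)\,W_Q\big)\big(H_e W_K\big)^{\top}}{\sqrt{d}}\right) H_e W_V
$$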
**How to get embeddings of edit encoder. Does it use the mean pooling?**
Yes, following common practice, the embeddings are derived by mean pooling the final-layer token embeddings.
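For concreteness, a minimal sketch of the masked mean pooling we refer to (shapes and names are illustrative, not our exact code):

```python
# Mean pooling of final-layer token embeddings, ignoring padding tokens.
import torch

def mean_pool(last_hidden_state, attention_mask):
    # last_hidden_state: (batch, seq_len, d); attention_mask: (batch, seq_len)
    mask = attention_mask.unsqueeze(-1).float()        # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)     # (batch, d)
    counts = mask.sum(dim=1).clamp(min=1e-9)           # (batch, 1)
    return summed / counts                             # one embedding per edit
```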
**How to select N in H^e ? For different datasets, is N the same for all datasets?**
Here, $N$ is the number of edit descriptors in the edit batch. This is part of the dataset during inference and can vary with the application. We therefore tested different edit batch sizes and found that MPROG provides consistent performance across varying edit batch sizes on all edit benchmarks (Figure 3).
Note: we used $K$ to denote the edit batch size in Section 2; we will standardize this notation to avoid confusion.
**Not quite clear why the performance of FT is quite low? Is it full-parameter tuning?**
FT fine-tunes all parameters of the base model. This typically leads to overfitting to the edits as well as catastrophic forgetting on unrelated examples. Gradient-based approaches therefore try to overcome this by learning an optimal gradient update from data [Mitchell et al., ICLR 2022].
# Response to AC
**Comparing to MEMIT seems to be much more than "nice-to-have" and actually critical for comprehensively evaluating your approach since MEMIT is designed for editing large batches in an efficient manner. Do you agree with this point or see this differently (and if so, why)?**
We agree that MEMIT is designed for editing large edit batches.
MEMIT and similar methods such as MEND, ENN, and SLAG learn to modify the weights of the LLM to update its beliefs to match the edit descriptors. Our method instead leverages prompt generation to inform the LLM of the relevant belief without changing any parameters of the model. This allows MPROG to efficiently tailor the generated prompt to any edit from the edit dataset, depending on the input, for an LLM of any size.
We would gladly add a comparison with MEMIT to the revised version.
**Moreover, is there any support for your claims in the response saying that "the ... performance will drop with a larger number of edits. However, ours drops with a milder slope, ..."? It might be that I missed something, but is there such a comparison (even at a small-scale) for that?**
Yes, we compared the drop in performance of MPROG with that of the other baselines as the number of edits increases (from 1 to 128 edits) in lines 269-277 and Figure 3. We observe that MPROG's performance drop is consistently small compared to gradient-based methods, whose performance drops rapidly with a larger number of edits.
**Qualitative analysis / error analysis: This seems to be important as evaluation is done using basic metrics.**
We used the same standard evaluation metrics as previous works such as MEND, ENN, and SERAC. We also provided ablation analyses of the important novel architectural choices, an analysis of compute when scaling to large models, and performance over an increasing number of edits. Previous works such as MEND, ENN, and SERAC performed similar analyses, if not a subset of these.
Additionally, we would be happy to add to the Appendix specific examples where MPROG makes correct predictions while other baselines fail.