# Reviewer ZFmc

We thank the reviewer for their valuable comments. We are grateful that the reviewer found that MPROG has superior performance while requiring fewer resources and not modifying the weights of the LLM being edited. We respond to the questions and concerns below.

**References to relevant works are missing, such as those targeting MLP layers of transformers (ROME/MEMIT). Similarly, the experimental setting is missing comparisons to the methods described above.**

- MEMIT and ROME underperform SERAC on copy-edit datasets like ZSRE. *Not sure how to respond well*

**In addition, adding the edits as a prefix prompt during decoding has similarities to the FiD method, which is not referenced.**

FiD, like other RAG methods, uses a module to retrieve relevant datapoints from the training data to augment the decoder, which is similar to our idea of retrieving relevant edits via the edit selection module. We will add it to the related work where we discuss RAG.

**MPROG does not consistently outperform memory-based baselines, losing to SERAC on "copy edit" datasets.**

MPROG was designed to adapt to both kinds of edits, i.e., those found in copy-edit and entail-edit datasets, which together cover most applications. MPROG slightly underperforms SEPROG in the copy-edit domain, while the other baselines perform significantly worse; conversely, using SEPROG on entail-edit datasets leads to poor performance. MPROG is therefore the first single model that provides good performance in both domains while being computationally efficient.

**Is the split to "copy-edit"/"entail-edit" datasets based on previous work? Why are datasets that are based on KGs (zsRE/Wikidata5m) in different sets? Is it only because of the more challenging out-scope examples? If so, what explains the difference in the ES score between SERAC and MPROG?**

To the best of our knowledge, we are the first to make this clear distinction among editing benchmarks, based on whether the model must capture complex entailment relationships between the edit dataset, the input, and background knowledge from pre-training. Because of these harder requirements, it is harder to distinguish in-scope from out-of-scope inputs related to the edits. Since memory-based models do not leverage the LLM's background knowledge to make this distinction and respond appropriately, they show a lower ES score and fail to provide accurate predictions on the harder in-scope examples.

**Was the Wikipedia Text Generation task based on previous work? If not, will it be released with the paper?**

The Wikipedia Text Generation task was introduced in the MEND paper [Mitchell et al., ICLR 2022]. We thank the reviewer and will add the reference.

**Why are different backbone models used for different tasks (lines 236-243)? How are encoder only models (BERT) used if the focus is on encoder-decoder models?**

We use the same base models as the baselines in previous works for a fair comparison (line 236). However, our method makes no assumption about the underlying model structure, and the same encoder can be used across tasks. For FEVER, which uses BERT as the base model, we use BERT to encode the input $b(x)$ and insert the prompt from the edit selection module into the first layer of the BERT model that provides the predictions. We will clarify this to avoid confusion.

**Are all results across one seed? How many examples are in LoT/Wikidata5m? Are results statistically significant or reproduced between seeds?**

The reported results are averaged over 5 independent runs. We did not observe any significant variance in the results. In all benchmarks, the top-performing model's advantage is statistically significant (t-test, p < 0.05).
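For reference, a minimal sketch of such a significance check. It shows an unpaired two-sample t-test; whether the test is paired across seeds is a detail not stated above, and the per-run scores below are hypothetical placeholders, not our actual results.

```python
from scipy import stats

# Hypothetical per-run scores (5 independent runs each) for the top model and
# the runner-up on one benchmark; these numbers are placeholders, not ours.
top_model = [0.93, 0.94, 0.95, 0.93, 0.94]
runner_up = [0.90, 0.91, 0.89, 0.90, 0.91]

# Two-sample t-test; the difference is treated as significant at p < 0.05.
t_stat, p_value = stats.ttest_ind(top_model, runner_up)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, significant = {p_value < 0.05}")
```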
**Do any of the tasks require editing counterfactuals? Did you check if the models know the edited facts, pre-edit?**

The pre-trained models are typically trained on large corpora of knowledge from various real-world sources. The edited facts are typically negations or simple modifications of facts from the training data (drawn from popular sources such as Wikipedia) that render them false. While we expect the base models to hold correct beliefs about the unedited facts in most if not all cases, we did not explicitly test this.

**Other limitations, such as the potential impact of editing failures or an analysis of the method's errors are not discussed.**

Our framework can be applied to many domains involving text generation and therefore spans a wide range of applications. We briefly discuss sanity-checking the model and the edit datasets to avoid problems such as misinformation spread, as well as ensuring fairness and equity in the model's overall impact. While these are not exhaustive, we are happy to add other important limitations the reviewer can point to.

# Reviewer XTKo

We thank the reviewer for their valuable comments. We are happy that the reviewer appreciates the novelty of leveraging prompt generation as an effective state-of-the-art method for different kinds of edit datasets.

**Another simple baseline to add is to simply retrieve k-most semantically similar examples from the edit database and add those examples into prompts. This baseline can be unsupervised, which would be good to understand how much benefit the supervised training brings.**

We thank the reviewer for the suggestion. We would like to point out that the SEPROG paper considered a similar baseline, in which the most relevant example from the edit database is added as a prompt to the *base decoder*. This baseline vastly underperformed SEPROG, even on the copy-edit benchmarks, and the model used to choose the relevant prompt was trained in a supervised manner; we would expect the unsupervised version to perform even more poorly. We would be happy to add this baseline to the paper as well.

# Reviewer BuDA

**What would happen if the number of edits is extremely large, for example, on a scale of 5k? Would the cross attention then be very inefficient and slow? Also, when the number of edits increases, would the cross attention fail to encode the most relevant edits, and hurt the performance?**

The runtime complexity of MPROG scales *linearly* with the number of edit examples: the edit encoder generates an embedding for each edit example independently, and the attention layer of the edit selection module attends to each edit example with the input embedding $b(x)$ as the only query. This complexity is on par with SERAC, which independently scores each edit example for relevance to the input and selects the most relevant one. Therefore, while a larger edit set requires more compute, the cost scales linearly. Adding an extremely large number of edits does make it harder for the edit selection module to choose the most relevant edit, leading to an average decrease in performance. However, this phenomenon is observed for the other baselines as well and is especially pronounced in gradient-based methods. We appreciate the reviewer's comment and believe that a thorough study of the data limits of editing methods is a useful research question.
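To make the single-query cross-attention, and its linear cost in the number of edits $N$, concrete, here is a minimal PyTorch-style sketch; the function and projection names are illustrative assumptions, not taken from our implementation.

```python
import torch
import torch.nn.functional as F

def select_edit_signal(b_x, H_e, W_q, W_k, W_v):
    """Single-query cross-attention: the input embedding b(x) attends over
    the N edit embeddings H_e and aggregates them into one vector h(x).

    b_x : (d,)    input embedding from the base encoder
    H_e : (N, d)  one embedding per edit descriptor, encoded independently
    W_q, W_k, W_v : (d, d) projection matrices (illustrative names)
    """
    d = b_x.shape[0]
    q = b_x @ W_q                       # (d,)   the only query
    K = H_e @ W_k                       # (N, d) one key per edit
    V = H_e @ W_v                       # (N, d) one value per edit
    scores = (K @ q) / (d ** 0.5)       # (N,)   O(N * d) work in total
    attn = F.softmax(scores, dim=0)     # relevance weights over the N edits
    h_x = attn @ V                      # (d,)   aggregated edit information
    return h_x, attn
```

Each projection and the score computation touches every edit embedding exactly once, so compute and memory grow linearly in $N$ rather than quadratically.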
**What's the success rate of the scope classifier across the set of tasks you consider?**

| Model/Dataset | ZSRE | FEVER | Wikipedia | LoT  | Wikidata5m |
|---------------|------|-------|-----------|------|------------|
| SERAC         | 0.98 | 0.91  | 0.95      | 0.62 | 0.58       |
| MPROG         | 0.95 | 0.97  | 0.93      | 0.75 | 0.84       |

We observe that the success rate (accuracy) of scope classification is similar for SERAC and MPROG on the copy-edit datasets. On the entail-edit datasets we observe higher accuracy for MPROG. This can be attributed both to the more systematic approach of attending over all edit datapoints and to the end-to-end training of all components during generation. In contrast, SERAC trains the individual components for scope classification and generation separately.

# Reviewer uJyY

We thank the reviewer for the valuable feedback and are grateful that they appreciate the effectiveness and efficiency of our prompt-generation-based approach. We respond to their comments below.

**Some missing reference...**

We will add and characterize the missing reference pointed to by the reviewer.

**While cross attention is less costly than self-attention there doesn't seem to be an exploration of how this method scales with more edits. Both in terms of computation (more things to attend to) and in terms of performance (as more relevant edits are added can all the relevant information be squeezed into the attention vector bottleneck).**

The runtime complexity of MPROG scales *linearly* with the number of edit examples: the edit encoder generates an embedding for each edit example independently, and the attention layer of the edit selection module attends to each edit example with the input embedding $b(x)$ as the only query. This complexity is on par with SERAC, which independently scores each edit example for relevance to the input and selects the most relevant one. Therefore, while a larger edit set requires more compute, the cost scales linearly. Adding an extremely large number of edits does make it harder for the edit selection module to choose the most relevant edit, leading to an average decrease in performance. However, this phenomenon is observed for the other baselines as well and is especially pronounced in gradient-based methods. We appreciate the reviewer's comment and believe that a thorough study of the data limits of editing methods is a useful research question.

**Was there any work at looking at the interpretability of the generated parameters? For example, can the edits used be inferred from the generated parameters?**

We assume the reviewer meant the generated prompts. Since the generated prompts are real-valued embeddings, it is not straightforward to map them to interpretable tokens, and we did not observe any simple relation between the generated prompts and the chosen edits. However, we agree that a systematic study of interpretability for prompt-generation methods in general is an interesting research direction.

# Reviewer jYjK

We thank the reviewer for their valuable comments and for appreciating our experimental results and ablations. We address the comments below.

**The model is similar to the work "Structured Prompting: Scaling In-Context Learning to 1,000 Examples", although the target tasks differ. The backbone of this work had better be based on real large models with at least over 10B parameters. And it would be better to test the proposed method in more popular settings like the Structured Prompting work.**

The referenced work is a prompt generation method for few-shot learning in which the input attends over subsets of examples from downstream tasks to generate the prompt. That work adapts the LLM to a single downstream task where the demonstrations are fixed for the task. In contrast, we train MPROG to adapt to potentially different edit examples for each inference at test time, and we additionally use a classification module to ensure that the input is in scope. We thank the reviewer for pointing out the connection between this work and MPROG, and we will add it to the related work.

**The proposed method still cannot beat the baseline on 3 out of 6 tasks.**

MPROG was designed to adapt to both kinds of edits, i.e., those found in copy-edit and entail-edit datasets, which together cover most applications. MPROG slightly underperforms SEPROG in the copy-edit domain, while the other baselines perform significantly worse; conversely, using SEPROG on entail-edit datasets leads to poor performance. MPROG is therefore the first single model that provides good performance in both domains while being computationally efficient and adaptable to large batch sizes.

**Cross attention in Equation 1 is confusing.**

We meant to indicate that $b(x)$ attends over each of the edit embeddings $H_e$ and aggregates them into a single embedding $h(x)$. That is, the keys and values are derived from $H_e$ and the query is $b(x)$. We will make this clearer in the notation.

**How to get embeddings of edit encoder. Does it use mean pooling?**

Yes, as is common in practice, the embeddings are obtained by mean-pooling the final-layer embeddings.
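A minimal sketch of this pooling step, assuming the edit encoder exposes standard final-layer token states and a padding mask; the masking detail is our assumption for illustration, not something specified above.

```python
import torch

def mean_pool_edit_embeddings(last_hidden_state, attention_mask):
    """Mean-pool final-layer token states into one embedding per edit.

    last_hidden_state : (N, T, d) final-layer states for N edit descriptors
    attention_mask    : (N, T)    1 for real tokens, 0 for padding (assumed)
    """
    mask = attention_mask.unsqueeze(-1).float()        # (N, T, 1)
    summed = (last_hidden_state * mask).sum(dim=1)     # (N, d)
    counts = mask.sum(dim=1).clamp(min=1e-9)           # (N, 1)
    return summed / counts                             # (N, d) edit embeddings H_e
```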
**How to select N in $H^e$? For different datasets, is N the same for all datasets?**

Here, $N$ is the number of edit descriptors in the edit batch. It is part of the dataset at inference time and can vary depending on the application. We therefore tested different edit batch sizes and found that MPROG provides consistent performance across varying edit batch sizes on all edit benchmarks (Figure 3). Note: we used $K$ for the edit batch size in Section 2; we will standardize this notation to avoid confusion.

**Not quite clear why the performance of FT is quite low? Is it full-parameter tuning?**

FT fine-tunes all parameters of the base model. This typically leads to overfitting as well as catastrophic forgetting of unrelated examples. Gradient-based approaches try to overcome this by learning an optimal gradient update from the data [Mitchell et al., ICLR 2022].

# Response to AC

**Comparing to MEMIT seems to be much more than "nice-to-have" and actually critical for comprehensively evaluating your approach since MEMIT is designed for editing large batches in an efficient manner. Do you agree with this point or see this differently (and if so, why)?**

We agree that MEMIT is designed for editing large edit batches. MEMIT and similar methods such as MEND, ENN, and SLAG learn to modify the weights of the LLM to update its beliefs according to the edit descriptors. Our method introduces a novel way of leveraging prompt generation to inform the LLM of the relevant belief without changing any parameters of the model. This allows MPROG to tailor the generated prompt to any edit from the edit dataset, depending on the input, efficiently and for an LLM of any size. We would gladly add the comparison with MEMIT to the revised version.
**Moreover, is there any support for your claims in the response saying that "the ... performance will drop with a larger number of edits. However, ours drops with a milder slope, ..."? It might be that I missed something, but is there such a comparison (even at a small-scale) for that?**

Yes, we compared the drop in performance of MPROG against the other baselines as the number of edits increases (from 1 to 128 edits) in lines 269-277 and Figure 3. We observe that MPROG's drop in performance is consistently small compared to the gradient-based methods, whose performance drops rapidly with a larger number of edits.

**Qualitative analysis / error analysis: This seems to be important as evaluation is done using basic metrics.**

We use the same standard evaluation metrics as previous works such as MEND, ENN, and SERAC. We also provide ablation analyses of the important novel architectural choices, an analysis of the compute needed to scale to large models, and performance over an increasing number of edits. Previous works such as MEND, ENN, and SERAC performed similar analyses, if not a subset of them. Additionally, we would be happy to add to the Appendix specific examples where MPROG makes correct predictions while other baselines fail.
