[PDF](https://www.overleaf.com/project/64d25d3473604ee034e878b8)
## General answer
We would like to thank you for all your valuable feedback, both positive and negative, which we believe will help us to improve the quality of our work.
We are delighted that the reviewers (Sux9, v42b, GWpw) praised the simplicity of our method and noted its potential impact (Sux9) and the value of extending the context length (v42b, GWpw). Reviewer fPf8 noted the efficiency of FoT, and Sux9 appreciated the synthetic dictionary task.
Reviewers Sux9 and v42b raised important concerns about the scope and contributions of the paper, which we address below. Moreover, we would like to highlight important new experiments.
### New experiments with large models
In the period between the submission and the rebuttal, we secured additional compute resources, which allowed us to confirm that our method is useful for much larger models. We believe this significantly strengthens the paper. Specifically, we fine-tuned 3B and 7B OpenLLaMA models with our FoT objective. The resulting models show clear improvements on tasks requiring long context, and we have extended the contribution list accordingly.
Below we briefly summarise the properties of our models. We would be happy to provide more details here if needed; otherwise, we present them in an additional section of the paper. Specifically, our new models:
1. exhibit long-context capabilities on downstream tasks (see tables below),
2. retain the performance of the original models on short-context tasks,
3. are more compute and memory efficient at inference, compared to vanilla Transformers with the same effective context length,
4. have some context extrapolation capabilities: our models handle a 256K context length on the passkey retrieval task from [1] even though they were trained with an 8K context **(see Figure 1 in the attached pdf)**.
Ad 1. Our models exhibit performance gains from additional in-context few-shot examples on TREC question classification [2, 3] and WebQS question answering [4]. Moreover, our 3B model shows improvements in F1 score on Qasper (Question Answering over Scientific Research Papers) [5], which is part of SCROLLS [6].
| Context/Setup | TREC: FoT fine-tuned OpenLLaMA 3B | TREC: FoT fine-tuned OpenLLaMA 7B | WebQS: FoT fine-tuned OpenLLaMA 3B | WebQS: FoT fine-tuned OpenLLaMA 7B |
|---------|---------------------------------|---------------------------------|----------------------------------|----------------------------------|
| 2K | 67.0 | 63.2 | 21.2 | 25.5 |
| 4K | 71.6 | 72.7 | 21.4 | 26.4 |
| 6K | 72.9 | 74.9 | 22.2 | 27.2 |
| 8K | 73.3 | 75.9 | 22.4 | 27.7 |
For Qasper, we used the implementation from the Language Model Evaluation Harness and observed that our 3B model benefits from the increased context. Below we provide zero-shot results. Note that LongChat 7B [7] was instruction fine-tuned.
|Context length | OpenLLaMA 3B | FoT fine-tuned OpenLLaMA 3B | LLaMA 7B | LongChat 7B |
| - | - | - | - | - |
| 2K | 18.7 | 18.7 | 18.7 | 19.4 |
| 4K | - | 20.7 | - | 21.2 |
| 6K | - | 23.2 | - | 25.0 |
| 8K | - | 26.6 | - | 28.8 |
Ad 2. Our fine-tuned OpenLLaMA models maintain the performance on the standard suite of short-context tasks from Language Model Evaluation Harness (we use the same collection of tasks as OpenLLaMA and provide the average scores):
|Model | OpenLLaMA 3B | FoT fine-tuned OpenLLaMA 3B | OpenLLaMA 7B | FoT finetuned OpenLLaMA 7B|
|- | -| - |- | -|
|Average score | 0.53| 0.53 | 0.55 | 0.55|
### Scope and contributions of the paper
To clarify, *our paper focuses on the long-context capabilities*. We agree that the current writing is somewhat unclear. We have identified the following issues which might have caused the confusion:
- We used the term 'external memory', which we now change to 'additional context'.
- Memorizing Transformer, on which we base our method, is framed as a retrieval method. We now explicitly state in the related work section that, despite these similarities, our aim is different. Moreover, we amend the related work to include more long-context papers.
- We include new long-context tasks (see above). We keep the multi-doc experiments for illustrative purposes; however, we make explicit that the focus is on long context.
We thank the reviewers for pinpointing this clarity issue. We hope that the above changes will address the concerns. We would be happy to make further adjustments if the reviewers find it useful.
### Tuning and hyperparameters
Reviewers V42b and GWpw raised questions about hyperparameters. We note that some of the choices were educated guesses: due to the extreme computational cost, we could not perform a full hyperparameter search. For example, for the placement of the memory layer we relied on the findings of Memorizing Transformer. This information has been added as a limitation.
[1] A. Mohtashami, et al. Landmark Attention: Random-Access Infinite Context Length for Transformers.
[2] X. Li, et al. Learning Question Classifiers.
[3] E. Hovy, et al. Toward semantics-based answer pinpointing.
[4] J. Berant, et al. Semantic Parsing on Freebase from Question-Answer Pairs.
[5] P. Dasigi, et al. A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers.
[6] U. Shaham, et al. SCROLLS: Standardized CompaRison Over Long Language Sequences.
[7] D. Li, et al. How Long Can Open-Source LLMs Truly Promise on Context Length?
---
### Answer to Sux9 (Score 4)
We thank the reviewer for their constructive feedback.
We acknowledge the clarity issues raised by the reviewer. We note that we focus on *the long-context capabilities*; see also the general response. In our experiments, we tested FoT in both single-doc and multi-doc scenarios to assess its potential usefulness. We found that FoT improves perplexity on single, long documents (see Section 4.5), and we believe this makes it applicable to generic long-context language modeling, which is strongly confirmed by our new experiments with large models. At the time of writing, we decided to keep some multi-doc experiments, e.g., to illustrate the distraction issue, which already impairs the model's perplexity significantly at a relatively small scale (64 documents, see Figures 3, 7). However, in retrospect, we recognize this might be confusing. To amend this, we apply the steps described in the general answer; in particular, we indicate the long-context focus more explicitly.
We also note that there are practical applications where the multi-document setting is well motivated, in particular repository-level code generation. We hope that our method can be scaled up to handle entire code repositories in context (possibly ~1M tokens in large codebases), which we plan to attempt in future work.
**Regarding the 'external memory'**. By this, we understand anything outside the local context window, i.e., anything that is accessed additionally by memory attention layers. This clarification has been added to the paper. To make this name less confusing, we changed it to 'additional context'.
**Regarding 'positive and negative examples'**.
Our method is inspired by contrastive learning in the way data is presented to the model. We assume that the model is presented with distractions (possibly irrelevant documents) during training: the previous local context (from the current document) is mixed with contexts from other documents in the batch. Intuitively, this 'forces' the model to learn to differentiate the positives (tokens from the current document, which are likely to be useful) from the negatives (tokens from other documents, which are unlikely to be useful). We note that this is not standard contrastive learning, as we do not use a separate contrastive loss; we only use the standard language modeling loss. We have added a clarification to the paper.
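To make the data mixing concrete, here is a minimal sketch (PyTorch; the names `prev_local_ctx` and `crossbatch_d` are illustrative assumptions, not identifiers from our code):

```python
# Illustrative sketch only: mix the previous local context of each document
# (positives) with the previous local contexts of other documents in the batch
# (negatives/distractions). Training then uses the standard LM loss only.
import torch

def build_additional_context(prev_local_ctx: torch.Tensor, crossbatch_d: int) -> torch.Tensor:
    # prev_local_ctx: (batch, ctx_len, dim) representations of the previous
    # local window of every document in the batch.
    batch = prev_local_ctx.size(0)
    mixed_contexts = []
    for i in range(batch):
        positives = prev_local_ctx[i].unsqueeze(0)                # (1, ctx_len, dim)
        other_ids = [j for j in range(batch) if j != i][: crossbatch_d - 1]
        negatives = prev_local_ctx[other_ids]                     # (d-1, ctx_len, dim)
        mixed = torch.cat([positives, negatives], dim=0)          # (d, ctx_len, dim)
        mixed_contexts.append(mixed.flatten(0, 1))                # (d * ctx_len, dim)
    # Memory attention layers attend to this mixed additional context.
    return torch.stack(mixed_contexts)                            # (batch, d * ctx_len, dim)
```

In this setting, the only training signal remains next-token prediction; the contrast between positives and negatives arises solely from what the memory layers are exposed to.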
**Regarding Table 2**. We agree that it is not the best way to compare the models, but we were constrained by the available pre-trained models (as they use different tokenizers, comparing perplexities is not informative). We only aim to show that a given model achieves better accuracy with more context available, rather than to compare token-level accuracies between models with different tokenizers, which would be inconclusive. A comment has been added to the caption.
**Regarding [1, 2]**
We first note that our focus is different: as now clarified, we aim for long context, while these papers focus on retrieval from a large knowledge database. We have added a clarification to the related work section. On the technical level, [1] combines two probability distributions to predict the next token: one given by the model logits and the other created from retrieved pairs of (embedding, next token). Meanwhile, we extend the model context in a subset of attention layers, which potentially allows for reasoning within this extended context.
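For concreteness, kNN-LM [1] predicts the next token by interpolating the parametric LM distribution with a non-parametric one built from the retrieved (embedding, next-token) pairs (our paraphrase of [1], with $f(x)$ the context embedding, $\mathcal{N}(x)$ the retrieved neighbours, and $d(\cdot,\cdot)$ a distance):
$$
p(y \mid x) = \lambda\, p_{\mathrm{kNN}}(y \mid x) + (1-\lambda)\, p_{\mathrm{LM}}(y \mid x),
\qquad
p_{\mathrm{kNN}}(y \mid x) \propto \sum_{(k_i, v_i) \in \mathcal{N}(x)} \mathbf{1}[v_i = y]\, \exp\bigl(-d(k_i, f(x))\bigr).
$$
FoT, in contrast, leaves the output distribution untouched and only enlarges the set of keys and values visible to the memory attention layers.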
We thank the Reviewer for raising the topic of the usefulness of other documents in the batch. It was observed that nearest-neighbor language models (the kNN-LM architecture) display almost linear perplexity gains with respect to the datastore size [3]. Due to practical limitations, in this work we only embed ~100K tokens during training, and documents in the batch are randomly sampled from a large corpus, which means it is unlikely that they are related to each other. Therefore, we should not expect significant perplexity gains for kNN-LM in this setting either, as the training batch comprises approximately 0.1% of the datastore. Empirically, we show that extending the model's context length with attention instead of kNN leads to a perplexity increase, due to the aforementioned distraction issue. To the best of our knowledge, the distraction (perplexity increase) resulting from increasing the attention context length has not been studied before.
We agree with the Reviewer that TRIME [2] proposes a very similar objective inspired by contrastive learning, which is already mentioned in the related work section. The main difference is architectural: instead of attending to additional tokens in the memory layer, as FoT does, TRIME combines the probability distributions of the dense model and the retrieval database in the final layer, similarly to [1]. Moreover, [2] focuses on retrieval from large databases, whereas our experiments mostly focus on long context. We have included this discussion in the related work section of the updated paper.
**Regarding the distraction issue at inference time**: giving the model multiple unrelated documents is an extreme case. The distraction issue could also occur in single-doc scenarios for long documents consisting of several chapters. Please note that, besides alleviating the distraction issue, FoT allows training long-context models using short-context data and improves performance in single-doc cases (see Section 4.5).
[1] Khandelwal et al., Generalization through Memorization: Nearest Neighbor Language Models, 2019
[2] Zhong et al., Training Language Models with Memory Augmentation, 2022
If our responses have adequately addressed your concerns, we kindly request your support and ask you to consider improving your score. If you have any further concerns or additional points to raise, we are eager to address them. Your feedback is valuable in enhancing the quality and impact of our research.
---
### Answer to V42b (Score 4)
We thank the Reviewer for their thoughtful feedback. We acknowledge some deficiencies in the presentation. We focus on the long-context capabilities; the appropriate clarification is described in the general answer. In more detail, we aim for a single-stage method that can incorporate a large number of tokens directly in the model context (kNN attention can be used to approximate full attention). We observed that naively increasing the context length gives worse results, which is also confirmed, e.g., in [1]. This does not contradict the fact that retrieval methods benefit from additional documents: they typically use a two-stage approach, with the retrieval part doing the hard job of extracting only a relatively small number of tokens, which are then efficiently processed within the standard context length [2]. We include this clarification in the paper. We also make a number of smaller adjustments to the paper, which hopefully make it easier to follow. If the reviewer sees any specific issue, we'd be happy to address it.
We also acknowledge some issues with the clarity of the method description. We outline the general idea, and we admit that it might be hard to infer the details from it. We think presenting the details in the text would be quite cumbersome. To amend the situation, we plan to include a shortened version of the pseudocode from the Appendix in the main body. The pseudocode was found helpful by Reviewer fPf8, so we hope it will satisfactorily complement the description. If you see any other parts which require clarification, please let us know.
Below we address the questions:
1. As noted in the general response, due to the limited computational resources, we could not perform full hyperparameter sweeps. In particular, for the memory layer, we have followed the choice of Memorizing Transformers [3].
2. Regarding the improvements in existing language models and benchmarking on additional long-context tasks, as noted in the general response, we present 3B and 7B models based on OpenLLaMA along with results on Qasper (SCROLLS benchmark), TREC and WebQS where we show improved performance when the model is provided with additional context.
3. Regarding the performance on the synthetic task: please note that the model is trained with a much shorter context than it is evaluated on, which makes the evaluation out of distribution.
4. Regarding the content of the memory in Section 4.3 during evaluation - this is a single-doc memory; that is, in the additional context we only store keys and values belonging to the currently processed document.
5. The distance measure for kNN is inner product.
6. We have tested values of $k\in \{32, 64, 128\}$ and observed small differences in performance. We add this information to the Appendix.
Thank you for pointing out [4]; we add the following description to the related work section:
> CONTRACLM [4] applies contrastive losses at both the token and sequence levels during training to promote more uniformly distributed, isotropic representations. It is shown to enhance the discrimination of representations on textual semantic similarity benchmarks. While CONTRACLM focuses on improving the general expressiveness of representations, our work introduces contrastive-inspired techniques designed specifically for training the memory attention mechanism to handle longer context lengths. Nonetheless, exploring other contrastive learning objectives could be beneficial for further improving the memory key structure in future work.
[1] Nelson F. Liu et al., Lost in the Middle: How Language Models Use Long Contexts
[2] Sebastian Borgeaud et al., Improving language models by retrieving from trillions of tokens
[3] Yuhuai Wu, Markus N. Rabe, DeLesley Hutchins, Christian Szegedy. Memorizing Transformers.
[4] Jain et al., CONTRACLM: Contrastive Learning For Causal Language Model, 2022
We again thank the reviewer for raising important issues. We hope that our answers are satisfactory; if not, we'd be happy to provide more details. Otherwise, we'd appreciate it if the reviewer reconsidered the final score of our submission.
---
### Answer to fPf8 (Score 8)
We thank the reviewer for the encouraging review.
The description of the method outlines the general idea, and we admit that it might be hard to infer the details from it. We think presenting the details in the text would be quite cumbersome; thus, we plan to include a shortened version of the code in the main body of the paper. We hope this will be satisfactory.
**Regarding the kNN**, we consider kNN to be an approximation of the full dense attention. From this perspective, there is no inconsistency. However, in practice, the approximation errors may impact performance; we did not observe this in our experimental regime and leave a proper study to future work. We also note that our fine-tuned versions of the OpenLLaMA models use full dense attention instead of kNN, which we find performant and efficient enough. Additionally, we note that using kNN opens the possibility of using fast approximate indices (e.g., as implemented in Faiss), which might be necessary for scaling the method. We have added this to future work.
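To illustrate this direction (a hypothetical sketch, not part of our implementation; the index type and parameters are arbitrary examples), the exact inner-product lookup over memory keys could be replaced by an approximate Faiss index:

```python
# Hypothetical sketch: replacing exact inner-product search over memory keys
# with an approximate Faiss index. Sizes and parameters are arbitrary examples.
import faiss
import numpy as np

dim, k = 128, 32
memory_keys = np.random.randn(100_000, dim).astype("float32")  # stand-in for stored keys
queries = np.random.randn(16, dim).astype("float32")           # stand-in for current queries

# Exact inner-product search (what dense kNN lookup corresponds to):
exact = faiss.IndexFlatIP(dim)
exact.add(memory_keys)
_, exact_ids = exact.search(queries, k)

# Approximate alternative: an IVF index trades a little recall for speed.
quantizer = faiss.IndexFlatIP(dim)
approx = faiss.IndexIVFFlat(quantizer, dim, 1024, faiss.METRIC_INNER_PRODUCT)
approx.train(memory_keys)   # cluster the key space
approx.add(memory_keys)
approx.nprobe = 16          # number of clusters probed per query
_, approx_ids = approx.search(queries, k)
```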
**Regarding the comparison with Parallel Context Window**, we add the following text to related work.
> Parallel Context Window introduces a method for extending the context of language models without training. They achieve this by embedding several context windows independently in parallel and allowing only a subset of tokens to attend to all windows. On the other hand, we fine-tune existing models and allow all tokens to attend to all previous tokens but only in a subset of layers. Our method also allows us to improve the structure of $(key, value)$ space of existing models.
Regarding the questions:
1. We observed that it is important to have at least one positive example that brings additional related information to memory layers (for example, previous local context window $C_{prev}$). Otherwise, the model may learn to ignore memory layers.
2. For each query in a memory layer, we take the $k$ most matching keys from memory and add them to the attention for this query. That is, each query attends to all keys that precede it in the local context and to its own $k$ most matching keys from the memory (see the sketch after this list). In the non-kNN approach, each query attends to the whole memory and to all keys that precede it in the local context. To find the $k$ most matching keys we use the inner product. Note that in the models presented in the paper, we remove positional encodings in memory layers.
3. We have managed to fine-tune OpenLLaMA models so that they maintain the performance of the base models on short-context Language Model Evaluation Harness tasks and show improvements on long-context ones. For details, please refer to the table from the general response.
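The sketch below (a simplified, single-head PyTorch illustration under our own assumptions, not our exact implementation) shows the kNN variant described in point 2: each query attends jointly to the causal local keys and to its own top-$k$ memory keys selected by inner product, with no positional encodings on the memory side.

```python
# Simplified single-head sketch of kNN memory attention (illustrative only).
import torch
import torch.nn.functional as F

def memory_attention(q, local_k, local_v, mem_k, mem_v, k_top=32):
    # q, local_k, local_v: (n, dim) for the current local window;
    # mem_k, mem_v: (m, dim) memory keys/values, stored without positional encodings.
    n, dim = q.shape
    scale = dim ** -0.5

    # Causal attention over the local context: query i sees local keys 0..i.
    scores_local = (q @ local_k.T) * scale                        # (n, n)
    causal_mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    scores_local = scores_local.masked_fill(causal_mask, float("-inf"))

    # Each query retrieves its own top-k memory keys by inner product.
    scores_mem = (q @ mem_k.T) * scale                            # (n, m)
    top_scores, top_idx = scores_mem.topk(k_top, dim=-1)          # (n, k_top)

    # One softmax over local and retrieved memory scores together.
    weights = F.softmax(torch.cat([scores_local, top_scores], dim=-1), dim=-1)
    w_local, w_mem = weights[:, :n], weights[:, n:]               # (n, n), (n, k_top)

    out = w_local @ local_v                                       # (n, dim)
    out = out + (w_mem.unsqueeze(-1) * mem_v[top_idx]).sum(dim=1)
    return out
```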
We again thank you for an encouraging review. Should you have any further questions or concerns, we'd be happy to answer them. We kindly ask you to support our paper.
---
### Answer to GWpw (Score 7)
We thank the reviewer for a thoughtful review.
**Regarding the computational cost**. We thank you for raising this important concern, which we have added to the limitations section. We also note two factors that mitigate this issue to some extent. First, the increased cost is incurred only in the memory layer. Second, FoT exhibits some context extrapolation, which might allow using a smaller $n$ in training. Moreover, relatively small values of $d$ might be sufficient: we managed to fine-tune the OpenLLaMA models using fewer than 8 distractions.
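As a rough back-of-envelope estimate (a simplification assuming a single memory layer and ignoring constants; $m$ is our shorthand for the number of additional-context tokens): when processing a local window of $n$ tokens,
$$
\underbrace{O\bigl(n\,(n+m)\bigr)}_{\text{memory layer}} \quad \text{vs.} \quad \underbrace{O\bigl(n^2\bigr)}_{\text{each remaining layer}},
$$
whereas a vanilla Transformer with the same effective context length $n+m$ pays $O\bigl(n\,(n+m)\bigr)$ in every layer.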
Regarding the questions:
**Q1:** We thank the Reviewer for proposing an interesting baseline. To answer the question, we fine-tune a vanilla OpenLLaMA model on sequences of length $4096$ (the original sequence length is $2048$), which we consider a standard “data packing” baseline, and compare it to a FoT model trained on exactly the same data, packed in the same way, for 1B tokens. For clarity, we outline the following architectural differences between the baseline and FoT:
* Additional context beyond $2048$ tokens is used in just a subset of layers
* FoT does not use positional encodings in memory layers beyond its original context window ($2048$)
The results are as follows:
|Context/Setup | TREC: baseline $\pm 1.0$ $~~$ | TREC: FoT $\pm 1.0$ $~~$ | WebQS: baseline $\pm 0.1$ $~~$ | WebQS: FoT $\pm 0.1$ $~~$ |
| - | - | - | - | - |
| 2K | 52.8 | 55.6 | 20.7 | 20.8 |
| 4K | 57.2 | 60.9 | 18.7 | 21.0 |
We observe accuracy improvements when more few-shot demonstrations are provided in the extended context (from the 2K used by OpenLLaMA to the 4K used in our fine-tuning). On TREC, the gains from the additional context are significant for both models, and our method shows better data efficiency than the baseline.
**Q2:** Starting with a large $d$ (crossbatch dimension) may slow down training and result in the memory layer being ignored by the model. We have not seen such problems when starting with a smaller value of $d\leq8$. See the plot with the training loss comparison in the attached pdf.
**Q3:** Due to the limited resources, we have followed the choice of Memorizing Transformer in picking the memory layer. We have also seen some additional gains from using multiple memory layers in our FoT fine-tuned OpenLLaMA models.
If your concerns have been sufficiently addressed in our responses, we humbly seek your support for the paper. Should you have any further concerns or additional points to raise, we are eager to address them.