[In-context Autoencoder for Context Compression in a Large Language Model](https://arxiv.org/abs/2307.06945)
====
ICLR 2024
###### tags: `group meeting`
# Introduction
- **Problem**: Transformer-based LLMs struggle with long contexts due to self-attention complexity. Existing solutions reduce computational cost but degrade performance on long inputs.
- **Key Idea**: ICAE addresses long context modeling through **context compression**, leveraging an LLM to encode long inputs into a smaller set of **memory slots** without losing essential information.
- **Architecture**:
- **Encoder**: the LLM itself adapted with LoRA, which compresses a long context into a short sequence of **memory slots**.
- **Decoder**: the original, untouched LLM, conditioned on the memory slots for reconstruction, continuation, or task execution.
- **Training**:
- **Pretraining**: Uses autoencoding (AE) and language modeling (LM) objectives to learn effective compression.
- **Fine-tuning**: Enhances memory slot interactions with diverse prompts for real-world applications.
- **Results & Contributions**:
- Achieves **4× context compression**, improving efficiency while maintaining accuracy.
- Enables better **handling of long contexts** with reduced memory and latency.
- Offers insights into **LLM memorization**, drawing parallels with human memory mechanisms.
- Complements existing **long-context solutions**, allowing further scalability and improvements.
# Method: In-Context Autoencoder
## Pretraining
Given a context $\boldsymbol{c}=(w_1, \ldots, w_L)$, the encoder (the LLM with LoRA) appends $k$ learnable memory tokens $m_1 \ldots m_k$ (with embeddings $e_m$) and produces the memory slots $(\widetilde{m}_1, \ldots, \widetilde{m}_k)$, which the frozen decoder then consumes under two objectives.
- Autoencoding: reconstruct the original context $\boldsymbol{c}$ from the memory slots alone.

$$
\mathcal{L}_{\mathrm{AE}}=\max_{\widetilde{m}_1, \ldots, \widetilde{m}_k} P\left(\boldsymbol{c} \mid \widetilde{m}_1, \ldots, \widetilde{m}_k ; \Theta_{LLM}\right)=\max_{\Theta_{LoRA}, e_m} P\left(\boldsymbol{c} \mid m_1 \ldots m_k ; \Theta_{LLM}, \Theta_{LoRA}, e_m\right)
$$
To indicate the autoencoding task, a special token "[AE]" is appended to $(\widetilde{m}_1, \ldots, \widetilde{m}_k)$ in the decoder input.
- Text Continuation

$$
\mathcal{L}_{\mathrm{LM}}=\max_{\widetilde{m}_1, \ldots, \widetilde{m}_k} P\left(\boldsymbol{o} \mid \widetilde{m}_1, \ldots, \widetilde{m}_k ; \Theta_{LLM}\right)=\max_{\Theta_{LoRA}, e_m} P\left(\boldsymbol{o} \mid m_1 \ldots m_k ; \Theta_{LLM}, \Theta_{LoRA}, e_m\right)
$$
where $\boldsymbol{o}=\left(w_{L+1}, \ldots, w_{L+N}\right)$ denotes the continuation of context $\boldsymbol{c}$.
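A minimal PyTorch sketch of this setup, assuming Hugging Face `transformers` for the LLM and `peft` for LoRA; the checkpoint name, number of memory slots, and LoRA hyperparameters below are placeholders rather than the paper's exact configuration:
```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

MODEL_NAME = "huggyllama/llama-7b"   # placeholder checkpoint
K = 128                              # number of memory slots (placeholder)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
decoder = AutoModelForCausalLM.from_pretrained(MODEL_NAME)    # Theta_LLM, kept frozen
for p in decoder.parameters():
    p.requires_grad_(False)

# Encoder = the same LLM made adaptable with LoRA (Theta_LoRA)
encoder = get_peft_model(
    AutoModelForCausalLM.from_pretrained(MODEL_NAME),
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"]),
)

embed = decoder.get_input_embeddings()
# Learnable embeddings e_m of the k memory tokens m_1 ... m_k
e_m = nn.Parameter(0.02 * torch.randn(K, decoder.config.hidden_size))

def compress(context_ids: torch.LongTensor) -> torch.Tensor:
    """Map a (1, L) context c to k memory slots of shape (1, K, hidden)."""
    ctx = embed(context_ids)                       # (1, L, hidden)
    mem = e_m.unsqueeze(0)                         # (1, K, hidden)
    out = encoder(inputs_embeds=torch.cat([ctx, mem], dim=1),
                  output_hidden_states=True)
    # Last-layer hidden states at the memory-token positions are the slots.
    return out.hidden_states[-1][:, -K:, :]
```
For the AE objective the decoder is fed the memory slots plus the [AE] token embedding and reconstructs $\boldsymbol{c}$; for the LM objective it is fed the slots and predicts the continuation $\boldsymbol{o}$. In both cases only the LoRA weights and $e_m$ receive gradients.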
## Instruction Finetuning

$$
\begin{aligned} \mathcal{L}_{\mathrm{FT}} & =\max_{\widetilde{m}_1, \ldots, \widetilde{m}_k} P\left(r_1 \ldots r_n \mid \widetilde{m}_1, \ldots, \widetilde{m}_k, p_1 \ldots p_m ; \Theta_{LLM}\right) \\ & =\max_{\Theta_{LoRA}, e_m} P\left(r_1 \ldots r_n \mid m_1 \ldots m_k, p_1 \ldots p_m ; \Theta_{LLM}, \Theta_{LoRA}, e_m\right)\end{aligned}
$$
where $p_1 \ldots p_m$ is the prompt and $r_1 \ldots r_n$ the target response.
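Continuing the sketch above (reusing `compress`, `embed`, and `decoder`), the fine-tuning objective can be written as a standard causal-LM loss in which only the response tokens are supervised; the `-100` label convention is Hugging Face's, and the function below is an illustrative sketch rather than the paper's code:
```python
def finetune_loss(context_ids, prompt_ids, response_ids):
    """L_FT: P(r_1..r_n | m~_1..m~_k, p_1..p_m) with the frozen decoder."""
    slots = compress(context_ids)                                     # (1, K, hidden)
    inputs = torch.cat([slots, embed(prompt_ids), embed(response_ids)], dim=1)

    # Supervise only the response positions; -100 is ignored by the CE loss.
    labels = torch.full(inputs.shape[:2], -100, dtype=torch.long)
    labels[:, -response_ids.size(1):] = response_ids
    return decoder(inputs_embeds=inputs, labels=labels).loss
```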
# Experiment



[Learning to Compress Prompts with Gist Tokens](https://arxiv.org/abs/2304.08467)
====
NeurIPS 2023
# Introduction
- **Problem**: Encoding long prompts repeatedly is computationally expensive (quadratic complexity).
- **Existing Solution**: Finetuning or distillation, but these require retraining the model for each new prompt.
- **Proposed Solution**: **Gisting** – compresses prompts into **gist tokens** for efficient reuse.
- **Key Idea**: Train the LM to compress the prompt into the activations of a few gist tokens during instruction finetuning.
- **Implementation**:
- Insert **gist tokens** after the prompt.
- Modify **attention masks** to force compression.
- **Results**:
- **Up to 26×** prompt compression.
- **40% FLOPs reduction** & **4.2% latency speedup**.
- Saves **compute, memory, and storage**.
- **Applicable to**: **Decoder-only (LLaMA-7B) & Encoder-decoder (FLAN-T5-XXL) models**.
# Method: Gisting

We have an instruction-following dataset $\mathcal{D}=\left\{\left(t_i, x_i, y_i\right)\right\}_{i=1}^N$, where $t$ is a task encoded with a natural language prompt (e.g. Translate this to French), $x$ is an (optional) input for the task (e.g. The cat), and $y$ is the desired output (e.g. Le chat). Given a (usually pretrained) LM, the aim of instruction finetuning is to learn a distribution $p_{\mathrm{LM}}(y \mid t, x)$, typically by concatenating $t$ and $x$, then having the LM autoregressively predict $y$. At inference time, we can prompt the model with a novel task $t$ and input $x$, decoding from the model to obtain its prediction.
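A toy triple from $\mathcal{D}$ and the two input layouts, with an illustrative prompt template (not the paper's exact formatting):
```python
# One (t, x, y) example from D.
t = "Translate this to French."          # task instruction
x = "The cat"                            # optional input
y = "Le chat"                            # target output

# Standard instruction finetuning: condition on t and x, predict y.
plain = f"Instruction: {t}\nInput: {x}\nOutput: {y}"

# Gisting: k gist tokens are inserted after the prompt t, so that after
# masking (next subsection) y is predicted from G(t) and x, without t itself.
k = 2
gisted = f"Instruction: {t} {'<GIST> ' * k}\nInput: {x}\nOutput: {y}"
```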
## A Context Distillation Perspective
$$
\mathcal{L}_{\mathrm{CD}}\left(p_{\mathrm{CD}}^t, t\right)=\mathbb{E}_x\left[D_{\mathrm{KL}}\left(p_{\mathrm{LM}}(y \mid t, x) \| p_{\mathrm{CD}}^t(y \mid x)\right)\right]
$$
Here $p_{\mathrm{CD}}^t$ is a model distilled for a single task $t$ so that it imitates the prompted LM without seeing $t$; a separate model must be trained for every new task.
## Gist token
$$
\mathcal{L}_G\left(p_G, T\right)=\mathbb{E}_{t \sim T, x}\left[D_{\mathrm{KL}}\left(p_{\mathrm{LM}}(y \mid t, x) \| p_G(y \mid G(t), x)\right)\right]
$$
Gisting instead trains a single model $p_G$ over a distribution of tasks $T$, where $G(t)$ maps a task $t$ to a small set of gist tokens, so the model generalizes to unseen tasks at inference time.
## Learning Gisting by Masking
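Gist masking is implemented purely through attention masks: tokens that come after the gist tokens are prevented from attending to the prompt tokens before them, so all information about $t$ has to flow through the gist activations (for LLaMA this modifies the causal mask; for FLAN-T5 the encoder self-attention and decoder cross-attention masks are restricted analogously). A minimal PyTorch sketch for the decoder-only case, with a hypothetical gist token id and helper name:
```python
import torch

GIST_ID = 32100  # hypothetical id of the special gist token

def gist_attention_mask(input_ids: torch.LongTensor) -> torch.Tensor:
    """Return a (1, seq, seq) boolean mask for a decoder-only LM:
    entry [i, j] is True iff position i may attend to position j.
    Combines the causal mask with gist masking: queries after the gist
    span cannot look back past it, so the prompt is reachable only
    through the gist tokens."""
    seq = input_ids.size(-1)
    mask = torch.tril(torch.ones(seq, seq, dtype=torch.bool))   # causal

    gist_pos = (input_ids[0] == GIST_ID).nonzero().flatten()
    first, last = int(gist_pos[0]), int(gist_pos[-1])
    mask[last + 1:, :first] = False        # block attention to pre-gist prompt
    return mask.unsqueeze(0)

# Token layout: [prompt t | gist tokens | input x | output y]
ids = torch.tensor([[11, 12, GIST_ID, GIST_ID, 21, 22, 31]])
print(gist_attention_mask(ids).int())
```
Training is then ordinary instruction finetuning with this mask in place; afterwards the gist tokens' activations (keys/values) for a given prompt can be cached and reused across queries.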

# Experiment



# Comparison
| | ICAE | Gisting |
|---|---|---|
| Task | General text (context) compression | Task-specific prompt compression |
| Compression rate | 4× | Up to 26× |
| Latency & computation | 3.5× speedup (worst case) to 7× (best case) | 40% FLOPs reduction, 4.2% wall-time speedup (≈1.04×) |
| Base LLM | Decoder-only (LLaMA) | Decoder-only (LLaMA-7B) and encoder-decoder (FLAN-T5-XXL) |
| Objectives | Pretraining (autoencoding + text continuation), then instruction fine-tuning | Instruction fine-tuning with gist masking only |

**Similarity**
- Both compress prompts by fine-tuning an LLM in a similar way.

**Dissimilarity**
- Gisting only compresses a task description into a few compressed task vectors (the gist tokens); it does not compress the content itself.
- Gisting requires fine-tuning the LLM that consumes the compression: its gist tokens are intended for the fine-tuned LLM rather than the original one, whereas ICAE keeps the target (decoder) LLM untouched.