[In-context Autoencoder for Context Compression in a Large Language Model](https://arxiv.org/abs/2307.06945)
====
ICLR 2024

###### tags: `group meeting`

# Introduction

- **Problem**: Transformer-based LLMs struggle with long contexts due to the quadratic complexity of self-attention. Existing solutions reduce computational cost but degrade performance on long inputs.
- **Key Idea**: ICAE addresses long-context modeling through **context compression**, leveraging the LLM itself to encode a long input into a small set of **memory slots** without losing essential information.
- **Architecture**:
    - **Encoder**: a learnable encoder adapted from the LLM with LoRA, which compresses the long context into memory slots.
    - **Decoder**: the original LLM, conditioned on the memory slots for reconstruction, continuation, or responding to prompts.
- **Training**:
    - **Pretraining**: autoencoding (AE) and language modeling (LM) objectives teach effective compression.
    - **Fine-tuning**: instruction data teaches the memory slots to interact with diverse prompts for real-world applications.
- **Results & Contributions**:
    - Achieves **4× context compression**, improving efficiency while maintaining accuracy.
    - Enables better **handling of long contexts** with reduced memory and latency.
    - Offers insights into **LLM memorization**, drawing parallels with human memory mechanisms.
    - Complements existing **long-context solutions**, allowing further scalability and improvements.

# Method: In-Context Autoencoder

## Pretraining

- Autoencoding

![image](https://hackmd.io/_uploads/rJbrla_Kyl.png)

$$
\mathcal{L}_{\mathrm{AE}}=\max _{\widetilde{m_1}, \ldots, \widetilde{m_k}} P\left(\boldsymbol{c} \mid \widetilde{m_1}, \ldots, \widetilde{m_k} ; \Theta_{LLM}\right)=\max _{\Theta_{LoRA},\, e_m} P\left(\boldsymbol{c} \mid m_1 \ldots m_k ; \Theta_{LLM}, \Theta_{LoRA}, e_m\right)
$$

To indicate the autoencoding task, a special token "[AE]" is appended to $\left(\widetilde{m_1}, \ldots, \widetilde{m_k}\right)$ in the decoder.

- Text Continuation

![image](https://hackmd.io/_uploads/Hy79-TdKyg.png)

$$
\mathcal{L}_{\mathrm{LM}}=\max _{\widetilde{m_1}, \ldots, \widetilde{m_k}} P\left(\boldsymbol{o} \mid \widetilde{m_1}, \ldots, \widetilde{m_k} ; \Theta_{LLM}\right)=\max _{\Theta_{LoRA},\, e_m} P\left(\boldsymbol{o} \mid m_1 \ldots m_k ; \Theta_{LLM}, \Theta_{LoRA}, e_m\right)
$$

where $\boldsymbol{o}=\left(w_{L+1}, \ldots, w_{L+N}\right)$ denotes the continuation of the context $\boldsymbol{c}$.

## Instruction Finetuning

![image](https://hackmd.io/_uploads/S1TTzpdtye.png)

$$
\begin{aligned}
\mathcal{L}_{\mathrm{FT}} & =\max _{\widetilde{m_1}, \ldots, \widetilde{m_k}} P\left(r_1 \ldots r_n \mid \widetilde{m_1} \ldots \widetilde{m_k}, p_1 \ldots p_m ; \Theta_{LLM}\right) \\
& =\max _{\Theta_{LoRA},\, e_m} P\left(r_1 \ldots r_n \mid m_1 \ldots m_k, p_1 \ldots p_m ; \Theta_{LLM}, \Theta_{LoRA}, e_m\right)
\end{aligned}
$$

where $\left(p_1, \ldots, p_m\right)$ is the prompt and $\left(r_1, \ldots, r_n\right)$ is the response conditioned on the memory slots.

# Experiment

![image](https://hackmd.io/_uploads/HJVNX6OKkx.png)

![image](https://hackmd.io/_uploads/ryePmTdt1x.png)

![image](https://hackmd.io/_uploads/Sk-dEp_Fke.png)
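Before moving on to the second paper, here is a minimal PyTorch sketch of the ICAE data flow described above: learnable memory-slot embeddings ($e_m$) are appended to the context, the encoder runs over the concatenation, and the final hidden states at the slot positions become the compressed representation handed to the frozen decoder. This is an illustrative sketch, not the authors' implementation; the class name `ICAECompressor`, the toy sizes, and the plain `nn.TransformerEncoder` standing in for the LoRA-adapted LLaMA encoder are all assumptions.

```python
import torch
import torch.nn as nn


class ICAECompressor(nn.Module):
    """Toy stand-in for the ICAE encoder: compress a context into k memory slots."""

    def __init__(self, vocab_size=32000, d_model=512, num_slots=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # k learnable memory-slot embeddings (e_m in the paper's notation),
        # appended after the context tokens.
        self.memory_slots = nn.Parameter(torch.randn(num_slots, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        # Plain Transformer as a stand-in for the LoRA-adapted LLM encoder.
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.num_slots = num_slots

    def forward(self, context_ids):                        # (batch, L)
        ctx = self.embed(context_ids)                       # (batch, L, d)
        slots = self.memory_slots.expand(ctx.size(0), -1, -1)
        hidden = self.encoder(torch.cat([ctx, slots], dim=1))
        # Final hidden states at the slot positions = compressed context,
        # to be consumed by the frozen decoder LLM.
        return hidden[:, -self.num_slots:, :]               # (batch, k, d)


compressor = ICAECompressor()
context_ids = torch.randint(0, 32000, (1, 512))              # toy 512-token context
memory = compressor(context_ids)
print(memory.shape)                                          # torch.Size([1, 128, 512])
```

Mapping 512 context tokens to 128 slots mirrors the paper's 4× compression ratio; in the actual model, the slot states replace the raw context in the decoder's input for the AE, LM, and instruction-tuning objectives.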
[Learning to Compress Prompts with Gist Tokens](https://arxiv.org/abs/2304.08467)
====
NeurIPS 2023

# Introduction

- **Problem**: Re-encoding the same long prompt for every request is computationally expensive (self-attention is quadratic in prompt length).
- **Existing Solution**: finetuning or distillation removes the prompt, but requires retraining for each new prompt.
- **Proposed Solution**: **Gisting** compresses prompts into a few **gist tokens** that can be cached and reused.
- **Key Idea**: train the LM itself to predict gist tokens that compress the prompt.
- **Implementation**:
    - Insert **gist tokens** after the prompt.
    - Modify the **attention masks** so that tokens after the gist cannot attend to the prompt, forcing the gist tokens to carry its information.
- **Results**:
    - Up to **26×** prompt compression.
    - **40% FLOPs reduction** and **4.2% latency speedup**.
    - Saves **compute, memory, and storage**.
- **Applicable to**: decoder-only (LLaMA-7B) and encoder-decoder (FLAN-T5-XXL) models.

# Method: Gisting

![image](https://hackmd.io/_uploads/rytmNaFtJl.png)

We have an instruction-following dataset $\mathcal{D}=\left\{\left(t_i, x_i, y_i\right)\right\}_{i=1}^N$, where $t$ is a task encoded as a natural language prompt (e.g., "Translate this to French"), $x$ is an (optional) input for the task (e.g., "The cat"), and $y$ is the desired output (e.g., "Le chat"). Given a (usually pretrained) LM, the aim of instruction finetuning is to learn a distribution $p_{\mathrm{LM}}(y \mid t, x)$, typically by concatenating $t$ and $x$ and having the LM autoregressively predict $y$. At inference time, we can prompt the model with a novel task $t$ and input $x$, decoding from the model to obtain its prediction.

## A Context Distillation Perspective

$$
\mathcal{L}_{\mathrm{CD}}\left(p_{\mathrm{CD}}^t, t\right)=\mathbb{E}_x\left[D_{\mathrm{KL}}\left(p_{\mathrm{LM}}(y \mid t, x) \,\|\, p_{\mathrm{CD}}^t(y \mid x)\right)\right]
$$

## Gist Tokens

$$
\mathcal{L}_G\left(p_G, T\right)=\mathbb{E}_{t \sim T,\, x}\left[D_{\mathrm{KL}}\left(p_{\mathrm{LM}}(y \mid t, x) \,\|\, p_G(y \mid G(t), x)\right)\right]
$$

## Learning Gisting by Masking

![image](https://hackmd.io/_uploads/rk3XBTKF1g.png)

# Experiment

![image](https://hackmd.io/_uploads/BkIuIaFt1x.png)

![image](https://hackmd.io/_uploads/HJQJITYKyg.png)

![image](https://hackmd.io/_uploads/SyiELaYK1e.png)

# Comparison

Task
- ICAE: general text (context) compression
- Gist: task-specific prompt compression

Compression rate
- ICAE: $4\times$
- Gist: up to $26\times$

Latency and computation
- ICAE: $3.5\times$ speedup in the worst case, $7\times$ speedup in the best case.
- Gist: $40\%$ FLOPs reduction and $4.2\%$ wall-clock speedup (roughly a $1.044\times$ speedup).

Similarity
- Both achieve prompt compression by fine-tuning an LLM in a similar way.

Dissimilarity
- Gisting's compression is limited to transforming a task description into a few compressed task vectors (i.e., gist tokens); it does not compress the content itself.
- Gisting requires fine-tuning the LLM to perform compression, and the gist tokens it produces are intended for the fine-tuned LLM rather than for the original LLM, which is what ICAE targets with its memory slots.

LLM
- ICAE: decoder-only LLM (LLaMA)
- Gist: decoder-only LLM (LLaMA) and encoder-decoder LLM (FLAN-T5)

Objectives
- ICAE:
    - Pretraining: autoencoding and text continuation
    - Instruction fine-tuning
- Gist: instruction fine-tuning with gist masking only (see the mask sketch below).
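The "Learning Gisting by Masking" subsection above is summarized only by a figure, so here is a minimal sketch of the decoder-only gist mask it describes: after gist tokens are inserted between the task prompt and the input, the attention mask is modified so that tokens following the gist tokens cannot attend back to the prompt. The helper name `gist_causal_mask` and the boolean convention (`True` = may attend) are illustrative assumptions, not the paper's released code.

```python
import torch


def gist_causal_mask(task_len: int, num_gist: int, rest_len: int) -> torch.Tensor:
    """Boolean attention mask (True = position i may attend to position j)
    for a decoder-only LM with the token layout [task][gist][input/output].

    Ordinary causal masking applies everywhere, except that positions after
    the gist tokens are additionally blocked from attending to the task
    tokens, so the task information has to flow through the gist tokens.
    """
    total = task_len + num_gist + rest_len
    mask = torch.tril(torch.ones(total, total)).bool()   # standard causal mask
    gist_end = task_len + num_gist
    mask[gist_end:, :task_len] = False                    # hide the task from later tokens
    return mask


# Example: 5 task tokens, 2 gist tokens, 4 input/output tokens.
print(gist_causal_mask(5, 2, 4).int())
# The last four rows have zeros in the first five columns: the compressed
# task can only be read off the two gist tokens.
```

For the encoder-decoder (FLAN-T5) variant, the paper applies the same idea by modifying the encoder self-attention and decoder cross-attention masks so that post-gist positions cannot see the original task prompt.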