# 進度會議
###### tags: `progress`
# 6/20

Inference output
**example 1**
**question**: What government position was held by the woman who portrayed Corliss Archer in the film Kiss and Tell?
**context**: ... As an adult, she was named United States ambassador to Ghana and to Czechoslovakia and also served as Chief of Protocol of the United States. Janet Marie Waldo (February 4, 1920 – June 12, 2016) was an ...
**predicted output**: United States ambassador to Ghana and to Czechoslovakia
**ground truth**: Chief of Protocol
**example 2**
**question**: What science fantasy young adult series, told in first person, has a set of companion books narrating the stories of enslaved worlds and alien species?
**context**: The Andre Norton Award for Young Adult Science Fiction and Fantasy is an annual award presented by the Science Fiction....
**predicted output**: Animorphs
**ground truth**: Animorphs

`test_set_size`: 2705
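For reference, a minimal SQuAD-style EM/F1 scorer for predictions like the ones above (my own sketch, not the project's evaluation script):

```python
import re
from collections import Counter

def normalize(s: str) -> str:
    """Lowercase, drop punctuation and articles (SQuAD-style answer normalization)."""
    s = re.sub(r"[^\w\s]", " ", s.lower())
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def em_f1(prediction: str, ground_truth: str):
    pred, gold = normalize(prediction).split(), normalize(ground_truth).split()
    em = float(pred == gold)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return em, 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return em, 2 * precision * recall / (precision + recall)

# Example 1 above scores EM = 0, F1 = 0; example 2 scores EM = 1, F1 = 1.
print(em_f1("United States ambassador to Ghana and to Czechoslovakia", "Chief of Protocol"))
print(em_f1("Animorphs", "Animorphs"))
```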
# 6/13
**Training Outcome**
<table>
<tr>
<td><img src="https://hackmd.io/_uploads/rJgohwKXgx.png" width="120px"></td>
<td><img src="https://hackmd.io/_uploads/rkgs3PFmgx.png" width="120px"></td>
<td><img src="https://hackmd.io/_uploads/rygshwKXex.png" width="120px"></td>
</tr>
<tr>
<td style="text-align:center;"><small>Step 0</small></td>
<td style="text-align:center;"><small>Step 49,500</small></td>
<td style="text-align:center;"><small>Step 132,800</small></td>
</tr>
</table>
**Dataset:** Natural Questions (with document-level context, ~2,500 examples)
**Epochs:** 50
The dot plots illustrate the training dynamics of contrastive learning.
Over time, the red points (anchors) and green points (positives) move closer together, indicating that the model is learning to align semantically relevant representations in the embedding space.

**Dataset:** HotpotQA (90,447 examples)
**Epochs:** 2
**Training time per epoch:** ~17 hours
**Loss curves:**
- **Loss 1:** Contrastive loss + cross-entropy loss
- **Loss 2:** Cross-entropy loss only
The model trained with additional contrastive supervision (Loss 1) converges faster and reaches a lower final loss compared to the baseline trained with only cross-entropy (Loss 2).
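A minimal sketch of the Loss 1 objective as I understand it: a weighted sum of an InfoNCE-style contrastive term and the usual next-token cross-entropy (the weight `alpha` and temperature `tau` are assumptions):

```python
import torch
import torch.nn.functional as F

def combined_loss(anchor_emb, positive_emb, negative_emb, logits, labels, alpha=0.5, tau=0.05):
    """Loss 1 = alpha * contrastive + (1 - alpha) * cross-entropy (sketch; alpha/tau assumed)."""
    # Contrastive term: pull the anchor toward the positive, push it away from the negative.
    sim_pos = F.cosine_similarity(anchor_emb, positive_emb, dim=-1) / tau
    sim_neg = F.cosine_similarity(anchor_emb, negative_emb, dim=-1) / tau
    contrastive = -torch.log(sim_pos.exp() / (sim_pos.exp() + sim_neg.exp())).mean()
    # Standard cross-entropy on the decoder's next-token logits.
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100)
    return alpha * contrastive + (1 - alpha) * ce
```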
# 6/6
## Contrastive learning loss


**input**: [query] + [relevant context] + [irrelevant context]
**output**: last layer's representation
**epochs**: 50
**data num per epoch**: 2600
**Adjustment**: copy and freeze not only the LoRA but also the **embedder**.
**current epoch**: 16
**loss**: 0.04
The contrastive loss on embeddings follows SimCSE: Simple Contrastive Learning of Sentence Embeddings, where each sentence's representation is pooled into a single token-length embedding.
Observations:
A **<SEP>** special token should be added between different contexts.
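A minimal sketch of the pooling step mentioned above, assuming masked mean pooling over the last layer (SimCSE itself often uses the [CLS] token; the choice here is an assumption about this setup):

```python
import torch

def pool_last_hidden(last_hidden_state, attention_mask):
    """Collapse each sentence's last-layer token representations into one embedding
    of a single token's length via masked mean pooling (sketch)."""
    mask = attention_mask.unsqueeze(-1).float()            # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)         # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts                                 # one vector per sentence
```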
# 5/29
**RuntimeError**: one of the variables needed for gradient computation has been modified by an <font color="#1936C9">**inplace operation**</font> : [torch.cuda.FloatTensor [512, 1024]], which is output 0 of AsStridedBackward0, is at version 5; expected version 3 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
Here, the 512 in [512, 1024] is the LoRA r value.

**Possible cause:**
On each forward pass, the anchor memory slots are generated first, and then the frozen LoRA is loaded to generate the positive/negative samples. Swapping the adapter before computing gradients acts as an in-place operation and corrupts the computation graph.
**Computation graph:**
Each forward pass builds a computation graph that records the operations between tensors (e.g., addition, multiplication, convolution) and how to compute their gradients.
**Effect of in-place operations on the computation graph:**
Backpropagation needs the original values of the intermediate results from the forward pass. If an in-place operation changes those intermediates, the data that the graph's nodes depend on has already changed, and the gradients can no longer be computed correctly.
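A minimal PyTorch reproduction of this failure mode (my own sketch, not the project's code): a weight that autograd saved for backward is overwritten in place before `backward()` runs, analogous to hot-swapping adapter weights between the anchor forward pass and the loss computation.

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 4, requires_grad=True)
w = torch.nn.Parameter(torch.randn(3, 4))

y = F.linear(x, w)              # autograd saves w (needed later for dL/dx)
with torch.no_grad():
    w.copy_(torch.randn(3, 4))  # "swap the adapter": in-place write bumps w's version counter

y.sum().backward()              # RuntimeError: ... modified by an inplace operation

# Fix: finish backward() for the anchor pass before swapping weights,
# or keep separate (frozen) modules instead of overwriting parameters in place.
```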
# 5/16
[Focused Transformer](https://arxiv.org/pdf/2307.03170)
[Contrastive Instruction Tuning](https://arxiv.org/pdf/2402.11138)
[Self-Supervised Contrastive Learning with Adversarial Perturbations for Defending Word Substitution-based Attacks](https://aclanthology.org/2022.findings-naacl.8/)
[Adversarial Training with Contrastive Learning in NLP](https://arxiv.org/abs/2109.09075)
[A Survey on Data Augmentation for Text Classification](https://arxiv.org/abs/2107.03158)
[FreeLB: Enhanced Adversarial Training for Natural Language Understanding](https://arxiv.org/abs/1909.11764)
## Defect
| pretrain stage | CL stage |
| --- | --- |
| <img src="https://hackmd.io/_uploads/rkb2C4ixlg.png" width="300px"> | <img src="https://hackmd.io/_uploads/By1rfrilxg.png" width="300px"> |
- The compression objectives of the pretrain stage and the contrastive learning (CL) stage are inconsistent.
In the pretrain stage the goal is to compress the whole context, while in the CL stage only part of the context is compressed.
Q: Why would the different stages affect each other?
In the CL stage the positive/negative samples should be fully compressed while the anchor input is only partially compressed; since the input composition and compression rate differ, the same LoRA adapter should not be shared. But separate adapters would require loading an extra model for the pretrained LoRA, which is impractical.
- The setup does not match the standard contrastive learning architecture
| Original CL | Our CL |
| --- | --- |
| <img src="https://hackmd.io/_uploads/BJxBCBN-xe.png" width="100px"> | <img src="https://hackmd.io/_uploads/By1rfrilxg.png" width="300px"> |
In contrastive learning, a positive sample should be a transformation of the anchor, with the same architecture. In our work, however, the anchor and the positive sample have different architectures (the positive lacks the query), so the training procedure should be adjusted as shown below.
| Original | Revised |
| --- | --- |
| <img src="https://hackmd.io/_uploads/By1rfrilxg.png" width="300px"> | <img src="https://hackmd.io/_uploads/ry-kqOV-gx.png" width="300px"> |
1. The positive sample is produced by transforming the anchor input; the transformation can be an adversarial perturbation applied in the embedding space (see the sketch after this list).
2. Negative samples can be other examples from the same dataset. How to find hard negative samples in NLP still needs a survey.
3. The earlier pretraining stage is no longer needed; as argued above, its objective is inconsistent with this work.
4. In the min-max objective of adversarial training, the loss is the cross-entropy between the response and the ground truth rather than a contrastive loss between compressions, which removes the need for a separate instruction finetuning stage.
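A minimal sketch of point 1, assuming an FGSM-style perturbation of the compressor's input embeddings; the `inputs_embeds`/`labels` call follows the Hugging Face convention, and everything here is an illustration rather than the finalized method:

```python
import torch

def adversarial_positive(model, inputs_embeds, labels, epsilon=1e-3):
    """Build a 'positive' view of the anchor by nudging its input embeddings along the
    gradient of the task loss (FGSM-style, sketch only)."""
    embeds = inputs_embeds.detach().clone().requires_grad_(True)
    loss = model(inputs_embeds=embeds, labels=labels).loss  # HF-style causal-LM call (assumption)
    grad, = torch.autograd.grad(loss, embeds)
    return (embeds + epsilon * grad.sign()).detach()
```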
# 5/9
Pretraining LLaMA 3.2 3B on the Pile dataset (in progress).
Trained LoRA with <font color=#FF6600>r=128</font> and <font color=#0000FF>r=512</font> separately.
| Model | Dataset | LoRA r | Step | Loss |
|------------------|--------------------|--------|---------|--------|
| LLaMA 3.2 3B | Pile (in progress) | 128 | 500000 | 0.0713 |
| LLaMA 3.2 3B | Pile (in progress) | 512 | 500000 | 0.0349 |

## pretrain inference

`original text`
"I don't have a favorite condiment as I don't consume food or condiments. However, I can tell you that many people enjoy condiments like ketchup, mayonnaise, mustard, soy sauce, hot sauce, and ranch dressing, among others. The favorite condiment can vary greatly from person to person, depending on their taste preferences and cultural influences."
`reconstructed text`
"I don't have a favorite condiment as I don't consume food or condiments. However, I can tell you that many people enjoy condiments like ketchup, mayonnaise, mustard, soy sauce, hot sauce, and ranch dressing, among others. The favorite condiment can vary greatly from person to person, depending on their taste preferences and cultural influences."
## dataset for contrastive learning
Natural Questions with retrieval
Each datum contains (see the sketch below):
- one question
- one answer
- one gold retrieved passage that contains the answer
- nine highly relevant retrieved passages that do not contain the answer
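A minimal sketch of how one such datum could be represented (the field names are my assumptions, not the dataset's actual schema):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ContrastiveQADatum:
    """One Natural Questions example prepared for contrastive learning (sketch)."""
    question: str
    answer: str
    gold_passage: str          # retrieved passage that contains the answer (positive)
    hard_negatives: List[str]  # nine highly relevant passages without the answer (negatives)
```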

# 5/2
## Pretraining
Pretraining LLaMA 3.2 3B on the Pile dataset (in progress).
Trained LoRA with <font color=#FF6600>r=128</font> and <font color=#0000FF>r=512</font> separately.
| Model | Dataset | LoRA r | Step | Loss |
|------------------|--------------------|--------|---------|--------|
| LLaMA 3.2 3B | Pile (in progress) | 128 | 198000 | 0.169 |
| LLaMA 3.2 3B | Pile (in progress) | 512 | 186000 | 0.106 |

Pretrain loss reported in ICAE:
| Model | Pretrain Loss |
|---------------|---------------|
| LLaMA-7B | 0.017 |
| LLaMA-2-7B | 0.009 |
| LLaMA-2-13B | 0.004 |
The current convergence is similar to other papers' reproductions of ICAE's training process.

| Loss source | Compression Rate | Batch Size | Steps | LoRA r | Loss |
|--------------|------------------|------------|---------|--------|--------|
| Other paper | 31× | 4 | 40,000 | 128 | ~0.2 |
| My training | 4× | 1 | 160,000 | 128 | 0.1841 |
## Main Figure of CAT-RAG(Contrastive Adversarial Training for Retrieval-Aware Generation)

# ...
This work uses contrastive learning, so positive and negative samples are needed. The most intuitive approach is to take the context corresponding to the query (the relevant retrieval) as the positive sample and the contexts of other queries (irrelevant contexts) as negative samples. The problems this may cause are: the positive set can only be built by randomly truncating tokens, which may cut off the correct answer; and because negatives are picked at random, they may already be far from the anchor, so contrastive learning may not change much. How to select positive and negative samples is therefore important. In adversarial attack settings, methods such as FGSM and PGD generate adversarial samples (samples that easily confuse the model) to strengthen model robustness. Here, we could treat irrelevant context as a kind of adversarial attack, and do contrastive learning with generated negatives that are close to the anchor and positives that are farther away, so that the compressed prompt gains a stronger ability to filter out irrelevant context.
Mention the importance of prompt compression in the RAG setting.
This work proposes a contrastive learning method that lets the compressor learn to remove irrelevant retrievals, improving the compression rate and robustness.
It brings in the adversarial idea of finding adversarial samples to strengthen the effect of contrastive learning.
This work aims to achieve three goals:
- a higher compression rate
- higher accuracy
- lower computation cost
# 3/13
~~The compression method needs to achieve two main points. The first is to compress the content with almost no loss; this work will use a query-aware approach for compression and use contrastive learning so that the compressor can better remove irrelevant content. The second is that the compression overhead must be small enough for the method to be efficient; this work will exploit the fact that retrievals are usually mutually independent and use block attention to obtain a sparse-attention effect.~~
~~The paper most related to this work is ICAE. ICAE is a soft-prompt method that fine-tunes a LoRA-adapted LLM to encode the context with memory tokens into memory slots to achieve compression. There are other compression methods based on ICAE, such as AOC, which removes the compressor's MLP layers and reaches comparable results with fewer encoder parameters, and 500x Compressor, which compresses prompts into neural attention key-value pairs instead of explicit representations of memory slots. The purpose of this work is to compress content while taking the query into account. ICAE is chosen because of its simple and clear architecture, which is "orthogonal to other long context modeling studies and can be combined with them to further improve the handling of long contexts in an LLM". The benefits of being query-aware are (1) a higher compression rate, by removing irrelevant content that may cause LM hallucination, and (2) higher accuracy. A related query-aware work is QGC, which uses pooling and a two-layer transformer for query-aware compression but does not exploit the LM's own capability.~~
## Title:
- Query-aware compression through contrastive learning for efficient Retrieval-Augmented Language Models
- Relevance-aware prompt compression through contrastive learning
The compression method must achieve two key points:
1. **Compress the content with almost no loss of fidelity** — this work uses a **query-aware** method for compression and applies **contrastive learning** to improve the **compressor**'s ability to remove irrelevant content.
2. **Reduce the compression overhead to improve efficiency** — this work exploits the fact that retrievals are usually independent of one another and uses **block attention** to realize **sparse attention**, reducing the computational cost.
The paper most relevant to this work is **ICAE**,
which compresses the context by **fine-tuning a LoRA-adapted LLM** to encode it with **memory tokens** into **memory slots**. There are also compression methods built on **ICAE**, for example:
- **AOC**: removes the **compressor**'s **MLP layers**, reducing the **encoder**'s parameter count while reaching comparable results.
- **500x Compressor**: compresses **prompts** into **neural attention key-value pairs** instead of explicit **memory slot** representations.
The core goal of this work is to compress content with a **query-aware** method. **ICAE** is chosen because its architecture is simple and clear, and it performs well in accuracy, compression rate, and speed.
The advantages of the **query-aware** approach are:
1. **Higher compression rate** — irrelevant content that may cause **LM hallucination** is removed.
2. **Higher accuracy** — the compressed content still supports accurate answers.
A related **query-aware** study is **QGC**,
which uses **pooling** and two **transformer layers** for **query-aware** compression but does not fully exploit the **LM**'s own capability.
## 方法
## 1. reconstruction

$\mathcal{L}=\max _{\Theta_{L o R A}, e_m} P\left(\mathbf{w} \mid \mathbf{w}, \mathbf{m} ; \Theta_{L L M}, \Theta_{L o R A}, e_m\right)$
## 2. query-aware contrastive learning

$\mathcal{L}_{CL}=\max _{\Theta_{L o R A}, e_m} \log \frac{e^{\operatorname{sim}\left(\widetilde{m}, \widetilde{m}^+\right) }}{e^{\operatorname{sim}\left(\widetilde{m}, \widetilde{m}^+\right) } + e^{\operatorname{sim}\left(\widetilde{m}, \widetilde{m}^-\right) }}$
$\mathcal{L}_{CE}=\max _{\Theta_{L o R A}, e_m} P\left(\mathbf{w^+} \mid \mathbf{q}, \mathbf{w^+}, \mathbf{w^-}, \mathbf{m}; \Theta_{L L M}, \Theta_{L o R A}, e_m\right)$
$\mathcal{L} = \alpha\mathcal{L}_{CL} + (1-\alpha)\mathcal{L}_{CE}$
## 3. instruction finetuning

$\mathcal{L}=\max _{\Theta_{L o R A}, e_m} P\left(\mathbf{r} \mid \mathbf{q}, \mathbf{w^+}, \mathbf{w^-}, \mathbf{m}; \Theta_{L L M}, \Theta_{L o R A}, e_m\right)$
## Efficient Compressor

# 2/27
Query-aware Compression by Contrastive Learning
### Reconstruction

### Contrastive Learning

### Finetuning

# 2/20
Two improvements based on ICAE:
1. reduce the computation cost of the encoder (compressor)
2. query-aware training

## Reduction of computation cost
Why reduce the computation cost?
The compressor's computation should be much smaller than the decoder's for the compression to be meaningful.
Why does the compressor need fewer parameters?
The compressor's job is to capture the content's information and output the compressed representation; it does not need generation ability, so it "may" not need that many parameters.
Preliminary:
[Attention-Only Compressor](https://arxiv.org/pdf/2501.06730): removes the compressor's MLP and keeps only attention, without LoRA. The number of actually trained parameters is larger than ICAE because LoRA is not used, but the performance is better than ICAE; adding LoRA makes the performance much worse.

[Block Attention](https://arxiv.org/pdf/2409.15355), [Mixture of Block Attention](https://arxiv.org/pdf/2502.13189): attention is computed within each block but not across blocks, and the query attends to all blocks. MoBA adds gating to decide which blocks the query attends to, similar to the MoE idea.
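A minimal sketch of the block-attention pattern described above: each block attends only within itself, while the query attends to everything (causal masking inside blocks is omitted; the sizes are toy values):

```python
import torch

def block_attention_mask(block_sizes, query_len):
    """Boolean attention mask (True = may attend): intra-block attention for the
    retrieved blocks, full attention for the trailing query tokens (sketch)."""
    total = sum(block_sizes) + query_len
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for size in block_sizes:
        mask[start:start + size, start:start + size] = True  # block attends only to itself
        start += size
    mask[start:, :] = True                                    # query attends to all tokens
    return mask

mask = block_attention_mask([3, 3], query_len=2)  # two 3-token blocks + a 2-token query
```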


[Deja vu](https://arxiv.org/pdf/2310.17157): selects only the needed attention heads and MLP weights for inference-time computation.
Idea:
AOC + LORA ✘
AOC + Block Attention
AOC + MOBA
AOC + Deja vu
## Self-supervised Query-aware training
Given input $\mathbf{x}=$ $\left(\mathbf{x}^{i n s}, \mathbf{x}^{d_1}, \ldots, \mathbf{x}^{d_K}, \mathbf{x}^{d^\prime_1}, \ldots, \mathbf{x}^{d^\prime_L}, \mathbf{x}^q\right)$
$\mathbf{x}^{d}$: tokens of relevant document.
$\mathbf{x}^{d^\prime}$: tokens of irrelevant document.
Objective:
$\mathcal{L}_{\mathrm{AE}}=\max _{\Theta_{L o R A}, e_m} P\left(\mathbf{x}^{d} \mid [\mathbf{x}^{d}, \mathbf{x}^{d^\prime}] ; \Theta_{L L M}, \Theta_{L o R A}, e_m\right)$
preliminary:
[Prompt Compression and Contrastive Conditioning for Controllability and Toxicity Reduction in Language Models](https://arxiv.org/pdf/2210.03162): (not yet read in detail) the idea is to apply prompt compression to toxicity reduction and use contrastive conditioning so that the compressed content does not contain toxic words.

Idea:
Treat irrelevant content as toxic words and apply Contrastive Conditioning.
# 2/17
## proposal introduction
Current prompt compression includes soft-prompt methods. The main goals of compression are that the compressed content can be reconstructed into the original content, and that the compressed text helps the LM reach or even exceed the performance it gets with the original, uncompressed content. However, in the RAG setting the content (i.e., the documents retrieved for the query) may contain irrelevant text, causing the LLM to generate wrong answers. Therefore, I believe that also considering relevance to the query during compression can improve accuracy (EM, F1 score), raise the compression rate, and speed up inference.
# 2/14
[Summary](https://docs.google.com/spreadsheets/d/1dowFVNoIoy29bhnm8_XuZZ3iqkCDoNG3I6w1tEGJLOs/edit?usp=sharing)
| Method | Objective |
|:-------------- | --------- |
| xRAG | **pretraining**<br/> reconstruction <br/><font color="#AC19C9">$\mathcal{L}_{\text {nll }}=-\sum_{i=1} \log p_\phi\left(d_i \mid \mathbf{W}(\mathrm{E}), \mathbf{X}_{\text {instruct }}, d_{<i}\right)$</font><br/>**instruction finetuning**<br/>language modeling<br/><font color="#1936C9">$\mathcal{L}_{\text {nll }}=-\sum_{i=1} \log p_\phi\left(\mathbf{X}_{\text {answer }, i} \mid \mathbf{W}\left(\mathrm{E}_{\text {context }}\right), \mathbf{X}_{\text {question }}, \mathbf{X}_{\text {answer },<i}\right)$</font><br/>Distillation<br/><font color="#F7A004">$\mathcal{L}_{\mathrm{kl}}=D_{\mathrm{KL}}\left(p_\phi\left(\mathbf{X}_{\text {answer }} \mid \mathbf{X}_{\text {context }}, \cdot\right) \| p_\phi\left(\mathbf{X}_{\text {answer }} \mid \mathbf{W}\left(\mathrm{E}_{\text {context }}\right), \cdot\right)\right)$</font>|
| GIST | <font color="#1936C9">**instruction finetuning with attention mask**</font> |
| ICAE | **pretraining**<br/>reconstruction<br/><font color="#AC19C9"> $\mathcal{L}_{\mathrm{AE}}=\max _{\widetilde{m_1}, \ldots, \widetilde{m_k}} P\left(\boldsymbol{c} \mid \widetilde{m_1}, \ldots, \widetilde{m_k} ; \Theta_{L L M}\right)=\max _{\Theta_{L o R A}, e_m} P\left(\boldsymbol{c} \mid m_1 \ldots m_k ; \Theta_{L L M}, \Theta_{L o R A}, e_m\right)$</font><br/>text continuation<br/><font color="#f00">$\mathcal{L}_{\mathrm{LM}}=\max _{\widetilde{m_1}, \ldots, \widetilde{m_k}} P\left(\boldsymbol{o} \mid \widetilde{m_1}, \ldots, \widetilde{m_k} ; \Theta_{L L M}\right)=\max _{\Theta_{L o R A}, e_m} P\left(\boldsymbol{o} \mid m_1 \ldots m_k ; \Theta_{L L M}, \Theta_{L o R A}, e_m\right)$</font><br/>**instruction finetuning**<br/><font color="#1936C9">$\begin{aligned} \mathcal{L}_{\mathrm{FT}} & =\max _{\widetilde{m_1} \ldots \widetilde{m_k}} P\left(r_1 \ldots r_n \mid \widetilde{m_1} \ldots \widetilde{m_k}, p_1 \ldots p_m ; \Theta_{L L M}\right) \\ & =\max _{\Theta_{L o R A}, e_m} P\left(r_1 \ldots r_n \mid m_1 \ldots m_k, p_1 \ldots p_m ; \Theta_{L L M}, \Theta_{L o R A}, e_m\right)\end{aligned}$</font> |
| 500xCompressor | **pretraining**<br/><font color="#AC19C9">$\mathcal{L}_{\mathrm{P}}=-\sum_{i=1}^l \log P\left(t_i \mid \mathbf{H}_{\mathbf{C}},[\mathbf{B O S}], t_{1: i-1} ; \mathbf{\Theta}_{\mathrm{LLM}}, \boldsymbol{\Theta}_{\mathrm{Lora}}\right)$</font><br/>**instruction finetuning**<br/><font color="#1936C9">$\mathcal{L}_{\mathrm{F}}=-\sum_{j=1}^n \log P\left(a_j \mid \mathbf{H}_{\mathbf{C}}, q_{1: m}, a_{1: j-1} ; \boldsymbol{\Theta}_{\mathrm{LLM}}, \boldsymbol{\Theta}_{\mathrm{Lora}}\right)$</font> |
| COCOM | **pretraining**<br/>reconstruction<br/><font color="#AC19C9">$\mathcal{L}\left(\theta_{L L M}, \phi_{\text {comp }}\right)=-\sum_{x_t \in \mathcal{X}} \log P_{\theta_{L L M}}\left(x_t \mid \mathcal{E}, x_1, \ldots, x_{t-1}\right)$</font><br/>text continuation<br/><font color="#f00">$\mathcal{L}\left(\theta_{L L M}, \phi_{c o m p}\right)=-\sum_{x_t \in \mathcal{X}_B} \log P_{\theta_{L L M}}\left(x_t \mid \phi_{c o m p}\left(X_A\right), x_1, \ldots, x_{t-1}\right)$</font><br/>**instruction finetuning**<br/><font color="#1936C9">$\mathcal{L}\left(\theta_{L L M}, \phi_{c o m p}\right)=-\sum_{r_t \in R} \log P_{\theta_{L L M}}\left(r_t \mid I_{\mathcal{E}, q}, r_1, r_2, \ldots, r_{t-1}\right)$</font> |
| ReComp | **contrastive learning for extractive compressor**<br/><font color="#409913">$-\log \frac{e^{\operatorname{sim}\left(\mathbf{x}_{\mathbf{i}}, p_i\right)}}{e^{\operatorname{sim}\left(\mathbf{x}_{\mathbf{i}}, p_i\right)}+\sum_{n_j \in N_i} e^{\operatorname{sim}\left(\mathbf{x}_{\mathbf{i}}, n_j\right)}}$</font><br/><font color="#1936C9">**instruction finetuning for abstractive compressor**</font> |

In the compressor-decoder architecture, the compressor only takes the documents to be compressed as input, without the query. The query is only added at the decoder, whose output is compared with the correct answer. The corresponding loss is the instruction-finetuning loss <font color="#1936C9">$\mathcal{L}\left(\theta_{L L M}, \phi_{c o m p}\right)=-\sum_{r_t \in R} \log P_{\theta_{L L M}}\left(r_t \mid I_{\mathcal{E}, q}, r_1, r_2, \ldots, r_{t-1}\right)$</font>.
Such a loss does consider how to compress the text so that the decoder can output the correct answer, but it is only implicit about how to remove the irrelevant parts of the documents.
I therefore propose to feed both the query and the documents to the compressor and let it learn, besides compressing, to remove irrelevant text.
## Method
Relevance-aware Compression
## KPI
Compared to [ICAE](https://arxiv.org/abs/2307.06945)
| method | compression rate | speedup | F1 score |
|:------:|:----------------:| ------- |:--------:|
| ICAE | 4X | 3x | 44.48 |
| Ours | 6X | 3.5x | 46 |
- ICAE does not report FLOPs, so I use speedup instead.
- The compression rate can improve because our method is (expected to be) able to remove irrelevant text, which ICAE and other soft-prompt compression methods do not consider.
- The speedup can improve because the compressed representation is smaller, so the computation is smaller and generation is faster.
- The F1 score measures how well the generated answer matches the correct answer. Because our method removes irrelevant text, it can keep the LLM from generating wrong answers due to unnecessary text. ICAE's F1 score is taken from [500xCompressor](https://arxiv.org/abs/2408.03094).
- The main trade-off of our method is the training cost: besides the original training stages, a relevance-aware compression training stage is added so the compressor learns to remove irrelevant text.
# 2/7
xRAG

$1*4096(dim_{ret})$ -> projector -> $1*dim_{llm}$
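A minimal sketch of such a projector: a small MLP that maps one retriever embedding to one LLM-token embedding (the hidden size and activation are assumptions; xRAG's actual projector may differ):

```python
import torch.nn as nn

class Projector(nn.Module):
    """Maps a retriever embedding (dim_ret) to a single soft-token embedding (dim_llm). Sketch."""
    def __init__(self, dim_ret=4096, dim_llm=4096, hidden=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_ret, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim_llm),
        )

    def forward(self, retriever_emb):      # (batch, dim_ret)
        return self.net(retriever_emb)     # (batch, dim_llm), used as one token of the LLM input
```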
### training objective:
1.Paraphrase Pretraining
$\mathcal{L}_{\text {nll }}=-\sum_{i=1} \log p_\phi\left(d_i \mid \mathbf{W}(\mathrm{E}), \mathbf{X}_{\text {instruct }}, d_{<i}\right)$
2.Context-aware Instruction Tuning
Language Modeling
$\mathcal{L}_{\text {nll }}=-\sum_{i=1} \log p_\phi\left(\mathbf{X}_{\text {answer }, i} \mid \mathbf{W}\left(\mathrm{E}_{\text {context }}\right), \mathbf{X}_{\text {question }}, \mathbf{X}_{\text {answer },<i}\right)$
Self-Distillation
$\mathcal{L}_{\mathrm{kl}}=D_{\mathrm{KL}}\left(p_\phi\left(\mathbf{X}_{\text {answer }} \mid \mathbf{X}_{\text {context }}, \cdot\right) \| p_\phi\left(\mathbf{X}_{\text {answer }} \mid \mathbf{W}\left(\mathrm{E}_{\text {context }}\right), \cdot\right)\right)$
$\mathcal{L}_{n l l}+\alpha \mathcal{L}_{k l}$
### Question
- Is it that simple to mitigate the modality gap?

In another paper (COCOM), xRAG's reported experimental results are quite poor.
- How to compress to different compression rates?

# 1/23
## Main source paper
xRAG

## Method
The original model input length is $|D| + |q|$.
This work learns a projector so that the model's input length becomes $1 + |q|$.
To let the LLM use embeddings from the retriever model, the authors designed a two-stage training procedure (the projector is the only module that is trained; everything else is frozen).
- Paraphrase Pretraining
$\mathcal{L}_{\text {nll }}=-\sum_{i=1} \log p_\phi\left(d_i \mid \mathbf{W}(\mathrm{E}), \mathbf{X}_{\text {instruct }}, d_{<i}\right)$
This makes the LM able to recover the originally retrieved document from W(E).
- Context-aware Instruction Tuning
$\mathcal{L}_{\text {nll }}=-\sum_{i=1} \log p_\phi\left(\mathbf{X}_{\text {answer }, i} \mid \mathbf{W}\left(\mathbf{E}_{\text {context }}\right), \mathbf{X}_{\text {question }}, \mathbf{X}_{\text {answer },<i}\right)$
This lets the LM output the correct answer given only W(E) as the context.
$\mathcal{L}_{\mathrm{kl}}=D_{\mathrm{KL}}\left(p_\phi\left(\mathbf{X}_{\text {answer }} \mid \mathbf{X}_{\text {context }}, \cdot\right) \| p_\phi\left(\mathbf{X}_{\text {answer }} \mid \mathbf{W}\left(\mathrm{E}_{\text {context }}\right), \cdot\right)\right)$
This pulls the output distribution that uses the compressed embedding toward the distribution that uses the original text.
Total loss of second stage = $\mathcal{L}_{n l l}+\alpha \mathcal{L}_{k l}$.
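A minimal sketch of the self-distillation term, assuming the "teacher" logits come from the run that sees the full retrieved text and the "student" logits from the run that sees only W(E):

```python
import torch.nn.functional as F

def self_distillation_kl(teacher_logits, student_logits):
    """D_KL( p(answer | X_context) || p(answer | W(E)) ), per the L_kl above (sketch)."""
    p_teacher = F.softmax(teacher_logits, dim=-1)
    log_p_student = F.log_softmax(student_logits, dim=-1)
    # F.kl_div expects log-probabilities as input and probabilities as target.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```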
## Key point
- modality fusion
Here "modality" refers to the modalities of the retriever model and the language model. A projector is learned so that the vectorstore embeddings can be converted directly into language-model embeddings. Among the papers I have surveyed, this is the only one using this approach.
## Strengths
Low cost: compared with other methods, this work only uses a two-layer MLP on top of the original retrieval model, which is very efficient.
The loss design is very thorough.
## Weaknesses
- Constrained to RAG-only scenarios, because it needs the retrieval model's embeddings (i.e., the vectorstore).
- Compression-ratio limitation: compression should use a reasonable ratio to avoid the semantic distortion caused by over-compression. In addition, this work only considers the top-1 retrieval, presumably because the training method cannot flexibly adjust the token length. Some retrievals contain too much information to be represented by a single token, so adding a mechanism that lets the LM adapt to flexible lengths is important.
## Possible improvements
1. Add a module that decides the total length of the compressed tokens, and add variable-length training so the model can adapt to different levels of compression.
2. Instead of adding an extra module, add parameters to the projector (originally a two-layer MLP) that can vary the compressed token length; it might be changed to a BERT-based model.

For dynamic length, COCOM can be a reference: it also generates embeddings with a module, but COCOM's input is text while xRAG's input is embeddings (from the retriever model).
## In short
The original xRAG compresses into a single token embedding; I want the compressed length to be dynamic, which should lose less semantics.
## Other source papers
This is only a rough summary. I think the most related paper is COCOM; how relevant the other papers are to xRAG still needs to be explored.
[Summary of other RAG compression studies](https://docs.google.com/spreadsheets/d/1dowFVNoIoy29bhnm8_XuZZ3iqkCDoNG3I6w1tEGJLOs/edit?usp=sharing)
# 1/10
[BLOCK-ATTENTION FOR EFFICIENT RAG, 2024](https://arxiv.org/abs/2409.15355)

The method consists of three parts:
- BLOCK SEGMENTATION
The authors assume retrievals are mutually independent and therefore segment by retrieval. I think this part can be further optimized.
- POSITION RE-ENCODING
The same retrieved passage may appear in different inputs at different positions, and the KV cache depends on the positional encoding, so the positions must be re-encoded before the KV cache can be reused.

- BLOCK FINE-TUNE
Fine-tune the model under block attention.
## Experiments


## 問題和可能的方向
- Block segmentation 間是否可能不互為獨立。可能的方法是透過某種方式去計算玟檔間的依賴性,並重性設計attention的方法。
- 蒸餾小的語言模型,根據self-information和mutual information計算檢索間的依賴程度,再根據依賴程度切割檢索成不同的Block。
- 
這篇是google 的論文 sparse RAG,通過模型計算使否保留檢索,並保留KV值用在之後的生成。也許可以使用類似的方法計算依賴程度,但因為該研究實驗並沒有表現出推論時間的優勢,這邊可能還要在想想改進的方法。
- 使用這樣的方法是否會使得不同的block attention的可能性更多,讓kv-cache無法展現優勢。
- 使用KV-cache為何會增加準確度?
[PCW, Parallel Context Windows for Large Language Models, ACL 2023](https://arxiv.org/abs/2212.10947)

Like block attention, it consists of three parts:
- carve a long context into chunks (“windows”)
- restrict the attention mechanism
- re-use the positional embeddings across the windows
But the goal of this paper is to break through the LLM's context-length limit, which differs from block attention's goal of speeding up RAG inference. Why similar methods lead to different outcomes needs a closer look.
In the Block Attention paper, the authors mention that using this method in the RAG setting actually performs worse (in accuracy) than the weak baseline (a model that uses block attention but without block fine-tuning).
[PROMPT CACHE: MODULAR ATTENTION REUSE FOR LOW-LATENCY INFERENCE, MLSys 2024](https://arxiv.org/abs/2311.04934)
# 1/3
## selective context

$I\left(x_i\right)=-\log _2 P\left(x_i \mid x_0, x_1, \ldots, x_{i-1}\right)$
LLM input "Large language models (LLMs) have demonstrated remarkable power and impressive generalisation abilities across a wide range of natural language processing tasks."
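A minimal sketch of computing per-token self-information for this input with a small causal LM (GPT-2 here is an assumption; Selective Context uses a base causal language model):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

text = ("Large language models (LLMs) have demonstrated remarkable power and impressive "
        "generalisation abilities across a wide range of natural language processing tasks.")
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits                              # (1, seq_len, vocab)
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)        # predictions for tokens 2..n
token_logp = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
self_info = -token_logp / torch.log(torch.tensor(2.0))       # I(x_i) in bits, one value per token
```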
### conditional probabilities

### self information

## entropy
$\mathrm{H}(X)=-\sum_{i=1}^n p\left(x_i\right) \log _b p\left(x_i\right)$
## mutual information
$\begin{aligned} \mathrm{I}(X ; Y) & \equiv \mathrm{H}(X)-\mathrm{H}(X \mid Y) \\ & \equiv \mathrm{H}(Y)-\mathrm{H}(Y \mid X) \\ & \equiv \mathrm{H}(X)+\mathrm{H}(Y)-\mathrm{H}(X, Y) \\ & \equiv \mathrm{H}(X, Y)-\mathrm{H}(X \mid Y)-\mathrm{H}(Y \mid X)\end{aligned}$
In information theory, mutual information (MI) quantifies the dependence between two random variables X and Y. When X and Y are independent, MI is 0, because X carries no information about Y and vice versa. When X and Y are identical, all information carried by X is also shared by Y; in that case MI equals the entropy of X (or Y), because knowing X is equivalent to knowing Y and the shared information is maximal.
### **Computation in an LLM**:
1. **When the two sentences are identical**:
    - If X and Y are exactly the same sentence, they are identical in semantics and structure, so their MI should equal H(X) or H(Y) (the two entropies are equal). Knowing X fully determines Y, the shared information is maximal, and the conditional entropies H(X|Y) and H(Y|X) are both 0.
2. **When the two sentences differ but are related**:
    - If X and Y are different sentences with some semantic relation or dependency, MI reflects the amount of information they share. The conditional entropies H(X|Y) and H(Y|X) are smaller than H(X) and H(Y), so MI, I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X), is greater than 0.
3. **When the two sentences are completely unrelated**:
    - If X and Y are completely unrelated (statistically independent), their MI I(X;Y) is 0, because they share no information. In this case H(X|Y) = H(X) and H(Y|X) = H(Y).
---
### example of similar retrieval
- from Introduction
In this paper, **we propose Selective Context, which prunes redundant content in a given input context, thereby reducing the computational cost and making better use of the fixed context length in LLMs**. Selective Context evaluates informativeness of lexical units (i.e., tokens, phrases, or sentences) with self-information (Shannon, 1948) computed by a base causal language model. By selectively retaining content with higher self-information, our method provides a more compact and efficient context representation for LLMs to process without compromising their performance on various applications.
- from conclusion
**We introduced Selective Context to improve the context efficiency of LLMs in inference by deleting redundant content measured by self-information.** Our extensive experiments on arXiv papers, BBC news articles, and conversation transcripts showed that our proposed method can significantly reduce GPU memory cost, accelerate generation with minor performance decrease, and potentially enable LLMs to handle long documents and extended conversations without the risk of context truncation.
## Prompt compression

### Hard Prompt Methods
#### SelectiveContext

This method removes redundant or less informative parts of the prompt by quantifying informativeness using self-information.
#### LLMLingua

LLMLingua uses a smaller language model (e.g., GPT-2) to calculate self-information or perplexity and removes redundant tokens.
#### Nano-Capsulator

Nano-Capsulator summarizes the prompt into a concise version, removing irrelevant information while retaining key meanings through semantic preservation. It requires more computational resources due to the additional inference step but enhances task performance by optimizing prompt utility.
### Soft Prompt Methods
#### Contrastive Conditioning (CC)

CC trains a shorter soft prompt to approximate the output distribution of a natural language prompt by minimizing the Kullback-Leibler (KL) divergence. This method can fine-tune soft prompts to produce specific attributes (e.g., enhanced sentiment). However, each soft prompt is specific to one natural language prompt, limiting its generalization ability.
#### Gist Tokens (GIST)

GIST modifies the attention mechanism in the LLM by appending compressed tokens after the original prompt. These compressed tokens attend only to each other, separating attention flows. GIST allows compression of unseen prompts without fine-tuning, but it is limited by the maximum compressive prompt length.
#### AutoCompressor

AutoCompressor handles long context compression by dividing the original prompt into sub-prompts and compressing them recursively. This method can compress up to 30,720 tokens but requires a time-consuming training process and the compressed tokens cannot be used with untuned LLMs.
### Challenge
#### Attention:
- current prompt compression methods have not been compared to traditional attention optimization methods, such as sliding window attention and sparse attention.
- Hard prompts may be compressed based on attention rather than the calculation of self-information.
### Information Theory
## Judging the relevance between a retrieval and the question
### inspiration
1. [MAKING RETRIEVAL-AUGMENTED LANGUAGE MODELS ROBUST TO IRRELEVANT CONTEXT](https://arxiv.org/pdf/2310.01558), ICLR 2024.
This paper uses an NLI (natural language inference) model as a baseline.
q: question
a: answer
d: retrieved document
NLI(q,a,d) = entailed, neutral or contradicted
or
NLI(q,a,d) = Low-Entailment, Med-Entailment or High-Entailment

We can see that in some Low-Entailment cases, using the retrieval still improves performance.
This means NLI cannot really reflect the effect a retrieval has on generation.
Why? Because NLI does not consider the relationship between the model's internal (parametric) knowledge and the retrieved document.
2. [RE-RAG: Improving Open-Domain QA Performance and Interpretability
with Relevance Estimator in Retrieval-Augmented Generation](https://aclanthology.org/2024.emnlp-main.1236.pdf)
This paper uses an NLI-like approach called a Relevance Estimator.

Here the Relevance Estimator plays the role of a reranker.
It proposes jointly training the RE and the LLM:
- $\mathbf{L}_{\mathrm{tot}}=\mathbf{L}_{\mathrm{gen}}+\alpha_1 \mathbf{L}_{\mathrm{re}}+\alpha_2 \mathbf{L}_{\mathrm{tok}}$
- $\mathbf{L}_{\mathrm{gen}}=-\sum_{i, j} \log \left(\mathbf{P}_{\mathbf{R E}}\left(\mathbf{q}_i, \mathbf{c}_j\right) \cdot \mathbf{P}_G\left(\mathbf{a}_i \mid \mathbf{q}_i, \mathbf{c}_j\right)\right)$
- $\mathbf{L}_{\text {tok }}=\sum_{t \in T \backslash\{\text { "true","false" }\}} \mathbf{P}\left(t \mid \mathbf{q}_i, \mathbf{c}_k\right)$
- $\mathbf{L}_{\mathrm{re}}=D_{\mathrm{KL}}\left(\mathbf{P}_{R E}\left(\mathbf{q}_i, \mathbf{c}_j\right) \| \mathbf{Q}_G\left(\mathbf{q}_i, \mathbf{c}_j\right)\right)$
- $\begin{gathered}\mathbf{F}_{i, j}=\log \left(\mathbf{P}_G\left(\mathbf{a}_i \mid \mathbf{q}_i, \mathbf{c}_j\right)\right) \\ \mathbf{Q}_G\left(\mathbf{q}_i, \mathbf{c}_j\right)=\frac{e^{\mathbf{F}_{i, j}}}{\sum_k e^{\mathbf{F}_{i, k}}}\end{gathered}$
3. [DISCOVERING LATENT KNOWLEDGE IN LANGUAGE
MODELS WITHOUT SUPERVISION](https://arxiv.org/pdf/2212.03827)


CCS: uses latent representations to judge whether a statement is true or false.
0-shot: directly lets the model output true or false.
Using latents gives more accurate judgments, and this idea can be applied to judging the relevance between a document and a question.
4. [Are LLMs Aware that Some Questions are not Open-ended?](https://arxiv.org/pdf/2410.00423)

Although this paper is about whether the LLM knows a question is open-ended, it likewise uses latents for the judgment.
It uses the last layer, but earlier layers might already work well.
### Idea
Paper 1 (NLI) argues that NLI does not work well because it ignores the model's internal knowledge.
Paper 2 (Relevance Estimator) also uses an NLI-like method to rerank retrievals.
Papers 3 (Discovering Latent Knowledge) and 4 (LLMs aware of open-ended questions) both use latents for the judgment.
Therefore I think:
- A method similar to 1 (NLI) and 2 (Relevance Estimator), combined with latents, can be used to judge the correctness of a retrieval.
- The joint training of 2 (Relevance Estimator) can be combined with adversarial training (some papers use it to make LLMs robust to retrievals) to train the estimator and the LLM together.
- Pros
    - Using latents enables better judgments.
    - Joint training aligns the estimator and the LM on the same objective; the LLM's generation results can feed back to the estimator and help it learn to evaluate retrieval relevance more precisely.
- Cons
    - Lacks theoretical support; more papers supporting the related theory need to be surveyed.
In short: RE(q,d) -> RE(q,d,latent(q,d)) (see the sketch below).
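A minimal sketch of RE(q,d,latent(q,d)) as a small head on top of a pooled LLM hidden state (the pooling choice, the layer, and the head size are assumptions, not a finalized design):

```python
import torch
import torch.nn as nn

class LatentRelevanceEstimator(nn.Module):
    """Scores a (question, document) pair from the LLM's hidden states for "[q; d]". Sketch."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(hidden_dim, hidden_dim // 4),
                                  nn.GELU(),
                                  nn.Linear(hidden_dim // 4, 1))

    def forward(self, hidden_states, attention_mask):
        # hidden_states: (batch, seq_len, hidden_dim) from a chosen LLM layer
        mask = attention_mask.unsqueeze(-1).float()
        latent = (hidden_states * mask).sum(1) / mask.sum(1).clamp(min=1e-9)  # latent(q, d)
        return self.head(latent).squeeze(-1)                                  # relevance logit
```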
## 10/4
### [RAG Enhancement](https://github.com/hymie122/RAG-Survey?tab=readme-ov-file)

#### Input Enhancement
- Query Transformation
- It enhances the result of retrieval by modifying the input query
- Data Augmentation
- It improves data before retrieval, including techniques such as **removing irrelevant information**, **eliminating ambiguity**, **updating outdated documents**, **synthesizing new data**, etc.
#### Retriever Enhancement
- Recursive Retrieval
- It performs multiple searches to retrieve richer and higher-quality contents.
- Chunk Optimization
- It adjusts chunk size for improved retrieval results.
- Retriever Finetuning
- The retriever, central to the RAG system, relies on a proficient embedding model to represent related content and feed the generator, enhancing system performance.
- Hybrid Retrieval
- Re-ranking
- Retrieval Transformation
#### Generator Enhancement
- Prompt Engineering
- Technologies in prompt engineering that focus on improving the quality of LLMs’ output, such as prompt compression, Stepback Prompt , Active Prompt, Chain of Thought Prompt, etc., are all applicable to LLM generators in RAG systems.
- Decoding Tuning
- Decoding tuning involves enhancing generator control by fine-tuning hyperparameters for increased diversity and constraining the output vocabulary, among other adjustments.
#### Result Enhancement
- Output Rewrite
#### RAG Pipeline Enhancement
- Adaptive Retrieval
- It determines when and what to retrieve.
- Rule-based
- Model-based
- Iterative RAG
- Iterative RAG progressively refines results by repeatedly cycling through retrieval and generation phases, rather than a single round.
#### Papers to study
- [Retro](https://arxiv.org/pdf/2112.04426)


- It finetunes the LLM with retrieval data.
- 878 Citations
- [Atlas](https://arxiv.org/pdf/2208.03299)
- 
## 10/28
### pipeline enhancement
### retriever enhancement
----
### pipeline enhancement
#### adaptive retrieval: rule-based or model-based
- FLARE(rule-based)
Generate first, then use the probability of each token as a rule to decide when to take a retrieval step (see the sketch at the end of this subsection).


- SELF-RAG(model-based)
Train the LM to determine whether to retrieve, and train another LM to determine whether the retrieved document is useful/supportive/relevant.


My thought on model-based adaptive retrieval:
- if we can extract latent representations and use them to determine when to retrieve.
- Representations are mostly studied on sequence-to-sequence (encoder-decoder) models.
Fusion-in-Decoder method.

Eliciting latent knowledge

Maybe it is possible to use prefix tuning to train the LM to determine whether to retrieve.
Or, train an additional model whose input is the LM's representations and whose output determines whether to retrieve and whether there is a noisy document in the input.
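A minimal sketch of the FLARE-style rule mentioned above: trigger a retrieval step when any token in the tentatively generated sentence has low probability (the threshold is an assumption):

```python
import torch

def should_retrieve(token_logprobs, threshold=-1.5):
    """Rule-based trigger (sketch): retrieve if any generated token falls below the threshold."""
    return bool((torch.tensor(token_logprobs) < threshold).any())

should_retrieve([-0.1, -0.3, -2.4, -0.2])  # True: one uncertain token triggers retrieval
```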
**benchmark**
- Multihop QA: HotpotQA, 2WikiMultiHopQA ...
- Benchmarking Large Language Models in RAG

---
### retriever enhancement
Mostly about efficiency and, secondly, accuracy.
Bottleneck: it is costly to go through all the documents when training the retriever.
Corrector Networks

Use stale embeddings to select some candidate documents and then update the retriever.
Train a (relatively small) corrector to refresh the stale embeddings.
Besides training only the retriever, it also provides a way to train the retriever jointly with the LM, in a way that is unsupervised (from the retriever's perspective).

Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval

It uses the proposed method ANCE (Approximate Nearest Neighbor Negative Contrastive Estimation) to select hard training negatives globally from the entire corpus, using an asynchronously updated ANN index.
---
Retriever enhancement mostly involves more theoretical analysis.
Besides training only the retriever, I prefer to find a method that trains it jointly with RAG.