<!-- #### Meeting Schedule
<iframe src="https://calendar.google.com/calendar/embed?src=nycumllab%40gmail.com&ctz=Asia%2FTaipei" style="border: 0" width="700" height="400" frameborder="0" scrolling="no"></iframe>
-->
# Slide
[0429](https://docs.google.com/presentation/d/1AFw65eoL2MY0rt738oOKzOjmp4OTVIB0KiF522I5I5U/edit?usp=sharing)
# NLG PROGRESS
# Papers we reference and follow
[paper list on github](https://github.com/Timothyxxx/RetrivalLMPapers)
# [REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/pdf/2002.08909.pdf)
- Google Research
- ICML 2020
- Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, Ming-Wei Chang

$$
\nabla\log p(y|x)=\sum_{z\in\mathcal Z}r(z,x)p(z|x)\nabla f(x,z)\\
r(z,x)=\left[\frac{p(y|z,x)}{\mathbb E[p(y|z,x)]}-1\right]
$$
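For context, the gradient identity above follows from differentiating the marginal likelihood $p(y|x)=\sum_{z}p(y|z,x)\,p(z|x)$ with $p(z|x)\propto\exp f(x,z)$; sketching the steps in the paper's notation:

$$
\nabla p(z|x)=p(z|x)\Big(\nabla f(x,z)-\sum_{z'\in\mathcal Z}p(z'|x)\nabla f(x,z')\Big)
$$

$$
\nabla\log p(y|x)=\sum_{z\in\mathcal Z}\frac{p(y|z,x)}{p(y|x)}\nabla p(z|x)=\sum_{z\in\mathcal Z}p(z|x)\left[\frac{p(y|z,x)}{p(y|x)}-1\right]\nabla f(x,z)
$$

where $p(y|x)=\mathbb E_{z\sim p(z|x)}[p(y|z,x)]$, which matches the definition of $r(z,x)$ above.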
# [Decoupled Context Processing for Context Augmented Language Modeling](https://arxiv.org/abs/2210.05758)
- Google Research
- Zonglin Li, Ruiqi Guo, Sanjiv Kumar
- NeurIPS 2022

# [Decoupling Knowledge from Memorization: Retrieval-augmented Prompt Learning](https://arxiv.org/abs/2205.14704)
- NeurIPS 2022 (Spotlight)
- Xiang Chen, Lei Li, Ningyu Zhang, Xiaozhuan Liang, Shumin Deng, Chuanqi Tan, Fei Huang, Luo Si, Huajun Chen
- Alibaba Group

# [Unsupervised Dense Information Retrieval with Contrastive Learning](https://arxiv.org/pdf/2112.09118.pdf)
- TMLR 2022
- Meta AI Research



# Our Idea
- Implement [Decoupling Knowledge from Memorization: Retrieval-augmented Prompt Learning](https://arxiv.org/abs/2205.14704)
- Using Contriever and LLaMA
- Fine-tune the knowledge encoder

* add VICReg criterion
* add trustworthiness
* dataset issues
* knowledge context too long
# Task Assignment
**志軒: leader**
Requirements:
- Dataset splitting (和宗)
- Prefix tuning coding (志軒)
- Contriever fine-tuning (needed or not?) (宇喆)
- Adding regularization on the encoder output (fairness?) (伯鈞)
- Website development (jackson)
Bonus:
- Multiple prompt tuning? (jackson) very slow
## Contriever fine-tuning
### VICReg
- Three hyperparameters need tuning
- A positive pair is a question and its long answer
| epoch | device | lr | batch size | lr_scheduler | optimizer | final loss | training time |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 200 | 3080 Ti | 1e-4 | 64 | ReduceLROnPlateau(mode='min', factor=0.2, patience=10, cooldown=10, min_lr=1e-6) | AdamW(lr=1e-5, weight_decay=1e-4) | 17.8004 | 94.4 hr |
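For reference, a minimal NumPy sketch of the VICReg objective with its three loss coefficients (the three hyperparameters mentioned above); the coefficient values follow common defaults and are assumptions, not the values used in our runs:

```python
import numpy as np

def vicreg_loss(z1, z2, sim_coef=25.0, var_coef=25.0, cov_coef=1.0):
    """VICReg: invariance + variance + covariance terms.

    z1, z2: (n, d) embeddings of the two views (e.g. question / long answer).
    The three coefficients are the hyperparameters that need tuning.
    """
    n, d = z1.shape
    # invariance term: mean squared distance between the two views
    sim_loss = np.mean((z1 - z2) ** 2)
    # variance term: hinge loss keeping each dimension's std above 1
    std1 = np.sqrt(z1.var(axis=0) + 1e-4)
    std2 = np.sqrt(z2.var(axis=0) + 1e-4)
    var_loss = np.mean(np.maximum(0, 1 - std1)) + np.mean(np.maximum(0, 1 - std2))
    # covariance term: push off-diagonal covariance entries toward zero
    z1c = z1 - z1.mean(axis=0)
    z2c = z2 - z2.mean(axis=0)
    cov1 = (z1c.T @ z1c) / (n - 1)
    cov2 = (z2c.T @ z2c) / (n - 1)
    off_diag = lambda m: m[~np.eye(d, dtype=bool)]
    cov_loss = (off_diag(cov1) ** 2).sum() / d + (off_diag(cov2) ** 2).sum() / d
    return sim_coef * sim_loss + var_coef * var_loss + cov_coef * cov_loss
```
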
## About prefix tuning

There are **40 layers** in LLaMA; each layer needs keys and values, and the shape of a key/value is (40, n, 128): **40 heads** with **128 dimensions** each. The final shape is therefore **(40, 2, B, 40, n, 128)**, where n is the document length and B is the batch size.
In the knowledge encoder, we use **80 different heads** to compute the keys and values for the different layers.
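To make the tensor bookkeeping concrete, a small shape-only sketch; only the dimensions come from the description above, while the encoder output layout and the reshape/transpose order are illustrative assumptions:

```python
import numpy as np

n_layers, n_heads, head_dim = 40, 40, 128  # LLaMA config from the notes
B, n = 2, 16                               # batch size, document length (example values)

# the knowledge encoder's 80 output heads: one key head + one value head per layer
encoder_out = np.random.randn(2 * n_layers, B, n, n_heads * head_dim)

# rearrange into the prefix cache shape (n_layers, 2, B, n_heads, n, head_dim)
prefix = (encoder_out
          .reshape(n_layers, 2, B, n, n_heads, head_dim)
          .transpose(0, 1, 2, 4, 3, 5))
assert prefix.shape == (40, 2, B, 40, n, 128)
```
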
(10/29) Finished concatenating the knowledge encoder and LLaMA (dynamic prefix) at inference time.
(10/30, bug fixed) There were still some bugs in batch training when document lengths differ; ~~should be handled in the collate function~~, handled on the model side instead.
(10/31) Finished all code for the dynamic prefix.
(future work) Finish the learning algorithm.
* calculate the cross-entropy loss of the LM output, then backpropagate gradients to the encoder
$$\mathcal L(\theta)=-\sum_{t=0}^{T}\log P_{LM}(y_t|y_0,\dots,y_{t-1},q,\text{Enc}_\theta(z))
$$
$$z=\underset{z\in\mathcal Z}{\arg\max}\,\text{Emb}(q)\cdot\text{Emb}(z)
$$
* fairness loss on the encoder output
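The objective above is a standard token-level cross entropy; a minimal NumPy sketch (the logits and targets here are placeholders, not real LLaMA outputs):

```python
import numpy as np

def lm_cross_entropy(logits, targets):
    """Token-level cross entropy: -sum_t log P_LM(y_t | y_<t, q, Enc(z)).

    logits: (T, V) unnormalized scores for each position, targets: (T,) token ids.
    """
    # log-softmax with the usual max-subtraction for numerical stability
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # pick out the log-probability of each target token and negate the sum
    return -log_probs[np.arange(len(targets)), targets].sum()
```
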
## About dataset preparation
Cut documents into segments, with a **288-token** window size and a **64-token** step size.
(10/30) Document segmentation is almost finished; some I/O optimization remains, rewriting the code to be multithreaded.
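The sliding-window segmentation can be sketched as follows (a minimal single-threaded version; the real pipeline operates on tokenized documents and is multithreaded):

```python
def segment_tokens(tokens, window=288, step=64):
    """Cut a token sequence into overlapping segments
    (288-token window, 64-token step, as described above)."""
    if len(tokens) <= window:
        return [tokens]
    segments = []
    for start in range(0, len(tokens), step):
        segments.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window reaches the end of the document
    return segments
```
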
<!-- ## Dynamic Prefix with Fairness Enhancement for Retrieval-augmented Question Answering -->
## 11/13
Some problems were discovered when integrating the system:
* Storing the entire document embedding matrix in memory: there are 8 million documents, each converted into a 768-dimensional vector, for a total memory usage of about 26 GB.
* Each module is complete; the remaining integration work is dealing with the various bugs that surfaced after integration:
    * the document set is not available when integrating the retriever
    * the knowledge encoder was not updated when trained with LLaMA
    * we cannot use all GPU resources during training
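The memory figure can be sanity-checked with simple arithmetic (assuming float32 embeddings; exactly 8 million vectors gives a bit under the quoted 26 GB, which corresponds to slightly more documents):

```python
n_docs = 8_000_000
dim = 768
bytes_per_float = 4  # float32

total_gb = n_docs * dim * bytes_per_float / 1e9
print(f"{total_gb:.1f} GB")  # prints 24.6 GB
```
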
# 11/21
## Contriever bug fixes
- VICReg would not train properly (validation top-5 acc 0.7, getting worse with longer training) → changed to InfoNCE
- tokenizer max length for long answers: 128 → 512
- long answer candidate → true long answer
- some long answers are None → skip them in the collate function
## InfoNCE
| epoch | device | lr | batch size | lr_scheduler | optimizer | final loss | training time | validation top-5 acc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 20 | 4090 | 3e-5 | 32 | ReduceLROnPlateau(mode='min', factor=0.5, patience=20, cooldown=20, min_lr=1e-5) | AdamW(lr=3e-5, weight_decay=1e-2) | 0.0145 | 46×20 min | 0.8847 |
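For reference, a minimal NumPy sketch of InfoNCE with in-batch negatives: each question's positive is the same-index long answer, and the other answers in the batch serve as negatives. The temperature value is illustrative, not the one used in training:

```python
import numpy as np

def info_nce(q, d, temperature=0.05):
    """InfoNCE over a batch of (question, long answer) embedding pairs.

    q, d: (B, dim) embeddings; row i of d is the positive for row i of q.
    """
    # cosine similarity: normalize, then take the (B, B) dot-product matrix
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature
    # row-wise log-softmax; the diagonal entries are the positives
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```
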
## Todo list
|Job|status|date|
|-|-|-|
|retriever pretraining|done, acc 88.47%|11/19|
|prefix tuning forward|done|10/31|
|prefix tuning back propagation|compatibility issues between quantization and backpropagation; changed to LLaMA 7B without quantization|11/13|
|prefix optimization|done|11/23|
|document segmentation|done, 10 million segments|11/15|
|build document embedding|done (running time 12 hr)|11/21|
|efficient document search|test done, to be integrated; waiting for document embedding|11/19|
|training algorithm|done|11/13|
|integrate doc embedding and search method|ongoing|---|
|benchmark|not started yet; will use the benchmarks provided by the dataset|---|
|fairness loss term on objective|add a debiasing layer on the output of LLaMA 2, based on [paper](/KWdhaUWBSXONrZRd5RQdWw)|---|
## efficient document search
[Maximum Inner Product Search](https://zhuanlan.zhihu.com/p/111502331)
We have a set $X$ of $d$-dimensional vectors and a query vector $q$. We want to find the vector $p \in X$ with the maximum inner product with $q$:
$$
p = \arg \max_{x \in X} x^\top q
$$
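The brute-force baseline is a single matrix product followed by a top-k selection; a minimal NumPy sketch:

```python
import numpy as np

def mips_bruteforce(X, q, k=1):
    """Exact maximum inner product search: score every vector, take the top-k."""
    scores = X @ q                  # (N,) inner products against the query
    topk = np.argsort(-scores)[:k]  # indices of the k largest scores
    return topk, scores[topk]
```

This is exact but scans all N vectors per query, which is what Faiss's approximate indexes below avoid.
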
#### [Faiss](https://github.com/facebookresearch/faiss)
[Have you considered using a vector database like faiss?](https://datascience.stackexchange.com/questions/124615/fastest-way-to-do-maximum-inner-product-search?noredirect=1#comment124438_124615)
Faiss is a library for efficient similarity search and clustering of dense vectors.
Here, we use `IndexIVFFlat`, a Faiss index that combines an inverted file structure with flat (exact) storage inside each cluster, enabling efficient approximate nearest-neighbor search.
- `nlist`: the number of clusters
- `nprobe`: the number of clusters to search
```python
import torch
import time
import faiss

# global settings: number of documents and number of IVF clusters/probes
matrix_size = 5000000
probe_size = 100        # used both as nlist (number of clusters) and as nprobe
search_times = 10

def build_index():
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    dim_size = 768
    batch_size = 100000  # adjust based on available memory

    # create the IVF index once, outside the loop (nlist = probe_size clusters)
    index = faiss.IndexIVFFlat(faiss.IndexFlatIP(dim_size), dim_size, probe_size)

    start = time.time()
    for i in range(0, matrix_size, batch_size):
        # generate a random batch of document embeddings with torch
        matrix = torch.randn(batch_size, dim_size, dtype=torch.float32).to(device)
        # faiss consumes numpy arrays, so move the batch back to CPU
        matrix = matrix.cpu()
        index.train(matrix.numpy())  # train the coarse quantizer on this batch
        index.add(matrix.numpy())    # add the batch to the index
    index.nprobe = probe_size
    print("Index size:", index.ntotal)
    end = time.time()
    print("Time taken:", end - start)
    faiss.write_index(index, "accumulated_index.index")

def search_query():
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    # read back the persisted index
    index = faiss.read_index("accumulated_index.index")
    start = time.time()
    # create a random query vector and search, repeated search_times times
    for i in range(search_times):
        query = torch.randn(1, 768, dtype=torch.float32).to(device)
        index.nprobe = probe_size
        query = query.cpu()
        D, I = index.search(query.numpy(), 10)
        print("Distance:", D)
        print("Index:", I)
    end = time.time()
    print("Time taken:", end - start)
    print("Done")

def inner():
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    print("Doing inner product on device:", device)
    start = time.time()
    # brute force baseline: multiply a (matrix_size x 768) matrix by a query vector
    for i in range(search_times):
        matrix1 = torch.randn(matrix_size, 768, dtype=torch.float32, device=device)
        matrix2 = torch.randn(768, 1, dtype=torch.float32, device=device)
        result = torch.mm(matrix1, matrix2)
        print("Result:", result)
    end = time.time()
    print("Time taken:", end - start)
    print("Done")

if __name__ == '__main__':
    inner()
    build_index()
    search_query()
```
#### Result
Time to search for 1 document (seconds)
|# Documents / Method|Inner Product|Faiss (nlist = 100)|Faiss (nlist = $\sqrt{\text{# Documents}}$)|
|--|--|--|--|
|1000000|5.067 $\sim$ 5.155|(6.511) + 0.853|(13.499) + 0.824|
|2000000|8.796 $\sim$ 8.833|(14.209) + 1.773|(28.576) + 1.665|
|3000000|12.991 $\sim$ 13.704|(20.485) + 2.592|(45.360) + 2.563|
|4000000|17.470 $\sim$ 17.956|(28.419) + 3.379|(63.009) + 4.574|
|5000000|21.514 $\sim$ 22.932|(33.143) + 10.499|Shutdown|
Time to search for 10 documents (seconds)
|# Documents / Method|Inner Product|Faiss (nlist = 100)|Faiss (nlist = $\sqrt{\text{# Documents}}$)|
|--|--|--|--|
|1000000|45.986|(6.420) + 8.583|(13.516) + 8.579|
|2000000|89.593|(13.318) + 17.162|(28.790) + 17.132|
|3000000|135.469|(18.939) + 25.788|(45.298) + 25.501|
|4000000|178.897|(26.638) + 34.471|(62.736) + 34.752|
<iframe src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTc3skIQdklVh2EAMMtEj68mkuMcmU1Z4hZR9tT4ZM-vp6HQVuLGD9EutMwnJMHYOEjeaJvjTcJgMLn/pubhtml?widget=true&headers=false" width="100%" height="640"></iframe>
## Combination
### Combining Faiss into our system
- code finished
- some bugs:
    - running Faiss on 10^7 segments requires a very large amount of RAM (>64 GB)
- solution:
    - reduce the training data
## https://docs.google.com/document/d/1NXbdYwY4hvqgnWiy3xmD8rOioOd0XsmtAgUvYBygTqM/edit
# 12/4
## Finish combination
Start training.
| epoch | document segment (512 tokens) | data sample | split size | topk | device | lr | batch size | lr_scheduler | optimizer | final loss | training time | train EM acc | train token acc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | 2×10^5 | 10^5 | 0.95/0.05 | 1 | 4090 | 1e-5 | 2 | None | AdamW(weight_decay=0.01) | 0.927 | 12 hr | 0.2981 | 0.7799 |
- result
| method |validation EM acc | bert score |
| -------- | -------- | -------- |
| Ours | 0.2636 | 0.6717 |
| No retrieval (only llama) | 0.0400 | 0.4106 |
[Question Answering on Natural Questions (Benchmark2)](https://paperswithcode.com/sota/question-answering-on-natural-questions)



## TODO
- compare with state-of-the-art methods
    - prefix tuning
    - ...
- ablation studies
    - contriever without fine-tuning
    - ...
# NEXT update
## Retrain contriever / knowledge encoder and LLaMA
#### The contriever training data included part of the validation data for the knowledge encoder and LLaMA
#### random_seed=708
- contriever
| epoch | device | data sample | split size | lr | batch size | lr_scheduler | optimizer | final loss | training time | validation top-5 acc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5 | 4090 | 98708 | 0.95/0.05 | 3e-5 | 32 | ReduceLROnPlateau(mode='min', factor=0.5, patience=20, cooldown=20, min_lr=1e-5) | AdamW(lr=3e-5, weight_decay=1e-2) | 0.0472 | 23×5 min | 0.9091 |
- knowledge encoder and llama
<!-- | epoch | document segment (512 token)|data sample|spilt size|topk | head | device | lr | batch size | lr_scher |optimizer | final loss| training time | train EM acc |train token acc
| -------- | -------- | -------- | -------- | -------- | -------- |-------- |-------- |-------- |--------|--------|--------|--------|--------|--------
| 5 |2*10^5| 98708|0.95 /0.05 |1 |4| 4090 | 1e-5 | 2| None | AdamW(weight_decay=0.01,) |0.92|8*5hr| 0.29| 0.77|
- result
| method |validation EM acc | bert score |
| -------- | -------- | -------- |
| Ours | 0.2607 | 0.6665 |
| No retrieval (only llama) | 0.0400 | 0.4106 |
-->
| epoch | document segment (512 tokens) | data sample | split size | topk | head | device | lr | batch size | lr_scheduler | optimizer | final loss | training time | train EM acc | train token acc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 6 | 2×10^5 | 98708 | 0.95/0.05 | 1 | 2 | 4090 | 2e-5 | 2 | None | AdamW(weight_decay=0.01) | 0.897 | 8×6 hr | 0.3173 | 0.7882 |
- result
| method |validation EM acc | bert score |
| -------- | -------- | -------- |
| Ours | 0.2620 | 0.6673 |
| No retrieval (only llama) | 0.0286 | 0.3999 |
## Temporary Metric
**BLEURT: a Transfer Learning-Based Metric for Natural Language Generation**
## Ideas for novelty
1. 志軒: must add segment-level self-supervised learning on the retriever
2. hallucination metric on model generation (for inference)
3. hallucination metric on the model output distribution (for training with teacher forcing)
4. make it differentiable???
## System optimization
1. consider reducing the embedding dimension and the segment length
2. modify the knowledge encoder architecture, since 10 prefix vectors are enough
# [Overleaf link](https://www.overleaf.com/7754299761tmdpwkxcbkdf#0b5d95)
# [NLG PPT link](https://nycu1-my.sharepoint.com/:p:/g/personal/present90308_ee11_m365_nycu_edu_tw/EbD5ocAF2AZNgqyfzxriSpABEF72y75poGMSj8sDlIsgOw?rtime=U8cmeyz620g&nav=eyJzSWQiOjI1NiwiY0lkIjoxMDk4NTcyMjJ9)
## current system implement detail
### Pre-train
### doc build