<!-- #### Meeting Schedule
<iframe src="https://calendar.google.com/calendar/embed?src=nycumllab%40gmail.com&ctz=Asia%2FTaipei" style="border: 0" width="700" height="400" frameborder="0" scrolling="no"></iframe>
-->
# Slide
[0429](https://docs.google.com/presentation/d/1AFw65eoL2MY0rt738oOKzOjmp4OTVIB0KiF522I5I5U/edit?usp=sharing)
# NLG PROGRESS
# Papers we reference and follow
[paper list on github](https://github.com/Timothyxxx/RetrivalLMPapers)
# [REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/pdf/2002.08909.pdf)
- Google Research
- ICML 2020
- Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, Ming-Wei Chang

$$
\nabla\log p(y|x)=\sum_{z\in\mathcal Z}r(z,x)p(z|x)\nabla f(x,z)\\
r(z,x)=\left[\frac{p(y|z,x)}{\mathbb E[p(y|z,x)]}-1\right]
$$
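For context, the gradient identity above follows from differentiating the marginal likelihood $p(y|x)=\sum_{z}p(y|z,x)\,p(z|x)$ with $p(z|x)\propto\exp f(x,z)$; sketching the steps in the paper's notation:

$$
\nabla p(z|x)=p(z|x)\Big(\nabla f(x,z)-\sum_{z'\in\mathcal Z}p(z'|x)\nabla f(x,z')\Big)
$$

$$
\nabla\log p(y|x)=\sum_{z\in\mathcal Z}\frac{p(y|z,x)}{p(y|x)}\nabla p(z|x)=\sum_{z\in\mathcal Z}p(z|x)\left[\frac{p(y|z,x)}{p(y|x)}-1\right]\nabla f(x,z)
$$

where $p(y|x)=\mathbb E_{z\sim p(z|x)}[p(y|z,x)]$, which matches the definition of $r(z,x)$ above.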
# [Decoupled Context Processing for Context Augmented Language Modeling](https://arxiv.org/abs/2210.05758)
- Google Research
- Zonglin Li, Ruiqi Guo, Sanjiv Kumar
- NeurIPS 2022

# [Decoupling Knowledge from Memorization: Retrieval-augmented Prompt Learning](https://arxiv.org/abs/2205.14704)
- NeurIPS 2022 (Spotlight)
- Xiang Chen, Lei Li, Ningyu Zhang, Xiaozhuan Liang, Shumin Deng, Chuanqi Tan, Fei Huang, Luo Si, Huajun Chen
- Alibaba Group

# [Unsupervised Dense Information Retrieval with Contrastive Learning](https://arxiv.org/pdf/2112.09118.pdf)
- TMLR 2022
- Meta AI Research



# Our Idea
- Implement [Decoupling Knowledge from Memorization: Retrieval-augmented Prompt Learning](https://arxiv.org/abs/2205.14704)
- Using Contriever and LLaMA
- Fine-tune the knowledge encoder

* add VICReg criterion
* add trustworthiness
* dataset issues
* knowledge context too long
# Task Assignment
**志軒: leader**
Requirements:
- Dataset splitting (和宗)
- Prefix tuning coding (志軒)
- Contriever fine-tuning (needed or not?) (宇喆)
- Adding regularization on the encoder output (fairness?) (伯鈞)
- Website development (jackson)
Bonus:
- Multiple prompt tuning? (jackson) very slow
## Contriever fine-tuning
### VICReg
- Three hyperparameters need tuning
- A positive pair is a question and its long answer
| epoch | device | lr | batch size | lr_scheduler | optimizer | final loss | training time |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 200 | 3080 Ti | 1e-4 | 64 | ReduceLROnPlateau(mode='min', factor=0.2, patience=10, cooldown=10, min_lr=1e-6) | AdamW(lr=1e-5, weight_decay=1e-4) | 17.8004 | 94.4 hr |
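For reference, a minimal NumPy sketch of the VICReg objective with its three loss coefficients (the three hyperparameters mentioned above); the coefficient values follow common defaults and are assumptions, not the values used in our runs:

```python
import numpy as np

def vicreg_loss(z1, z2, sim_coef=25.0, var_coef=25.0, cov_coef=1.0):
    """VICReg: invariance + variance + covariance terms.

    z1, z2: (n, d) embeddings of the two views (e.g. question / long answer).
    The three coefficients are the hyperparameters that need tuning.
    """
    n, d = z1.shape
    # invariance term: mean squared distance between the two views
    sim_loss = np.mean((z1 - z2) ** 2)
    # variance term: hinge loss keeping each dimension's std above 1
    std1 = np.sqrt(z1.var(axis=0) + 1e-4)
    std2 = np.sqrt(z2.var(axis=0) + 1e-4)
    var_loss = np.mean(np.maximum(0, 1 - std1)) + np.mean(np.maximum(0, 1 - std2))
    # covariance term: push off-diagonal covariance entries toward zero
    z1c = z1 - z1.mean(axis=0)
    z2c = z2 - z2.mean(axis=0)
    cov1 = (z1c.T @ z1c) / (n - 1)
    cov2 = (z2c.T @ z2c) / (n - 1)
    off_diag = lambda m: m[~np.eye(d, dtype=bool)]
    cov_loss = (off_diag(cov1) ** 2).sum() / d + (off_diag(cov2) ** 2).sum() / d
    return sim_coef * sim_loss + var_coef * var_loss + cov_coef * cov_loss
```
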
## About prefix tuning

There are **40 layers** in LLaMA; each layer needs keys and values, and the shape of a key/value is (40, n, 128): **40 heads** with **128 dimensions** each. The final shape is therefore **(40, 2, B, 40, n, 128)**, where n is the document length and B is the batch size.
In the knowledge encoder, we use **80 different heads** to compute the keys and values for the different layers.
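To make the tensor bookkeeping concrete, a small shape-only sketch; only the dimensions come from the description above, while the encoder output layout and the reshape/transpose order are illustrative assumptions:

```python
import numpy as np

n_layers, n_heads, head_dim = 40, 40, 128  # LLaMA config from the notes
B, n = 2, 16                               # batch size, document length (example values)

# the knowledge encoder's 80 output heads: one key head + one value head per layer
encoder_out = np.random.randn(2 * n_layers, B, n, n_heads * head_dim)

# rearrange into the prefix cache shape (n_layers, 2, B, n_heads, n, head_dim)
prefix = (encoder_out
          .reshape(n_layers, 2, B, n, n_heads, head_dim)
          .transpose(0, 1, 2, 4, 3, 5))
assert prefix.shape == (40, 2, B, 40, n, 128)
```
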
(10/29) Finished concatenating the knowledge encoder and LLaMA (dynamic prefix) at inference time.
(10/30, bug fixed) There were still some bugs in batch training when document lengths differ; ~~should be handled in the collate function~~, handled on the model side instead.
(10/31) Finished all code for the dynamic prefix.
(future work) Finish the learning algorithm.
* calculate the cross-entropy loss of the LM output, then backpropagate gradients to the encoder
$$\mathcal L(\theta)=-\sum_{t=0}^{T}\log P_{LM}(y_t|y_0,\dots,y_{t-1},q,\text{Enc}_\theta(z))
$$
$$z=\underset{z\in\mathcal Z}{\arg\max}\,\text{Emb}(q)\cdot\text{Emb}(z)
$$
* fairness loss on the encoder output
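The objective above is a standard token-level cross entropy; a minimal NumPy sketch (the logits and targets here are placeholders, not real LLaMA outputs):

```python
import numpy as np

def lm_cross_entropy(logits, targets):
    """Token-level cross entropy: -sum_t log P_LM(y_t | y_<t, q, Enc(z)).

    logits: (T, V) unnormalized scores for each position, targets: (T,) token ids.
    """
    # log-softmax with the usual max-subtraction for numerical stability
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # pick out the log-probability of each target token and negate the sum
    return -log_probs[np.arange(len(targets)), targets].sum()
```
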
## About dataset preparation
Cut documents into segments, with a **288-token** window size and a **64-token** step size.
(10/30) Document segmentation is almost finished; some I/O optimization remains, rewriting the code to be multithreaded.
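The sliding-window segmentation can be sketched as follows (a minimal single-threaded version; the real pipeline operates on tokenized documents and is multithreaded):

```python
def segment_tokens(tokens, window=288, step=64):
    """Cut a token sequence into overlapping segments
    (288-token window, 64-token step, as described above)."""
    if len(tokens) <= window:
        return [tokens]
    segments = []
    for start in range(0, len(tokens), step):
        segments.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window reaches the end of the document
    return segments
```
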
<!-- ## Dynamic Prefix with Fairness Enhancement for Retrieval-augmented Question Answering -->
## 11/13
Some problems were discovered when integrating the system:
* Storing the entire document embedding matrix in memory: there are 8 million documents, each converted into a 768-dimensional vector, for a total memory usage of about 26 GB.
* Each module is complete; the remaining integration work is dealing with the various bugs that surfaced after integration:
    * the document set is not available when integrating the retriever
    * the knowledge encoder was not updated when trained with LLaMA
    * we cannot use all GPU resources during training
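The memory figure can be sanity-checked with simple arithmetic (assuming float32 embeddings; exactly 8 million vectors gives a bit under the quoted 26 GB, which corresponds to slightly more documents):

```python
n_docs = 8_000_000
dim = 768
bytes_per_float = 4  # float32

total_gb = n_docs * dim * bytes_per_float / 1e9
print(f"{total_gb:.1f} GB")  # prints 24.6 GB
```
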
# 11/21
## Contriever bug fixes
- VICReg would not train properly (validation top-5 acc 0.7, getting worse with longer training) → changed to InfoNCE
- tokenizer max length for long answers: 128 → 512
- long answer candidate → true long answer
- some long answers are None → skip them in the collate function
## InfoNCE
| epoch | device | lr | batch size | lr_scheduler | optimizer | final loss | training time | validation top-5 acc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 20 | 4090 | 3e-5 | 32 | ReduceLROnPlateau(mode='min', factor=0.5, patience=20, cooldown=20, min_lr=1e-5) | AdamW(lr=3e-5, weight_decay=1e-2) | 0.0145 | 46×20 min | 0.8847 |
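For reference, a minimal NumPy sketch of InfoNCE with in-batch negatives: each question's positive is the same-index long answer, and the other answers in the batch serve as negatives. The temperature value is illustrative, not the one used in training:

```python
import numpy as np

def info_nce(q, d, temperature=0.05):
    """InfoNCE over a batch of (question, long answer) embedding pairs.

    q, d: (B, dim) embeddings; row i of d is the positive for row i of q.
    """
    # cosine similarity: normalize, then take the (B, B) dot-product matrix
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature
    # row-wise log-softmax; the diagonal entries are the positives
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```
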
## Todo list
|Job|status|date|
|-|-|-|
|retriever pretraining|done, acc 88.47%|11/19|
|prefix tuning forward|done|10/31|
|prefix tuning back propagation|compatibility issues between quantization and backpropagation; changed to LLaMA 7B without quantization|11/13|
|prefix optimization|done|11/23|
|document segmentation|done, 10 million segments|11/15|
|build document embedding|done (running time 12 hr)|11/21|
|efficient document search|test done, to be integrated; waiting for document embedding|11/19|
|training algorithm|done|11/13|
|integrate doc embedding and search method|ongoing|---|
|benchmark|not started yet; will use the benchmarks provided by the dataset|---|
|fairness loss term on objective|add a debiasing layer on the output of LLaMA 2, based on [paper](/KWdhaUWBSXONrZRd5RQdWw)|---|
## efficient document search
[Maximum Inner Product Search](https://zhuanlan.zhihu.com/p/111502331)
We have a set $X$ of $d$-dimensional vectors and a query vector $q$. We want to find the vector $p \in X$ with the maximum inner product with $q$:
$$
p = \arg \max_{x \in X} x^\top q
$$
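The brute-force baseline is a single matrix product followed by a top-k selection; a minimal NumPy sketch:

```python
import numpy as np

def mips_bruteforce(X, q, k=1):
    """Exact maximum inner product search: score every vector, take the top-k."""
    scores = X @ q                  # (N,) inner products against the query
    topk = np.argsort(-scores)[:k]  # indices of the k largest scores
    return topk, scores[topk]
```

This is exact but scans all N vectors per query, which is what Faiss's approximate indexes below avoid.
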
#### [Faiss](https://github.com/facebookresearch/faiss)
[Have you considered using a vector database like faiss?](https://datascience.stackexchange.com/questions/124615/fastest-way-to-do-maximum-inner-product-search?noredirect=1#comment124438_124615)
Faiss is a library for efficient similarity search and clustering of dense vectors.
Here, we use `IndexIVFFlat`, a Faiss index that combines an inverted file structure with flat (exact) storage inside each cluster, enabling efficient approximate nearest-neighbor search.
- `nlist`: the number of clusters
- `nprobe`: the number of clusters to search
```python
import torch
import time
import faiss

# global settings: number of documents and number of IVF clusters/probes
matrix_size = 5000000
probe_size = 100        # used both as nlist (number of clusters) and as nprobe
search_times = 10

def build_index():
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    dim_size = 768
    batch_size = 100000  # adjust based on available memory

    # create the IVF index once, outside the loop (nlist = probe_size clusters)
    index = faiss.IndexIVFFlat(faiss.IndexFlatIP(dim_size), dim_size, probe_size)

    start = time.time()
    for i in range(0, matrix_size, batch_size):
        # generate a random batch of document embeddings with torch
        matrix = torch.randn(batch_size, dim_size, dtype=torch.float32).to(device)
        # faiss consumes numpy arrays, so move the batch back to CPU
        matrix = matrix.cpu()
        index.train(matrix.numpy())  # train the coarse quantizer on this batch
        index.add(matrix.numpy())    # add the batch to the index
    index.nprobe = probe_size
    print("Index size:", index.ntotal)
    end = time.time()
    print("Time taken:", end - start)
    faiss.write_index(index, "accumulated_index.index")

def search_query():
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    # read back the persisted index
    index = faiss.read_index("accumulated_index.index")
    start = time.time()
    # create a random query vector and search, repeated search_times times
    for i in range(search_times):
        query = torch.randn(1, 768, dtype=torch.float32).to(device)
        index.nprobe = probe_size
        query = query.cpu()
        D, I = index.search(query.numpy(), 10)
        print("Distance:", D)
        print("Index:", I)
    end = time.time()
    print("Time taken:", end - start)
    print("Done")

def inner():
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    print("Doing inner product on device:", device)
    start = time.time()
    # brute force baseline: multiply a (matrix_size x 768) matrix by a query vector
    for i in range(search_times):
        matrix1 = torch.randn(matrix_size, 768, dtype=torch.float32, device=device)
        matrix2 = torch.randn(768, 1, dtype=torch.float32, device=device)
        result = torch.mm(matrix1, matrix2)
        print("Result:", result)
    end = time.time()
    print("Time taken:", end - start)
    print("Done")

if __name__ == '__main__':
    inner()
    build_index()
    search_query()
```
#### Result
Time to search for 1 document (seconds)
|# Documents / Method|Inner Product|Faiss (nlist = 100)|Faiss (nlist = $\sqrt{\text{# Documents}}$)|
|--|--|--|--|
|1000000|5.067 $\sim$ 5.155|(6.511) + 0.853|(13.499) + 0.824|
|2000000|8.796 $\sim$ 8.833|(14.209) + 1.773|(28.576) + 1.665|
|3000000|12.991 $\sim$ 13.704|(20.485) + 2.592|(45.360) + 2.563|
|4000000|17.470 $\sim$ 17.956|(28.419) + 3.379|(63.009) + 4.574|
|5000000|21.514 $\sim$ 22.932|(33.143) + 10.499|Shutdown|
Time to search for 10 documents (seconds)
|# Documents / Method|Inner Product|Faiss (nlist = 100)|Faiss (nlist = $\sqrt{\text{# Documents}}$)|
|--|--|--|--|
|1000000|45.986|(6.420) + 8.583|(13.516) + 8.579|
|2000000|89.593|(13.318) + 17.162|(28.790) + 17.132|
|3000000|135.469|(18.939) + 25.788|(45.298) + 25.501|
|4000000|178.897|(26.638) + 34.471|(62.736) + 34.752|
<iframe src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTc3skIQdklVh2EAMMtEj68mkuMcmU1Z4hZR9tT4ZM-vp6HQVuLGD9EutMwnJMHYOEjeaJvjTcJgMLn/pubhtml?widget=true&headers=false" width="100%" height="640"></iframe>
## Combination
### Combining Faiss into our system
- code finished
- some bugs:
    - running Faiss on 10^7 segments requires a very large amount of RAM (>64 GB)
- solution:
    - reduce the training data
## https://docs.google.com/document/d/1NXbdYwY4hvqgnWiy3xmD8rOioOd0XsmtAgUvYBygTqM/edit
# 12/4
## Finish combination
Start training.
| epoch | document segment (512 tokens) | data sample | split size | topk | device | lr | batch size | lr_scheduler | optimizer | final loss | training time | train EM acc | train token acc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | 2×10^5 | 10^5 | 0.95/0.05 | 1 | 4090 | 1e-5 | 2 | None | AdamW(weight_decay=0.01) | 0.927 | 12 hr | 0.2981 | 0.7799 |
- result
| method |validation EM acc | bert score |
| -------- | -------- | -------- |
| Ours | 0.2636 | 0.6717 |
| No retrieval (only llama) | 0.0400 | 0.4106 |
[Question Answering on Natural Questions (Benchmark2)](https://paperswithcode.com/sota/question-answering-on-natural-questions)



## TODO
- compare with state-of-the-art methods
    - prefix tuning
    - ...
- ablation studies
    - contriever without fine-tuning
    - ...
# NEXT update
## Retrain contriever / knowledge encoder and LLaMA
#### The contriever training data included part of the validation data for the knowledge encoder and LLaMA
#### random_seed=708
- contriever
| epoch | device | data sample | split size | lr | batch size | lr_scheduler | optimizer | final loss | training time | validation top-5 acc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5 | 4090 | 98708 | 0.95/0.05 | 3e-5 | 32 | ReduceLROnPlateau(mode='min', factor=0.5, patience=20, cooldown=20, min_lr=1e-5) | AdamW(lr=3e-5, weight_decay=1e-2) | 0.0472 | 23×5 min | 0.9091 |
- knowledge encoder and llama
<!-- | epoch | document segment (512 token)|data sample|spilt size|topk | head | device | lr | batch size | lr_scher |optimizer | final loss| training time | train EM acc |train token acc
| -------- | -------- | -------- | -------- | -------- | -------- |-------- |-------- |-------- |--------|--------|--------|--------|--------|--------
| 5 |2*10^5| 98708|0.95 /0.05 |1 |4| 4090 | 1e-5 | 2| None | AdamW(weight_decay=0.01,) |0.92|8*5hr| 0.29| 0.77|
- result
| method |validation EM acc | bert score |
| -------- | -------- | -------- |
| Ours | 0.2607 | 0.6665 |
| No retrieval (only llama) | 0.0400 | 0.4106 |
-->
| epoch | document segment (512 tokens) | data sample | split size | topk | head | device | lr | batch size | lr_scheduler | optimizer | final loss | training time | train EM acc | train token acc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 6 | 2×10^5 | 98708 | 0.95/0.05 | 1 | 2 | 4090 | 2e-5 | 2 | None | AdamW(weight_decay=0.01) | 0.897 | 8×6 hr | 0.3173 | 0.7882 |
- result
| method |validation EM acc | bert score |
| -------- | -------- | -------- |
| Ours | 0.2620 | 0.6673 |
| No retrieval (only llama) | 0.0286 | 0.3999 |
## Temporary Metric
**BLEURT: a Transfer Learning-Based Metric for Natural Language Generation**
## Ideas for novelty
1. 志軒: must add segment-level self-supervised learning on the retriever
2. hallucination metric on model generation (for inference)
3. hallucination metric on the model output distribution (for training with teacher forcing)
4. make it differentiable???
## System optimization
1. consider reducing the embedding dimension and the segment length
2. modify the knowledge encoder architecture, since 10 prefix vectors are enough
# [Overleaf link](https://www.overleaf.com/7754299761tmdpwkxcbkdf#0b5d95)
# [NLG PPT link](https://nycu1-my.sharepoint.com/:p:/g/personal/present90308_ee11_m365_nycu_edu_tw/EbD5ocAF2AZNgqyfzxriSpABEF72y75poGMSj8sDlIsgOw?rtime=U8cmeyz620g&nav=eyJzSWQiOjI1NiwiY0lkIjoxMDk4NTcyMjJ9)
## current system implement detail
### Pre-train
### doc build