[Large Language Models with Semantic Search] - ReRank<br>Course notes on combining large language models with semantic search: ReRank

### [Deeplearning.ai GenAI/LLM course notes series](https://learn.deeplearning.ai/)
#### [Large Language Models with Semantic Search](https://hackmd.io/@YungHuiHsu/rku-vjhZT)
- [Introduction, Keyword/lexical Search](https://hackmd.io/@YungHuiHsu/rku-vjhZT)
- [Embeddings](https://hackmd.io/@YungHuiHsu/SJKORzaWp)
- [Dense Retrieval](https://hackmd.io/@YungHuiHsu/Sk-hxS0-T)
- [ReRank](https://hackmd.io/@YungHuiHsu/HyT7uSJzT)
- [Generating Answers](https://hackmd.io/@YungHuiHsu/ry-Lv3kf6)
#### [Finetuning Large Language Models](https://hackmd.io/@YungHuiHsu/HJ6AT8XG6)

---

![](https://hackmd.io/_uploads/H1MGyukzT.png =800x)

# Large Language Models with Semantic Search

## [ReRank](https://learn.deeplearning.ai/large-language-models-semantic-search)

### Course overview

* The lesson explains how keyword search, dense retrieval, and ReRank work
* Keyword search tends to return results that are not relevant enough
* Dense retrieval tends to return results that are incorrect
* ReRank assigns a relevance score to each query-response pair, which is then used to sort the results
* ReRank can improve both keyword search and dense retrieval and help surface the correct answer
* How to evaluate the quality of a search system

#### Semantic similarity search alone is not enough to get the right answer (Dense Retrieval is also not perfect)

<div style="text-align: center;">
<figure>
<img src="https://hackmd.io/_uploads/Byls2Hkz6.png" width="600">
<figcaption>
<span style="color: #3F7FBF; font-weight: bold;">
<a href="https://learn.deeplearning.ai/large-language-models-semantic-search/lesson/4/dense-retrieval" target="_blank">Dense Retrieval is also not perfect</a>
</span>
</figcaption>
</figure>
</div>

* Dense retrieval maps both the query and the documents into an embedding space, then returns the documents closest to the query vector
* However, dense retrieval only looks at "similarity" in that vector space, so it does not necessarily return the correct answer
* For example, a sentence that does not answer the question can still lie close to the query in the embedding space
* In that case dense retrieval may return an incorrect result
* So dense retrieval can also return results that have little relevance to the query

#### Solution: ReRank the search results

<div style="text-align: center;">
<figure>
<img src="https://hackmd.io/_uploads/SyPsaUJMT.png" width="600">
<figcaption>
<span style="color: #3F7FBF; font-weight: bold;">
<a href="https://learn.deeplearning.ai/large-language-models-semantic-search/lesson/4/dense-retrieval" target="_blank">Solution: ReRank</a>
</span>
</figcaption>
</figure>
</div>

* The goal of ReRank is to find the most relevant answer among the retrieved results
* ReRank uses a large language model to assign each query-document pair (QA
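pairs) a relevance score
* A high score means the query and the document are closely related, so the document is likely to be the correct answer
* The retrieved results can therefore be sorted by the ReRank relevance score
* This makes it possible to pick the best answer out of the imperfect candidate set produced by dense retrieval

#### How Rerank gets trained

|         |        ReRank is trained on lots of QA pairs        |
| ------- |:---------------------------------------------------:|
| correct | ![](https://hackmd.io/_uploads/BklIpS1MT.png =400x) |
| wrong   | ![](https://hackmd.io/_uploads/r13UTByzp.png =400x) |

* The ReRank model is a pretrained model that has been trained with supervised learning
* It is given a large number of correct query-document pairs (QA pairs) and learns to score them high
* It is also given a large number of incorrect query-document pairs and learns to score them low
* Training maximizes the scores of correct pairs and minimizes the scores of incorrect pairs
* After training, ReRank can judge how relevant a document is to a query
* and can therefore sort retrieved results by relevance

:::info
The performance of a large language model (LLM) can be improved in several main ways:
* Larger-scale pretraining: pretraining on more, and more diverse, text data improves the model's understanding
* Fine-tuning: further training on labeled data from a specific domain so that the model adapts better to the downstream task
* Prompt learning: giving the model only a few example inputs and outputs so that it can carry out a specific task
* Knowledge distillation: transferring the essential knowledge of a large teacher model into a small student model
* Model architecture optimization: improving the architecture and parameters of the model
* Reinforcement learning: improving the model's policy through repeated interaction with an environment

The course describes ReRank as trained on a large number of labeled correct and incorrect query-document pairs. That is supervised training, so it is closest to fine-tuning in the list above, not to prompt learning, which relies on only a handful of examples.
:::

### Improving Keyword Search with ReRank

* Passing keyword search results to the ReRank model for sorting surfaces the answer most relevant to the query
* Keyword search finds documents that share vocabulary with the query, but those documents do not necessarily answer the question
* ReRank gives each query-response pair a relevance score
* With ReRank, the correct answer can be found among the keyword search results

##### Code implementation

- Environment setup
```python=
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

import cohere
co = cohere.Client(os.environ['COHERE_API_KEY'])

import weaviate
auth_config = weaviate.auth.AuthApiKey(
    api_key=os.environ['WEAVIATE_API_KEY'])
client = weaviate.Client(
    url=os.environ['WEAVIATE_API_URL'],
    auth_client_secret=auth_config,
    additional_headers={
        "X-Cohere-Api-Key": os.environ['COHERE_API_KEY'],
    })
```
- Calling the rerank model
    - This uses `co.rerank` from the cohere API, where a pretrained reranker model can be specified
    - `model = 'rerank-english-v2.0'`
```python=
def rerank_responses(query, responses, num_responses=10):
    # Score each response against the query and return the top_n by relevance
    reranked_responses = co.rerank(
        model = 'rerank-english-v2.0',
        query = query,
        documents = responses,
        top_n = num_responses,
        )
    return reranked_responses
```
- Inspecting the reranked results
    - keyword_search example
```python=
query_1 = "What is the capital of Canada?"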
# keyword_search is the helper from the earlier keyword-search lesson; it is not defined in this note
results = keyword_search(query_1,
                         client,
                         properties=["text", "title", "url", "views", "lang", "_additional {distance}"],
                         num_results=3)
texts = [result.get('text') for result in results]
reranked_text = rerank_responses(query_1, texts)

import pandas as pd
# Unpack each rerank result into columns: score, original index, document text
df_reranked = pd.DataFrame(reranked_text)
df_reranked['relevance_score'] = df_reranked.apply(lambda row: row.values[0].relevance_score, axis=1)
df_reranked['index'] = df_reranked.apply(lambda row: row.values[0].index, axis=1)
df_reranked['document'] = df_reranked.apply(lambda row: row.values[0].document['text'], axis=1)
df_reranked.iloc[:, 1:]
```
- The query results are now sorted by relevance score

![](https://hackmd.io/_uploads/S1MkvD1fp.png)

### Improving Dense Retrieval with ReRank

* Passing dense retrieval results to the ReRank model for sorting surfaces the most relevant answer
* Dense retrieval finds the responses closest to the query in the embedding space, but they are not necessarily correct
* ReRank picks the correct answer out of the dense retrieval results
* ReRank gives each query-response pair a relevance score, and the highest-scoring one is usually the correct answer

##### Code implementation

- dense_retrieval example
```py=
from utils import dense_retrieval
query_2 = "Who is the tallest person in history?"
results = dense_retrieval(query_2, client)
texts = [result.get('text') for result in results]
reranked_text = rerank_responses(query_2, texts)
df_reranked = pd.DataFrame(reranked_text)
# (omitted) organize with pandas in the same way as above...
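```
![](https://hackmd.io/_uploads/SyvQuvyfT.png)

### Evaluating Search/Recommendation Systems

Common metrics for evaluating search and recommendation systems

| Metric | Pros | Cons | When to use |
|-|-|-|-|
| Mean Average Precision<br>(MAP) | - Considers precision across the whole recall range<br>- Averages performance over all queries | - Strongly affected by individual queries<br>- Insensitive to partially relevant results | - The relevance of every retrieved result must be known<br>- Evaluating overall system performance |
| Mean Reciprocal Rank<br>(MRR) | - Simple and direct<br>- Measures how quickly the first relevant result appears | - Only considers the first relevant result<br>- Insensitive to later results | - The focus is on how early a relevant result appears<br>- Only some relevant results need to be known |
| Normalized Discounted<br>Cumulative Gain<br>(NDCG) | - Considers both relevance and rank<br>- Applies a position-dependent discount | - A discount factor must be chosen<br>- Relatively abstract | - Relevance scores for the results are required<br>- Evaluating ranking quality |

[Notes on RAG retrieval evaluation metrics - RAG Triad of metrics](https://hackmd.io/@YungHuiHsu/H16Y5cdi6)

- **Mean Average Precision (MAP)**
  In fields such as information retrieval, machine learning, and computer vision, MAP is a common metric for evaluating a system or model, especially for ranking problems and multi-label classification problems. In multi-label classification, MAP is usually computed from the Precision-Recall curve.
    - Precision
        - (predicted positive and actually positive) / (predicted positive)
        - Given a ranked result list (the x-axis can be recall, or the position in the query results), precision is the fraction of correct results up to a given position.
    - Recall
        - (predicted positive and actually positive) / (actually positive)

<div style="text-align: center;">
<figure>
<img src="https://hackmd.io/_uploads/B1YN5d1Mp.png" width="600">
<figcaption>
<span style="color: #3F7FBF; font-weight: bold;">
<a href="Calculation of Precision， Recall and Accuracy in the confusion matrix" target="_blank">Calculation of Precision, Recall and Accuracy in the confusion matrix</a>
</span>
</figcaption>
</figure>
</div>

<div style="text-align: center;">
<figure>
<img src="https://hackmd.io/_uploads/SkJ7KOJGa.png" width="600">
<figcaption>
<span style="color: #3F7FBF; font-weight: bold;">
<a href="https://www.researchgate.net/publication/367251324_A_Survey_of_Advanced_Computer_Vision_Techniques_for_Sports?_tp=eyJjb250ZXh0Ijp7ImZpcnN0UGFnZSI6Il9kaXJlY3QiLCJwYWdlIjoiX2RpcmVjdCJ9fQ" target="_blank">Visualization of the AP metric, estimated by the area below the Precision-Recall curve at different confidence intervals.</a>
</span>
</figcaption>
</figure>
</div>

When MAP is computed for information retrieval or recommendation systems, the x-axis is the position in the query/recommendation results instead of recall.

Computation steps:
* For the returned result list, compute the precision at each recall level
* Plot precision against recall to form the Precision-Recall curve
* Compute the area under the curve, which gives the average precision of that query （Average Precision,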
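AP）.

In search and recommendation systems, average precision can be written as:

$$
AP = \frac{1}{R} \sum_{k=1}^{n} P(k) \cdot rel(k)
$$

where $P(k)$ is the precision over the top $k$ results, $rel(k)$ indicates whether the $k$-th result is relevant ($1$ for relevant, $0$ for not relevant), $n$ is the total number of returned results, and $R$ is the number of relevant results, which keeps AP between 0 and 1.

Averaging the AP over all queries (in object detection, over all classes instead) gives the MAP of the whole system:

$$
MAP = \frac{1}{|Q|} \sum_{i=1}^{|Q|} AP(i)
$$

where $|Q|$ is the total number of queries and $AP(i)$ is the AP of the $i$-th query.

- **Mean Reciprocal Rank (MRR)**
    - The average of the reciprocal ranks over all queries.
    - For a single query, the reciprocal rank (RR) is computed as follows: if the first relevant result appears at rank $k$, the reciprocal rank is $1/k$; if there is no relevant result, the reciprocal rank is 0. That is:

$$
RR = \begin{cases} \frac{1}{rank} & \text{if a relevant document is retrieved} \\ 0 & \text{otherwise} \end{cases}
$$

The mean reciprocal rank is then the average of the reciprocal ranks of all queries:

$$
MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} RR(i)
$$

where $|Q|$ is the total number of queries and $RR(i)$ is the reciprocal rank of the $i$-th query. MRR lies between 0 and 1; a larger value means the system returns a relevant document earlier.

- Worked example

Suppose there are three queries whose first correct answers appear at rank 1, rank 3, and rank 2:
1. Query 1: the first correct answer is at rank 1, so the reciprocal is $\frac{1}{1} = 1$.
2. Query 2: the first correct answer is at rank 3, so the reciprocal is $\frac{1}{3} \approx 0.33$.
3. Query 3: the first correct answer is at rank 2, so the reciprocal is $\frac{1}{2} = 0.5$.

The MRR is therefore:

$$
\text{MRR} = \frac{1}{3} \left(1 + \frac{1}{3} + \frac{1}{2}\right) = \frac{1}{3} \left(1 + 0.33 + 0.5\right) \approx 0.61
$$

This means that across these three queries, the reciprocal rank of the first correct answer is about 0.61 on average. The closer MRR is to 1, the earlier the system returns the correct answer and the better it performs.

- **Normalized Discounted Cumulative Gain (NDCG)**
  NDCG takes into account both the relevance scores of the results and a rank-based weight (penalty): a highly relevant result that is ranked low is discounted. Recommended companion reading: ([2021.012。衛星。知乎。NDCG排序評估指標](https://zhuanlan.zhihu.com/p/448686098))

For a single query, NDCG is computed as follows:

* Cumulative Gain: each result has a relevance score $rel(i)$, taken in the order the results were returned
    * $p$ is the number of top results considered
    * $rel(i)$ is the relevance score of the $i$-th result

$$
Gain(i) = rel(i)
$$

* For the top $p$ results, the cumulative gain $CG@p$ is

$$
CG@p = rel(1) + ... + rel(p) = \sum_{i=1}^{p} rel(i)
$$

- CG simply sums the relevance scores and ignores the order of the returned results, which is why the discount factor is introduced next
- Discounted Cumulative Gain: takes rank into account so that items ranked near the top contribute more gain, while items ranked lower are discounted
    * $DCG@p$ is the sum of the discounted gains of the top $p$ results

$$
DCG@p = \sum_{i=1}^{p} \frac{rel(i)}{\log_{2}(i+1)}
$$

- $DCG@p$ also has an alternative form
    - Each $rel(i)$ is first transformed by $2^{rel(i)} - 1$, which widens the gap between relevance grades. When $rel(i)$ only takes the values 0 and 1, both forms give the same result.

$$
DCG@p = \sum_{i=1}^{p} \frac{2^{rel(i)} - 1}{\log_{2}(i+1)}
$$

* NDCG (Normalized DCG): DCG is normalized so that its maximum value is 1
    - IDCG is the discounted cumulative gain under the ideal ordering (Ideal Discounted Cumulative Gain). It is the highest DCG achievable for the current result set, obtained when the results are sorted perfectly by relevance score from high to low. IDCG is computed as follows:
        - Sort the current result set by relevance score $rel$ from high to low
        - Compute the DCG of this perfect ordering and use it as IDCG
    - NDCG lies between 0 and 1; a larger value means the ranking does a better job of returning relevant documents first

$$
IDCG@p = \max_{\pi \in perm(n)} DCG@p
$$

where the maximum is taken over all orderings $\pi$ of the $n$ results in the result set.

$$
NDCG@p = \frac{DCG@p}{IDCG@p}
$$

- Simulated values of the discount factor $\log_{2}(i+1)$ for ranks 1 to 100
    - The curve changes steeply over roughly the first 15 ranks, which helps separate the gain values of the top results

![](https://hackmd.io/_uploads/H1NqSn1za.png =300x)

```py=
import numpy as np

# Generate discounted values for positions from 1 to 100
positions = np.arange(1, 101)
discounted_values = np.log2(positions + 1)
```

---

## Reference

#### [2023.03。Sumit Kumar。Zero and Few Shot Text Retrieval and Ranking Using Large Language Models]
- InPars
![](https://hackmd.io/_uploads/rJqH3UJz6.png =600x)
- Unsupervised Passage Re-ranker (UPR)
![](https://hackmd.io/_uploads/BJwV3UJzp.png)

#### [2023.05。Jerry Liu。LlamaIndex。Using LLM’s for Retrieval and Reranking]
![](https://hackmd.io/_uploads/ryuojUkf6.png =800x)
Two-stage retrieval pipeline: 1) Top-k embedding retrieval, then 2) LLM-based reranking

#### [2022。Prakhar Mishra。paperspace.com。Prompt-based Learning Paradigm in NLP](https://blog.paperspace.com/prompt-based-learning-in-natural-language-processing/)
<div style="text-align: center;">
<figure>
<img src="https://hackmd.io/_uploads/S1u2QIJGT.png" width="600">
<figcaption>
<span style="color: #3F7FBF; font-weight: bold;">
<a href="https://arxiv.org/pdf/2107.13586v1.pdf?ref=blog.paperspace.com"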
target="_blank">Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing</a>
</span>
</figcaption>
</figure>
</div>

> Prompt-based learning the idea is to design input to fit the model

#### [2022.04。sergio。【NLP】Prompt Learning 超强入门教程](https://www.zhihu.com/tardis/zm/art/442486331?source_id=1003)
![](https://hackmd.io/_uploads/BkqSLI1Gp.png)
![](https://hackmd.io/_uploads/BJWIUUJfT.png)

#### [2023.03。Tullie Murrell。shaped.ai。Evaluating recommendation systems (mAP, MMR, NDCG)](https://www.shaped.ai/blog/evaluating-recommendation-systems-map-mmr-ndcg)
- Mean Average Precision (mAP)
![](https://hackmd.io/_uploads/rycZVukz6.png =400x)
- Mean Reciprocal Rank (MRR)
![](https://hackmd.io/_uploads/B11EVdJM6.png =400x)
- Normalized Discounted Cumulative Gain (NDCG)
![](https://hackmd.io/_uploads/rkAEVuJMp.png =400x)
![](https://hackmd.io/_uploads/rkoSVdJfT.png =400x)

#### [2021.012。衛星。知乎。NDCG排序評估指標](https://zhuanlan.zhihu.com/p/448686098)
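
As a supplement to the evaluation metrics discussed in this note (MAP, MRR, NDCG), the following is a minimal plain-Python sketch of how they could be computed. It is an illustration under stated assumptions, not code from the course: all function names are hypothetical, relevance labels are binary for AP and RR, AP is normalized by the number of relevant results that appear in the list, and DCG uses the $rel(i)/\log_2(i+1)$ form.

```python
import math

# Hypothetical helper names; a sketch of the metric definitions, not course code.

def average_precision(rels):
    """AP for one ranked list of binary labels (1 = relevant, 0 = not)."""
    hits = 0
    precision_sum = 0.0
    for k, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            precision_sum += hits / k  # P(k), counted only where rel(k) = 1
    # Assumption: normalize by the relevant results present in the list
    return precision_sum / hits if hits else 0.0

def mean_average_precision(rels_per_query):
    """MAP: mean of AP over all queries."""
    return sum(average_precision(r) for r in rels_per_query) / len(rels_per_query)

def reciprocal_rank(rels):
    """1 / rank of the first relevant result, or 0 if there is none."""
    for k, rel in enumerate(rels, start=1):
        if rel:
            return 1.0 / k
    return 0.0

def mean_reciprocal_rank(rels_per_query):
    """MRR: mean of RR over all queries."""
    return sum(reciprocal_rank(r) for r in rels_per_query) / len(rels_per_query)

def dcg_at_p(rels, p):
    """DCG@p with the rel(i) / log2(i + 1) discount."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels[:p], start=1))

def ndcg_at_p(rels, p):
    """NDCG@p: DCG@p divided by the DCG@p of the ideal (descending) ordering."""
    ideal = dcg_at_p(sorted(rels, reverse=True), p)
    return dcg_at_p(rels, p) / ideal if ideal > 0 else 0.0

# Three queries whose first relevant results are at ranks 1, 3, and 2
queries = [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
print(round(mean_reciprocal_rank(queries), 2))  # 0.61
```

The sample input reproduces the three-query MRR example from the evaluation section. Graded relevance scores such as `[3, 2, 1]` can be passed to `ndcg_at_p`, while `average_precision` and `reciprocal_rank` expect 0/1 labels.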