[Large Language Models with Semantic Search] - Introduction and Keyword/Lexical Search

### [AI / ML learning notes index](https://hackmd.io/@YungHuiHsu/BySsb5dfp)
### [Deeplearning.ai GenAI/LLM course notes](https://learn.deeplearning.ai/)
#### [Large Language Models with Semantic Search](https://hackmd.io/@YungHuiHsu/rku-vjhZT)
- [Introduction, Keyword/lexical Search](https://hackmd.io/@YungHuiHsu/rku-vjhZT)
- [Embeddings](https://hackmd.io/@YungHuiHsu/SJKORzaWp)
- [Dense Retrieval](https://hackmd.io/@YungHuiHsu/Sk-hxS0-T)
- [ReRank](https://hackmd.io/@YungHuiHsu/HyT7uSJzT)
- [Generating Answers](https://hackmd.io/@YungHuiHsu/ry-Lv3kf6)
#### [Finetuning Large Language Models](https://hackmd.io/@YungHuiHsu/HJ6AT8XG6)
#### [LangChain for LLM Application Development](https://hackmd.io/1r4pzdfFRwOIRrhtF9iFKQ)

---

# Large Language Models with Semantic Search

## [Introduction](https://learn.deeplearning.ai/large-language-models-semantic-search)

Instructors: Andrew Ng, Jay Alammar (author of [Hands-On Large Language Models](https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/)), and Luis Serrano

The course aims to teach you how to use large language models (LLMs) for information retrieval in your own applications.

<div style="text-align: center;">
<figure>
<img src="https://hackmd.io/_uploads/HyFIZxTZ6.png" alt="DS_plugin_metadata.png" width="400">
<figcaption>
<span style="color: #3F7FBF; font-weight: bold;">
<a href="https://learn.deeplearning.ai/large-language-models-semantic-search/lesson/1/introduction" target="_blank">Large Language Models with Semantic Search/introduction</a>
</span>
</figcaption>
</figure>
</div>

* How to use basic keyword search (lexical search), the search method commonly used before LLMs. It finds the most relevant documents by how well the words in the query match the words in each document.
* How to improve keyword search with re-ranking (re-rank), a method that uses LLMs to order the retrieved results by relevance.
* How to use a more advanced search method called **dense retrieval**, which searches on the semantic meaning of text. It relies on a powerful tool from natural language processing called embeddings, which turn each piece of text into a vector of numbers. Semantic search then finds the documents closest to the query in the embedding space.
* How to evaluate search algorithms effectively, and how to use LLMs to generate answers. This is a retrieve-then-generate approach: the relevant documents found by dense retrieval are used to compose the answer.

The course is produced jointly by Cohere and DeepLearning.AI.

#### Course outline

* Keyword vs. 
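Semantic Search
* Ranking Responses
* Embeddings
* Dense Retrieval
* Evaluation Methods
* Search-Powered LLMs

---

## Reflections after the course

Most of the hands-on work in this course goes through the Cohere API, which hides many of the details. Some parts are still done by hand, such as text chunking, which could also be combined with LangChain. For an ML engineer the content is rather shallow, although it builds a good introductory understanding; for a data scientist it adds one more light, handy dagger to the toolkit.

When I worked in computer vision, training models, generating embedding representations, and computing vector distances were still hand-written to some degree (at least with numpy and pytorch). There were many internal settings to consider, which in turn required a deeper understanding of the algorithm and the goal. These steps are now highly modularized, which should make commercial applications more convenient and demos faster to build. I do not yet know how far these packages can be customized, or whether they will be as hard to debug as Keras. In learning, the emphasis should still be on the underlying principles (depth) and a systematic understanding (breadth), so that these convenient tools become one of many options at hand.

> "When all you have is a hammer, everything looks like a nail." If your only tool is a hammer, it is easy to treat every problem as a nail.

---

## Keyword/lexical Search

### Summary

This section introduces the basic concepts and principles of keyword search. It first connects to Weaviate, an open-source database that holds 10 million Wikipedia paragraph records. It then works through an example of how keyword search operates: at query time, a relevance score is computed for each document from the keywords it shares with the query, and the best-matching results are returned. The section also discusses some limitations of keyword search, such as the difficulty of matching when the query and the text use entirely different words. Language models can improve search engines by understanding the meaning of text. Later lessons show how language models improve the two stages of search, retrieval and re-ranking. Finally, the course shows how large language models generate answers from search results.

- Lexical search vs semantic search

<table style="font-size:8px">

| |**Lexical search**|**Semantic search**|
|-|-|-|
|Definition|Relies on keyword matching and returns results that contain the specified keywords|Uses natural language processing to understand the user's search intent and returns results relevant to that intent|
|Example search engines|(Keyword-based search engines)<br>(Google, Bing)|(Semantic search engines)<br>(Google Discover, Microsoft Bing Insights)|
|Matching|Matches the exact keywords present in the document|Matches meaning at the concept level, so relevant content is found even when the keywords differ|
|Pros|Simple, direct, and easy to implement|Recognizes the same meaning in different wording; results are closer to the user's intent|
|Cons|Limited by the choice of keywords; may miss relevant information|More complex; requires natural language understanding|
|Use cases|Works well when the search need is simple and explicit|Works better when the search intent is abstract or conceptual|

</table>

Adapted from [2023.02, speakai.co, Lexical Search Vs Semantic Search](https://speakai.co/lexical-search-vs-semantic-search/)

### Course example code

#### Environment setup and database connection

* This code reads the API keys from environment variables and uses them to create a client connected to the Weaviate database
* The demo database linked from the course holds Wikipedia content: 10 million records across 10 languages

```python=
# !pip install cohere
# !pip install weaviate-client
import os
from dotenv import load_dotenv, find_dotenv
# Weaviate is an open-source vector database
import weaviate

# This code reads the API keys from environment variables
# and uses them to create a client connected to the Weaviate database.
# The Weaviate API key is wrapped in an auth_config object, which is
# used to authenticate the connection further down.

# Load the environment variables first, so the API keys are available below
_ = load_dotenv(find_dotenv())  # read the local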
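# .env file, then build the API-key credentials used to authenticate the connection
auth_config = weaviate.auth.AuthApiKey(
    api_key=os.environ['WEAVIATE_API_KEY'])

# The demo database linked from the course holds Wikipedia content:
# 10 million records across 10 languages
client = weaviate.Client(
    url=os.environ['WEAVIATE_API_URL'],
    auth_client_secret=auth_config,
    additional_headers={
        "X-Cohere-Api-Key": os.environ['COHERE_API_KEY'],
    }
)

# Confirm that the database connection is ready
client.is_ready()  # True
```

#### Implementing Keyword Search

- Documents are ranked by the number of words they share with the query
![](https://hackmd.io/_uploads/S1RbdZpbp.png =400x)
- `def keyword_search`
    - Runs a keyword search against the Weaviate database with a filter and a result limit, and returns the requested property fields. It shows how to issue keyword queries through the Weaviate API
    - `views` in `properties` is the article's view count, a custom property of this Wikipedia dataset
        - When "views" is listed in `properties`, each returned result includes a `views` field with the number of times that Wikipedia article has been viewed.
        - This can be used to sort and filter by view count, for example to query the most viewed articles, or only articles whose view count exceeds a threshold

:::spoiler Supplement: the BM25 algorithm
BM25 is a widely used keyword search algorithm. Its main steps are:

* For each document, compute the term frequency (TF): how often a term occurs in the document
* For each term, compute the document frequency (DF): the number of documents that contain the term
* Compute the weight of each term from TF and DF:

$$
\textrm{weight} = \textrm{TF} \times \frac{k_1 + 1}{\textrm{TF} + k_1 \times \left(1 - b + b \times \frac{\textrm{DL}}{\textrm{AVDL}}\right)} \times \log\frac{N}{\textrm{DF}}
$$

where:
* $k_1$ and $b$ are empirical parameters, usually $k_1 = 1.2$ and $b = 0.75$
* $\textrm{DL}$ is the document length and $\textrm{AVDL}$ is the average length of all documents
* $N$ is the total number of documents

* Sum the weights of the query terms to get the relevance score between the document and the query
* Sort the results by score and return them
:::

```py=
def keyword_search(query,
                   results_lang='en',  # several languages are supported, including Chinese
                   properties=["title", "url", "text", "views"],
                   num_results=3):
    # Filter the results by language
    where_filter = {
        "path": ["lang"],
        "operator": "Equal",
        "valueString": results_lang
    }

    response = (
        client.query.get("Articles", properties)
        # Use the BM25 algorithm for keyword search
        .with_bm25(
            query=query
        )
        .with_where(where_filter)
        .with_limit(num_results)
        .do()
    )

    result = response['data']['Get']['Articles']
    return result


def print_result(result):
    """ Print every field of each result, one field per line """
    for i, item in enumerate(result):
        print(f'item {i}')
        for key in item.keys():
            print(f"{key}:{item.get(key)}")
            print()
        print()
```

- Query
    - Enter the query

```python=
query = "What is the most viewed televised event?"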
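keyword_search_results = keyword_search(query)
print_result(keyword_search_results)
```

- Although the returned articles share many keywords with the query, the top-ranked article is not really relevant to the question. The second-ranked article, "Super Bowl XXXVIII halftime show controversy", is closer to the answer the user wants

```python=
item 0
text:"The most active Gamergate supporters or "Gamergaters"..."
title:Gamergate (harassment campaign)
url:https://zh.wikipedia.org/wiki?curid=1147446

item 1
text:"Rolling Stone" stated Jackson's Super Bowl performance "is far and away the most famous moment in the history of the Super Bowl halftime show".
title:Super Bowl XXXVIII halftime show controversy
url:https://en.wikipedia.org/wiki?curid=498971
```

### Keyword Search Internals

The algorithm and execution flow behind keyword search can be summarized as follows:

<div style="text-align: center;">
<figure>
<img src="https://hackmd.io/_uploads/SJTtzzpWT.png" alt="DS_plugin_metadata.png" width="600">
<figcaption>
<span style="color: #3F7FBF; font-weight: bold;">
<a href="https://learn.deeplearning.ai/large-language-models-semantic-search/lesson/2/keyword-search" target="_blank">Keyword Search Internals</a>
</span>
</figcaption>
</figure>
</div>

1. Build an inverted index
    * Extract all terms from the documents
    * Create an entry for each term that records the IDs of the documents containing it
    * Also record the term frequency (TF) of the term in each document
2. At query time
    * Tokenize the query to obtain the query terms
    * Look up each term in the inverted index to get the list of candidate documents
    * Compute term weights from the term frequency (TF) in each document
    * Compute the relevance score between each document and the query, for example with BM25
    * Sort by score and return the most relevant documents
3. Optimization
    * Compress the index structure to save space
    * Use caching to speed up lookups
    * Use sharding and distributed storage to improve scalability
4. 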
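   Improving result quality
    * Reduce the number of distinct terms and improve matching through normalization such as stemming and spelling correction
    * Use query expansion to add related vocabulary and raise recall
    * Combine several matching models

Building an inverted index and using term-frequency statistics in this way makes fast, large-scale keyword search possible. The optimizations above can improve the results further.

### Limitation of keyword matching

<div style="text-align: center;">
<figure>
<img src="https://hackmd.io/_uploads/BJrmEGTbT.png" alt="DS_plugin_metadata.png" width="300">
<figcaption>
<span style="color: #3F7FBF; font-weight: bold;">
<a href="https://learn.deeplearning.ai/large-language-models-semantic-search/lesson/2/keyword-search" target="_blank">Limitation of keyword matching</a>
</span>
</figcaption>
</figure>
</div>

* The query and the document must use sufficiently similar wording to be matched
* Synonyms: matching is difficult when different words express the same meaning
* Multi-word phrases: the match weakens when a fixed phrase is split into separate tokens
* No semantic understanding: relationships in meaning between words cannot be judged
* Insufficient query expansion: related vocabulary is not added automatically
* Dependence on term-frequency statistics: matching is weak for rare terms
* Part of speech and syntactic structure cannot be used
* External knowledge cannot be brought in to improve results

In summary, keyword matching cannot understand meaning in depth, and it struggles to match texts that are semantically related but worded differently. External knowledge and semantic understanding are needed to improve it.

### Language models can improve both search stages

<div style="text-align: center;">
<figure>
<img src="https://hackmd.io/_uploads/SkyzBz6ZT.png" alt="DS_plugin_metadata.png" width="500">
<figcaption>
<span style="color: #3F7FBF; font-weight: bold;">
<a href="https://learn.deeplearning.ai/large-language-models-semantic-search/lesson/2/keyword-search" target="_blank">Language Models Can Improve both Search Stages</a>
</span>
</figcaption>
</figure>
</div>

Language models can improve the two stages of a search engine:

1. Retrieval stage:
    * Represent word semantics with embedding vectors
    * Match the query to documents by vector similarity
    * Improve on retrieval algorithms based on keyword matching
2. 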
   Re-ranking stage:
    * Re-rank the retrieved results
    * Improve relevance judgments through semantic understanding
    * Combine additional signals, such as quality and popularity, when ranking

## Ranking Responses

<div style="text-align: center;">
<figure>
<img src="https://hackmd.io/_uploads/rJwqXeaWp.png" alt="DS_plugin_metadata.png" width="400">
<figcaption>
<span style="color: #3F7FBF; font-weight: bold;">
<a href="https://learn.deeplearning.ai/large-language-models-semantic-search/lesson/1/introduction" target="_blank">Ranking Responses</a>
</span>
</figcaption>
</figure>
</div>

## Dense Retrieval

<div style="text-align: center;">
<figure>
<img src="https://hackmd.io/_uploads/rJgGExa-T.png" alt="DS_plugin_metadata.png" width="400">
<figcaption>
<span style="color: #3F7FBF; font-weight: bold;">
<a href="https://learn.deeplearning.ai/large-language-models-semantic-search/lesson/1/introduction" target="_blank">Dense Retrieval </a>
</span>
</figcaption>
</figure>
</div>

## Evaluation Methods

<div style="text-align: center;">
<figure>
<img src="https://hackmd.io/_uploads/BkzoVgpba.png" alt="DS_plugin_metadata.png" width="400">
<figcaption>
<span style="color: #3F7FBF; font-weight: bold;">
<a href="https://learn.deeplearning.ai/large-language-models-semantic-search/lesson/1/introduction" target="_blank">Evaluation metric </a>
</span>
</figcaption>
</figure>
</div>

## Search-Powered LLMs

<div style="text-align: center;">
<figure>
<img src="https://hackmd.io/_uploads/HkgCExTbT.png" alt="DS_plugin_metadata.png" width="400">
<figcaption>
<span style="color: #3F7FBF; font-weight: bold;">
<a href="https://learn.deeplearning.ai/large-language-models-semantic-search/lesson/1/introduction" target="_blank">Search-Powered LLMs </a>
</span>
</figcaption>
</figure>
</div>

## Learning resources

### [Cohere LLM University](https://docs.cohere.com/docs/llmu)

Learning resources from the Cohere technical documentation
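
## Supplement: a toy inverted index with BM25 scoring

To make the keyword search internals described earlier more concrete, here is a minimal sketch of an inverted index scored with the BM25 weight formula from the BM25 note. It is not how Weaviate implements BM25. The corpus `toy_docs`, the whitespace tokenizer, and the helper names `tokenize`, `build_inverted_index`, and `bm25_search` are all assumptions made up for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Hypothetical tokenizer: split on whitespace, strip basic punctuation, lowercase."""
    stripped = (w.strip('.,?!"\'()') for w in text.split())
    return [w.lower() for w in stripped if w]


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency}; also record document lengths."""
    index = defaultdict(dict)
    doc_lengths = {}
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        doc_lengths[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index, doc_lengths


def bm25_search(query, index, doc_lengths, k1=1.2, b=0.75, num_results=3):
    """Sum the per-term BM25 weights for each document and return the top hits."""
    n_docs = len(doc_lengths)
    avg_len = sum(doc_lengths.values()) / n_docs
    scores = defaultdict(float)
    for term in set(tokenize(query)):
        postings = index.get(term, {})
        if not postings:
            continue  # the term appears in no document
        idf = math.log(n_docs / len(postings))  # log(N / DF)
        for doc_id, tf in postings.items():
            length_norm = 1 - b + b * doc_lengths[doc_id] / avg_len
            scores[doc_id] += tf * (k1 + 1) / (tf + k1 * length_norm) * idf
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:num_results]


# Made-up three-document corpus, for illustration only
toy_docs = {
    "d1": "The Super Bowl halftime show is watched by millions of viewers",
    "d2": "The most viewed video on the site was a music video",
    "d3": "Televised sports events draw large audiences every year",
}

index, doc_lengths = build_inverted_index(toy_docs)
results = bm25_search("most viewed televised event", index, doc_lengths)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

In this toy corpus, `d1` is arguably the closest in meaning to the query, yet it receives no score because it shares no word with it, and "event" in the query does not even match "events" in `d3`. This is the kind of vocabulary mismatch listed under the limitations of keyword matching.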