
Deeplearning.ai GenAI/LLM Course Series Notes

Large Language Models with Semantic Search

Finetuning Large Language Models


Large Language Models with Semantic Search

Embeddings (vector representations)

Course overview

  • What embeddings are: mapping words to vectors so that similar words cluster together
  • How to generate word and sentence embeddings with the Cohere API
  • Visualizing word and sentence embeddings with UMAP
  • Generating and visualizing paragraph embeddings on a Wikipedia dataset
  • The role of embeddings in semantic search: they can be used to search a large dataset for the answer to a query
  • Using embeddings to search a large dataset for answers is the basic idea behind dense retrieval (Dense Retrieval)

The concept of embeddings


  • Embeddings map words or sentences to numbers, represented as vectors, which makes text/human language easier for computers to process (quantify)
  • Every word in the embedding space corresponds to a set of numeric coordinates, and words with similar meanings are mapped to nearby vectors: fruit words cluster together, and vehicle words cluster together
  • Not only single words but also longer sentences can be mapped to vectors; similar sentences sit in nearby positions in the space
  • Given the embeddings of a set of words, the appropriate position of a new word can be computed from vector distances
  • The embedding space is high-dimensional (the dimensionality is set during model training); a single word may correspond to hundreds or thousands of coordinates
  • Embeddings provide the basis for computing word/sentence/document similarity
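The "nearby vectors mean similar words" idea is usually quantified with cosine similarity. Here is a minimal sketch with made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions; the function and values below are illustrative, not from the course):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (1 = same direction)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-D "embeddings" for illustration only.
apple  = [0.9, 0.8, 0.1]
banana = [0.8, 0.9, 0.2]
car    = [0.1, 0.2, 0.9]

print(cosine_similarity(apple, banana))  # close to 1: related words
print(cosine_similarity(apple, car))     # much lower: unrelated words
```

The same comparison works unchanged on real embedding vectors, whatever their dimensionality.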

Word Embeddings

(figure: Word Embeddings)
Creating word embeddings with Cohere

Cohere is an API service built on large language models that can be used for natural-language-processing tasks such as text generation, summarization, and analogies.

  • Environment setup

    # !pip install cohere umap-learn altair datasets

    import os
    from dotenv import load_dotenv, find_dotenv
    _ = load_dotenv(find_dotenv()) # read local .env file

    import cohere
    co = cohere.Client(os.environ['COHERE_API_KEY'])

    import pandas as pd
  • Manually create three words

    three_words = pd.DataFrame({'text':
      [
          'joy',
          'happiness',
          'potato'
      ]})
    three_words
    
  • Use the Cohere API `co.embed` to convert these three words into embeddings:

    three_words_emb = co.embed(texts=list(three_words['text']),
                               model='embed-english-v2.0').embeddings
    word_1 = three_words_emb[0]
    word_2 = three_words_emb[1]
    word_3 = three_words_emb[2]

    # The embedding of the word "joy": show its first 10 elements
    word_1[:10]
    # [2.3203125, -0.18334961, -0.5756836, -0.7285156, -2.2070312, -2.5957031, 0.35424805, -1.625, 0.27392578, 0.3083496]

    import numpy as np
    word_embs = np.asarray(three_words_emb)
    pd.DataFrame(word_embs)
    
    • The embeddings of the three words

Sentence Embeddings

  • Sentence embeddings produce vector representations not only for single words but also for longer documents
  • Even when two sentences use different words, their embeddings will be close if their meanings are similar
  • For example, "Hello, how are you?" and "Hi, how's it going?" share almost no words but mean nearly the same thing, so their embeddings are very close
  • By comparing vector distances, a question's vector can be matched against a set of sentence vectors to find the most similar sentence as the answer
  • This is the basic method behind dense retrieval (Dense Retrieval)
(figure: Sentence Embeddings)
  • Creating sentence embeddings with Cohere

    sentences = pd.DataFrame({'text':
      [
       'Where is the world cup?',
       'The world cup is in Qatar',
       'What color is the sky?',
       'The sky is blue',
       'Where does the bear live?',
       'The bear lives in the woods',
       'What is an apple?',
       'An apple is a fruit',
      ]})

    sentences
    

    Note that these form exactly four matching question-answer (QA) pairs.


    Next, convert them into embeddings:

    emb = co.embed(texts=list(sentences['text']),
                   model='embed-english-v2.0').embeddings

    import numpy as np
    sentence_emb = np.asarray(emb)
    pd.DataFrame(sentence_emb)
    
    • The embeddings of the 8 sentences (4 QA pairs); note that the columns run up to 4096


    • After plotting, semantically related QA pairs sit very close together in the space

      • Here the UMAP algorithm (Understanding UMAP) is used to reduce the 4096-dimensional vectors to 2 dimensions before plotting them on a plane
    (figure: Sentence Embeddings umap_plot)
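UMAP itself requires the umap-learn package installed earlier; as a dependency-free sketch of the same idea (projecting high-dimensional vectors down to 2-D for plotting), here is PCA via NumPy's SVD. The random vectors stand in for the real 8 x 4096 sentence embeddings, and PCA is a deliberate simplification: UMAP is non-linear, but the goal is the same.

```python
import numpy as np

# Stand-ins for the eight 4096-dimensional sentence embeddings; random
# vectors are used only so the sketch runs without the Cohere API.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 4096))

# PCA via SVD: center the data, decompose, and project onto the top-2
# principal directions to get plottable 2-D coordinates.
centered = emb - emb.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
coords_2d = centered @ Vt[:2].T
print(coords_2d.shape)  # (8, 2)
```

The resulting `coords_2d` array is what a scatter-plot library would consume, one point per sentence.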
  • Dense Retrieval

    Sentence embeddings can power question-answering systems by searching for the sentence most similar to a question and returning it as the answer. The same idea scales to retrieval over large document collections, which is known as dense retrieval.
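A minimal sketch of the dense-retrieval loop: embed the query, embed each candidate sentence, and return the candidate with the highest cosine similarity. The `embed_toy` bag-of-characters vectorizer below is an illustrative stand-in for `co.embed` (both function names are assumptions, not the course's code); real dense retrieval would use the model's embeddings instead.

```python
import numpy as np

def embed_toy(text):
    """Toy 'embedding': letter-count vector. A stand-in for a real
    embedding model so the sketch runs offline."""
    vec = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord('a')] += 1
    return vec

def dense_retrieve(query, corpus):
    """Return the corpus sentence whose vector is closest to the query's."""
    q = embed_toy(query)
    sims = []
    for doc in corpus:
        d = embed_toy(doc)
        sims.append(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))
    return corpus[int(np.argmax(sims))]

corpus = ['The world cup is in Qatar',
          'The sky is blue',
          'An apple is a fruit']
print(dense_retrieve('Where is the world cup?', corpus))
# The world cup is in Qatar
```

At scale, the per-document loop is replaced by a precomputed embedding matrix and an approximate nearest-neighbor index.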

Articles Embeddings

  • Creating article embeddings with Cohere

    • Generate article embeddings for a large Wikipedia dataset of 2,000 articles
      • A pre-computed example file ('wikipedia.pkl') is provided and read directly
    import pandas as pd
    wiki_articles = pd.read_pickle('wikipedia.pkl')
    print(wiki_articles.shape)
    # (2000, 9)
    print(wiki_articles.columns.values)
    # ['id' 'title' 'text' 'url' 'wiki_id' 'views' 'paragraph_id' 'langs' 'emb']
    cols = ['title','wiki_id', 'views', 'emb', 'text']
    wiki_articles[cols].head(3)
    • The embeddings are in the emb column
    articles = wiki_articles[['title', 'text']]
    embeds = np.array([d for d in wiki_articles['emb']])
    print(embeds.shape)
    # (2000, 768)
    chart = umap_plot_big(articles, embeds)  # plotting helper provided by the course
    chart.interactive()
    • The 2,000 Wikipedia articles are reduced from 768 dimensions to 2 with UMAP and drawn on a 2-D plane
      • In the visualized 2-D embedding space, similar article embeddings cluster together
      • Articles on topics such as languages, countries, monarchs, football players, and artists cluster in different regions
      • Exploring the embedding chart reveals where articles on different topics are distributed
      • Embeddings make it possible to visualize large document collections and expose their semantic relationships