Large Language Models with Semantic Search

Embedding (Vector Representation)

Course Overview

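- What an embedding (vector representation) is: words are converted into vectors, and similar words cluster together
- How to generate word and sentence embeddings with the Cohere API
- Visualizing word and sentence embeddings with UMAP
- Generating and visualizing paragraph embeddings on a Wikipedia dataset
- The role of embeddings in semantic search: they can be used to search a large dataset for the answer to a query
- The basic idea of dense retrieval: using embeddings to search a large dataset for answers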

The Concept of Embeddings (Vector Representations)


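- Embedding is the process of mapping words or sentences to numbers, represented as vectors, which makes documents and human language easier for computers to process (quantify)
- Every word in the embedding space corresponds to a set of numeric coordinates. Words with similar meanings are mapped to nearby vectors; for example, fruit words cluster together, and so do vehicle words.
- Not only single words but also long sentences can be mapped to vectors. Similar sentences sit close to each other in the space
- Given the embeddings of a set of words, a suitable position for a new word can be worked out from vector distances
- The embedding space is high-dimensional (the dimensionality is fixed when the model is trained), so a word may correspond to hundreds or thousands of coordinates
- Embeddings provide the basis for computing similarity between words, sentences, and documents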
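To make "nearby vectors" concrete, here is a minimal sketch of cosine similarity, one common way to compare two embeddings. The three-dimensional vectors are hypothetical values invented for illustration; real embeddings come from a model and have far more dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional "embeddings", made up purely for illustration
toy = {
    "apple":  [0.9, 0.1, 0.0],
    "banana": [0.8, 0.2, 0.1],
    "truck":  [0.1, 0.9, 0.3],
}

fruit_sim = cosine_similarity(toy["apple"], toy["banana"])
mixed_sim = cosine_similarity(toy["apple"], toy["truck"])
print(f"apple vs banana: {fruit_sim:.3f}")
print(f"apple vs truck:  {mixed_sim:.3f}")
```

With these made-up values the two fruit vectors score higher than the fruit and vehicle pair, which is the clustering behaviour described above.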

Word Embeddings


Creating Word Embeddings with `Cohere`

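Cohere is an API service built on large language models; it can be used for natural language processing tasks such as text generation, summarization, and the like.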

Environment Setup







```
# !pip install cohere umap-learn altair datasets
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
import cohere
co = cohere.Client(os.environ['COHERE_API_KEY'])
import pandas as pd
```

Manually create three words

```
three_words = pd.DataFrame({'text':
  [
      'joy',
      'happiness',
      'potato'
  ]})
three_words
```

Use the Cohere API's `co.embed` to convert these three words into vector representations (embeddings):

```
three_words_emb = co.embed(texts=list(three_words['text']),
                           model='embed-english-v2.0').embeddings
word_1 = three_words_emb[0]
word_2 = three_words_emb[1]
word_3 = three_words_emb[2]

# Show the first 10 elements of the vector for the word "joy"
word_1[:10]
# [2.3203125, -0.18334961, -0.5756836, -0.7285156, -2.2070312, -2.5957031, 0.35424805, -1.625, 0.27392578, 0.3083496]

import numpy as np
word_embs = np.asarray(three_words_emb)
pd.DataFrame(word_embs)
```

Vector representations of the 3 words
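A natural next step is to compare the rows of that table with each other. The sketch below computes every pairwise cosine similarity at once. The function name `pairwise_cosine` and the stand-in vectors are assumptions made for illustration, not part of the course code.

```python
import numpy as np

def pairwise_cosine(embs):
    """Return an (n, n) matrix of cosine similarities between the rows of embs."""
    embs = np.asarray(embs, dtype=float)
    # Scale every row to unit length so a dot product equals cosine similarity
    unit = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    return unit @ unit.T

# Stand-in vectors (hypothetical values, not real Cohere output)
demo_embs = np.array([
    [1.0, 0.2, 0.0],   # stands in for 'joy'
    [0.9, 0.3, 0.1],   # stands in for 'happiness'
    [0.0, 0.1, 1.0],   # stands in for 'potato'
])
sims = pairwise_cosine(demo_embs)
print(np.round(sims, 3))
```

With the real data the call would be `pairwise_cosine(word_embs)`. Whether 'joy' and 'happiness' actually score higher than either does against 'potato' is something to confirm by running it; these notes do not record that result.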

Sentence Embeddings

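- Sentence embeddings produce vector representations not only for single words but also for longer text
- Even when two sentences use different words, their vectors will be close if their meanings are similar
- For example, "Hello, how are you?" and "Hi, how's it going?" use different words but mean nearly the same thing, so their vectors will also be very close
- By comparing vector distances, the vector of a question can be matched against the vectors of a set of sentences to find the most similar sentence as the answer
- This is the basic method behind dense retrieval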
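The matching step described above can be sketched as a small search function: normalize the vectors, score every sentence against the question, and return the best matches. The `retrieve` helper, the corpus, and all vector values here are hypothetical and chosen by hand; in practice both the question and the sentences would be embedded by the same model.

```python
import numpy as np

def retrieve(query_emb, corpus_embs, top_k=1):
    """Return (indices, scores) of the top_k corpus rows most similar to
    query_emb, ranked by cosine similarity with the best match first."""
    q = np.asarray(query_emb, dtype=float)
    c = np.asarray(corpus_embs, dtype=float)
    q = q / np.linalg.norm(q)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    scores = c @ q                       # one cosine score per corpus row
    order = np.argsort(-scores)[:top_k]  # highest scores first
    return order.tolist(), scores[order].tolist()

# Hypothetical corpus with hand-made vectors, for illustration only
corpus = ["The sky is blue", "An apple is a fruit", "The bear lives in the woods"]
corpus_embs = np.array([
    [0.9, 0.1, 0.1],
    [0.1, 0.9, 0.2],
    [0.2, 0.1, 0.9],
])
query_emb = np.array([0.8, 0.2, 0.0])  # pretend embedding of "What color is the sky?"

idx, scores = retrieve(query_emb, corpus_embs, top_k=2)
for i, s in zip(idx, scores):
    print(f"{s:.3f}  {corpus[i]}")
```

This brute-force scan scores every row, which is fine for a few thousand vectors; larger collections usually rely on an approximate nearest-neighbour index instead.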


Creating Sentence Embeddings with Cohere
```
sentences = pd.DataFrame({'text':
  [
   'Where is the world cup?',
   'The world cup is in Qatar',
   'What color is the sky?',
   'The sky is blue',
   'Where does the bear live?',
   'The bear lives in the woods',
   'What is an apple?',
   'An apple is a fruit',
  ]})

sentences
```
Note that these happen to form exactly 4 matching question-answer (QA) pairs.
The next step is to convert them into vector representations
```
emb = co.embed(texts=list(sentences['text']),
               model='embed-english-v2.0').embeddings

import numpy as np 
sentence_emb = np.asarray(emb)  
pd.DataFrame(sentence_emb)
```
- Vector representations of the 8 sentences (4 QA pairs); note that there are 4096 columns, one per embedding dimension
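- Once plotted, the semantically similar QA pairs turn out to lie very close together in space
  - Here the UMAP algorithm (Understanding UMAP) reduces the 4096-dimensional vectors to 2 dimensions before they are drawn on a plane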
Dense Retrieval

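Sentence embeddings can be used in question-answering systems, which search for the sentence most similar to the question and return it as the answer. They can also be used for retrieval over large document collections, which is known as dense retrieval.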

Articles Embeddings

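Creating Article Embeddings with Cohere
- Generate article embeddings from a large Wikipedia dataset containing 2,000 articles
  - A pre-converted example file ('wikipedia.pkl') is provided here and is read directly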
```
import pandas as pd
wiki_articles = pd.read_pickle('wikipedia.pkl')
print(wiki_articles.shape)          # (2000, 9)
print(wiki_articles.columns.values) # ['id' 'title' 'text' 'url' 'wiki_id' 'views' 'paragraph_id' 'langs' 'emb']
cols = ['title','wiki_id', 'views', 'emb', 'text']
wiki_articles[cols].head(3)
```
- The embeddings are in the emb column
```
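# np (NumPy) was imported earlier; umap_plot_big is not defined in these notes
# and is assumed to be a plotting helper supplied with the course materials
articles = wiki_articles[['title', 'text']]
embeds = np.array([d for d in wiki_articles['emb']])
print(embeds.shape) # (2000, 768)
chart = umap_plot_big(articles, embeds)
chart.interactive()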
```
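- The 2,000 Wikipedia articles are reduced from 768 dimensions to 2 with UMAP and drawn on a 2D plot
  - In the visualized 2D embedding space, embeddings of similar articles cluster together
  - Articles on topics such as languages, countries, monarchs, football players, and artists gather in different regions
  - Exploring the embeddings chart reveals how articles on different topics are distributed
  - Embeddings make it possible to visualize large document collections and expose their semantic relationships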

Deeplearning.ai GenAI/LLM course notes series

Large Language Models with Semantic Search

Finetuning Large Language Models
