Building Text Embeddings with Cohere
Cohere is an API service built on large language models. It can be used for natural language processing tasks such as text generation, summarization, and analogies.
Environment Setup
# !pip install cohere umap-learn altair datasets
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
import cohere
co = cohere.Client(os.environ['COHERE_API_KEY'])  # create the Cohere API client
import pandas as pd
Manually create three words
three_words = pd.DataFrame({'text':
[
'joy',
'happiness',
'potato'
]})
three_words
Use the Cohere API's co.embed endpoint to convert these three words into vector representations (embeddings):
three_words_emb = co.embed(texts=list(three_words['text']),
model='embed-english-v2.0').embeddings
word_1 = three_words_emb[0]
word_2 = three_words_emb[1]
word_3 = three_words_emb[2]
# Vector of the word "joy", showing its first 10 elements
word_1[:10]
# [2.3203125, -0.18334961, -0.5756836, -0.7285156, -2.2070312, -2.5957031, 0.35424805, -1.625, 0.27392578, 0.3083496]
import numpy as np
word_embs = np.asarray(three_words_emb)
pd.DataFrame(word_embs)
Building Sentence Embeddings with Cohere
sentences = pd.DataFrame({'text':
[
'Where is the world cup?',
'The world cup is in Qatar',
'What color is the sky?',
'The sky is blue',
'Where does the bear live?',
'The bear lives in the woods',
'What is an apple?',
'An apple is a fruit',
]})
sentences
Here you can see these form exactly 4 matching question-answer (QA) pairs.
Next, convert them into vector representations:
emb = co.embed(texts=list(sentences['text']),
model='embed-english-v2.0').embeddings
import numpy as np
sentence_emb = np.asarray(emb)
pd.DataFrame(sentence_emb)
Vector representations of the 8 sentences (4 QA pairs). Note that the columns run up to 4096: each embedding is 4096-dimensional.
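As a quick sanity check (an added one-liner, using only the sentence_emb array defined above), the shape of the matrix confirms 8 sentences with 4096 dimensions each:
print(sentence_emb.shape)  # expected: (8, 4096) for embed-english-v2.0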
When plotted, semantically related QA pairs end up very close to each other in the embedding space.
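The plotting code is not shown above; the following is a minimal sketch using the umap-learn and altair packages installed earlier. The n_neighbors value and chart styling are illustrative choices, not taken from the original notebook:
import umap
import altair as alt

# Project the 4096-dimensional sentence embeddings down to 2D with UMAP
reducer = umap.UMAP(n_neighbors=2, random_state=42)
umap_embs = reducer.fit_transform(sentence_emb)

# Attach the 2D coordinates to the sentences and plot with Altair
df_plot = sentences.copy()
df_plot['x'] = umap_embs[:, 0]
df_plot['y'] = umap_embs[:, 1]

chart = alt.Chart(df_plot).mark_circle(size=80).encode(
    x='x',
    y='y',
    tooltip=['text']
).interactive()
chart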
Dense Retrieval
Sentence embeddings can be used in a question-answering system: search for the sentence most similar to the question and return it as the answer. The same idea also applies to retrieval over large document collections, which is known as dense retrieval.
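To illustrate the idea, here is a minimal sketch of dense retrieval over the 8 sentence embeddings built above, using cosine similarity. The query text and top_k value are assumptions for demonstration, not part of the original notebook:
import numpy as np

query = 'What is the location of the world cup?'  # illustrative query, not from the original
query_emb = np.asarray(co.embed(texts=[query],
                                model='embed-english-v2.0').embeddings[0])

# Cosine similarity between the query and every sentence embedding
doc_norms = sentence_emb / np.linalg.norm(sentence_emb, axis=1, keepdims=True)
scores = doc_norms @ (query_emb / np.linalg.norm(query_emb))

# Print the closest sentences first
top_k = 3
for idx in np.argsort(-scores)[:top_k]:
    print(f"{scores[idx]:.3f}  {sentences['text'].iloc[idx]}")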
Building Article Embeddings with Cohere
import pandas as pd
wiki_articles = pd.read_pickle('wikipedia.pkl')
print(wiki_articles.shape) # (2000, 9)
print(wiki_articles.columns.values) # ['id' 'title' 'text' 'url' 'wiki_id' 'views' 'paragraph_id' 'langs' 'emb']
cols = ['title','wiki_id', 'views', 'emb', 'text']
wiki_articles[cols].head(3)
The emb column holds the precomputed embedding for each article.
articles = wiki_articles[['title', 'text']]
embeds = np.array([d for d in wiki_articles['emb']])  # stack the per-article embeddings into one matrix
print(embeds.shape) # (2000, 768)
# umap_plot_big is a UMAP/Altair plotting helper provided with the original notebook (e.g. imported from its utils module)
chart = umap_plot_big(articles, embeds)
chart.interactive()