Code Repo Using LangChain, OpenAI and DeepLake

Following two walkthrough videos, this note completes two implementations:

1. [Chat with Your Code Repo Using ChatGPT **(youtube)**](https://www.youtube.com/watch?v=AmJGYu0U1L8&t=150s) [Analysis of Twitter the-algorithm source code with LangChain, GPT4 and Activeloop's Deep Lake **(code)**](https://python.langchain.com/docs/use_cases/code/twitter-the-algorithm-analysis-deeplake)
2. [Read Any Github Repo with LangChain + OpenAI **(youtube)**](https://www.youtube.com/watch?v=gz9-QhpwYTs) [RepoReader **(github)**](https://github.com/cmooredev/RepoReader)

# Implementation 1.

Carry out the following steps in an .ipynb file.

Install the required packages:

```
!python3 -m pip install --upgrade langchain deeplake openai tiktoken
```

[OpenAI api key](https://platform.openai.com/account/api-keys)

![](https://hackmd.io/_uploads/S1gHhS1on.png)

[Activeloop token](https://app.activeloop.ai/profile/pu/apitoken)

![](https://hackmd.io/_uploads/Sy1Oirkj3.png)

![](https://hackmd.io/_uploads/rJNcoHyon.png)

```
import os
import openai
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
from deeplake.core.vectorstore import VectorStore

os.environ['OPENAI_API_KEY'] = '......'  # fill in your API key (string)
os.environ['ACTIVELOOP_TOKEN'] = '......'  # fill in your token (string)

# Embedding model used to vectorize the code chunks
embeddings = OpenAIEmbeddings()
```

Log in with the Activeloop token:

`!activeloop login -t ......  # use the same token as above`

![](https://hackmd.io/_uploads/Bylj181i2.png)

[twitter/the-algorithm](https://github.com/twitter/the-algorithm)

This GitHub repository serves as the example. Clone it into the same directory as the notebook so it can be read (the `root_dir` below).

```
from langchain.document_loaders import TextLoader

root_dir = './the-algorithm-main'
docs = []
# Walk the cloned repository and load every file that can be read as UTF-8 text
for dirpath, dirnames, filenames in os.walk(root_dir):
    for file in filenames:
        try:
            loader = TextLoader(os.path.join(dirpath, file), encoding='utf-8')
            docs.extend(loader.load_and_split())
        except Exception as e:
            # Skip files that fail to load (e.g. binary files)
            pass
```

Chunk the files:

```
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)
```

![](https://hackmd.io/_uploads/HJy918Jon.png)

Compute the embeddings and upload them to Activeloop. You can then make the dataset public.

```
db = DeepLake(embedding_function=embeddings, dataset_path="hub://<your DeepLake organization name>/<a dataset name of your choice>", public=True)
db.add_documents(texts)
```

![](https://hackmd.io/_uploads/SkVse8yi3.png)

Once the upload succeeds, you can view the dataset on DeepLake. **You can also query it interactively in natural language directly on that page** (see the Example Queries on the page for the question format).

![](https://hackmd.io/_uploads/H1JtWIJin.png)

**The following uses code to ask questions about the repository:**

```
retriever = db.as_retriever()
# Cosine similarity, used to calculate the distance between two vectors
retriever.search_kwargs["distance_metric"] = "cos"
# Maximum number of code chunks to fetch
retriever.search_kwargs["fetch_k"] = 100
# The Maximal Marginal Relevance (MMR) criterion strives to reduce redundancy
# while maintaining query relevance in re-ranking retrieved documents and in
# selecting appropriate passages for text summarization.
retriever.search_kwargs["maximal_marginal_relevance"] = True
# Top k results to return
retriever.search_kwargs["k"] = 10
```

```
def filter(x):
    # filter based on source code
    if "com.google" in x["text"].data()["value"]:
        return False

    # filter based on path e.g. extension
    metadata = x["metadata"].data()["value"]
    return "scala" in metadata["source"] or "py" in metadata["source"]


### turn on below for custom filtering
# retriever.search_kwargs['filter'] = filter
```

```
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

model = ChatOpenAI(model_name="gpt-3.5-turbo")  # can be switched to 'gpt-4'
qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)
```

```
questions = [  # the questions to ask
    "What does favCountParams do?",
    "is it Likes + Bookmarks, or not clear from the code?",
    "What are the major negative modifiers that lower your linear ranking parameters?",
    "How do you get assigned to SimClusters?",
    "What is needed to migrate from one SimClusters to another SimClusters?",
    "How much do I get boosted within my cluster?",
    "How does Heavy ranker work. what are it’s main inputs?",
    "How can one influence Heavy ranker?",
    "why threads and long tweets do so well on the platform?",
    "Are thread and long tweet creators building a following that reacts to only threads?",
    "Do you need to follow different strategies to get most followers vs to get most likes and bookmarks per tweet?",
    "Content meta data and how it impacts virality (e.g. ALT in images).",
    "What are some unexpected fingerprints for spam factors?",
    "Is there any difference between company verified checkmarks and blue verified individual checkmarks?",
]
chat_history = []

# Ask each question in turn, carrying the earlier Q&A pairs along as chat history
for question in questions:
    result = qa({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")
```

![](https://hackmd.io/_uploads/HJSBm8Jih.png)

# Implementation 2.

Use the OpenAI (ChatGPT) API for question answering.

![](https://hackmd.io/_uploads/Sy1gewkj3.png)

https://github.com/cmooredev/RepoReader

Edit the code in main.py to set the environment variables and the API key (marked by the arrow and the underline in the image).

![](https://hackmd.io/_uploads/rk0YyP1o3.png)

Enter the GitHub URL:

![](https://hackmd.io/_uploads/BJAdbvJo2.png)

Start asking questions (both English and Chinese work):

![](https://hackmd.io/_uploads/SydgGPksn.png)

(Initial attempts)

* Understanding the code of specific functions and line ranges in .py files
* Running simple example steps

![](https://hackmd.io/_uploads/r1dafD1i2.png)

![](https://hackmd.io/_uploads/S1tCfvyo2.png)

![](https://hackmd.io/_uploads/H1kxmD1ih.png)
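For the first of the initial attempts listed above (asking about a specific function and its line range in a .py file), it can help to look up the exact line numbers before typing the question. The following is a minimal sketch using only the Python standard library. The helper names `find_function_span`, `numbered_excerpt`, and `build_question`, as well as the `SAMPLE` source, are hypothetical and are not part of RepoReader or LangChain.

```python
import ast
from typing import Tuple


def find_function_span(source: str, name: str) -> Tuple[int, int]:
    """Return the 1-based (first_line, last_line) of the named function."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name:
            return node.lineno, node.end_lineno
    raise ValueError(f"function {name!r} not found")


def numbered_excerpt(source: str, first: int, last: int) -> str:
    """Return lines first..last (inclusive), each prefixed with its line number."""
    lines = source.splitlines()
    return "\n".join(f"{n}: {lines[n - 1]}" for n in range(first, last + 1))


def build_question(path: str, source: str, name: str) -> str:
    """Compose a question that names the file, the function, and its line range."""
    first, last = find_function_span(source, name)
    return (
        f"In {path}, what does `{name}` (lines {first}-{last}) do?\n\n"
        + numbered_excerpt(source, first, last)
    )


# Hypothetical stand-in for a file from the repository being queried
SAMPLE = '''import os

def add(a, b):
    return a + b
'''

if __name__ == "__main__":
    print(build_question("example.py", SAMPLE, "add"))
```

The resulting text can be pasted into the question prompt. Naming the file and the line range explicitly is only one way to phrase such a question, and whether it improves the answers depends on how the tool retrieves code.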