# TBDGPT Implementation Proposal

## Intro

TBDGPT is the name we are giving to the AI tool that can help drive up adoption of TBD's projects while relieving our engineering and DevRel workload, by using AI as a first-responder interface and copilot tool.

## Problems

- We get a lot of basic inquiries in our Discord; a decent amount of time could be saved if we had a reasonably capable GPT answering them.
- Our engineering teams are spread thin across their current projects, yet they still need to spend a lot of time helping our users.
- The current TBD Ask GPT on developer.tbd.website is great, but:
    - It needs manual updates whenever new releases are cut or new guides are added to the website.
    - It is narrowly focused on Web5; we want it to be a knowledge base for all TBD SDKs: Web5, tbDEX, verifiable credentials, DWNs, DIDs, crypto, etc.

## Goals

- Reduce our community response time on Discord by having an effective first-layer self-serve assistant on our website and as a Discord bot.
- Improve the developer experience by having an AI bot answer questions and assist developers on their development journey.
- Expand TBDGPT into a VS Code plugin that assists with various developer tasks:
    - Creating tbDEX and Web5 apps from the ground up
    - Debugging tbDEX and Web5 apps
    - Helping the Core ENG team accelerate chore tasks: automating in-code docs, unit tests, PR reviews, and more.

## Summary

This proposal is structured in the following sections:

- [Landscape Assessment](#Landscape-Assessment): why we really need to invest in this
- [Solution](#Solution): describes what we need to get us there
- [Milestones](#Milestones): proposes an incremental roadmap of what we need to work on to make the solution a reality

## Landscape Assessment

The idea that AI assistants improve the developer experience is no longer speculation, a fad, or a far-fetched future dream. Check out the use cases below.

### AI Docs Assistant

These are AI assistants that would live in the TBD developers website: https://developers.tbd.website

**Prisma**

Prisma is a major database library for TypeScript, six years past its v1 release, with close to [half a million downloads a week on npmjs](https://www.npmjs.com/package/prisma). Asked a question about setting up an index using PostgreSQL, its assistant answers with the main example pages from the docs and ***also adds caveats from GitHub Issues***:

![image](https://hackmd.io/_uploads/SJFHmInu6.png)

Source: https://www.kapa.ai/customer-stories/prisma

**Snap**

Snap offers a developer docs portal with an AI assistant powered by Mendable.ai:

![image](https://hackmd.io/_uploads/Hkfu3I3O6.png)

Source: https://docs.snap.com/

### AI Discord Bot

These are assistants that would go in our [TBD Discord community](https://discord.gg/tbd).

**Astro**

Astro is another popular JavaScript web framework. They implemented an AI Discord bot, and these are their numbers:

- ~100 support hours saved by having kapa.ai implemented (assuming avg. 20 mins per question, 300 questions per month)
- 20 days of end-user waiting time saved per month from instant answers (assuming avg. human response delay of 3 hrs)
- 85% accuracy of questions answered by kapa.ai, based on user-rated feedback

![image](https://hackmd.io/_uploads/SyuGt8n_6.png)

Source: https://www.kapa.ai/customer-stories/how-astro-build-uses-ai-to-handle-more-than-25-of-community-questions

**NextJS**

Another example, from a huge JavaScript web framework.
![image](https://hackmd.io/_uploads/HJasbv2O6.png)

### Choosing a Copilot friendly SDK

Here's an example of a user choosing the Python `pandas` lib over `polars` because their IDE copilot is more helpful with it:

![image](https://hackmd.io/_uploads/rynikO3dp.png)

Source: https://twitter.com/disiok/status/1744569181193085053

The point here is that, beyond having a great AI assistant on our website and community channels, the adoption of our SDKs would benefit if they are structured in a way copilots can consume: great guides in our portal that can be extracted/scraped, rich documentation in our source code with great API usage examples, etc.

### Existing Products

While researching tools that could help us achieve our goals and solve these problems, we found the ones below:

- [kapa.ai](https://www.kapa.ai/): ChatGPT for your developer-facing product. (Closed source, paid cloud tool)
- [mendable.ai](https://www.mendable.ai/): Answer technical questions with AI. (Closed source, paid cloud tool)
- [DocsGPT](https://docsgpt.arc53.com/): your technical documentation assistant! (Open source)

Kapa.ai and Mendable.ai appear to be the pioneering products in the docs-assistant space; I've already found their popups in a couple of dev docs portals. DocsGPT, by contrast, is a custom tool/app we would need to build on top of, ingesting our TBD documents and exposing query APIs to our services. We are also aware that other tools keep emerging, since the AI landscape has new tools being announced constantly.

**Should we choose a mature closed-source cloud product or develop our own custom solution?**

We've debated internally and decided to follow the custom-development path in phases, which is the main purpose of this proposal doc. That could mean developing a custom service from the ground up using popular frameworks such as LangChain and LlamaIndex, or building on top of DocsGPT as an OSS foundation. The main reasons for deciding on custom development:

- **Default to Open**: this is part of the **TBD Operating Principles**.
- **Full extensibility**: we need the freedom to support any future initiative we would like to pursue (VS Code extension, GitHub Action bot reviewer/checker/auto-documenter for our dev pipelines, etc.).
- **Risk-free**: AI startups building wrappers around OpenAI and other AI tools are very fragile, as we've seen over the last year.

## Solution

### RAG Pipeline

LLMs are trained on enormous bodies of data, but they aren't trained on *our* data. Retrieval-Augmented Generation (RAG) solves this problem by adding our data to the data LLMs already have access to.

![image](https://hackmd.io/_uploads/Bk88mKn_a.png)

**Loading**: this refers to getting our data from where it lives (like developers.tbd.website, or maybe Discord threads, GitHub issues, etc.) into our pipeline.

**Indexing**: this means creating a data structure that allows querying the data. For LLMs this nearly always means creating vector embeddings (numerical representations of the meaning of the data), along with numerous other metadata strategies that make it easy to accurately find contextually relevant data.

**Storing**: once our data is indexed, we will want to store the index, as well as other metadata, to avoid having to re-index it. (For our website data we would likely compare content hashes to re-index only when needed.)
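As a quick illustration of that content-hash idea, here is a minimal sketch; the `needs_reindex` helper and the `stored_hashes` map are hypothetical, standing in for whatever key-value store we persist alongside the index:

```python
import hashlib

def needs_reindex(doc_id: str, content: str, stored_hashes: dict) -> bool:
    """Return True when a document changed since we last indexed it.

    `stored_hashes` is a hypothetical doc_id -> sha256 map persisted
    alongside the index; any key-value store would do.
    """
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if stored_hashes.get(doc_id) == digest:
        return False  # content unchanged: skip re-embedding this doc
    stored_hashes[doc_id] = digest  # record the new hash for next run
    return True
```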
**Querying**: for any given indexing strategy there are many ways we can use LLMs to query, including sub-queries, multi-step queries, agents, and hybrid strategies.

**Evaluating**: it's not in the diagram above, but it's a critical step we should add to our pipeline to check how effective it is relative to other strategies, or when we make changes (e.g. new SDK releases with breaking API changes). Evaluation provides objective measures of how accurate, faithful, and fast our responses to queries are.

![image](https://hackmd.io/_uploads/HJ36NK2da.png)

The above diagrams were extracted from LlamaIndex, a Python library fully specialized in RAG:

> LlamaIndex is a simple, flexible data framework for connecting custom data sources to large language models.

They offer tooling and abstractions for all the steps above.

**Our current TBD Ask bot already uses a RAG!**

The bot that powers our developer website portal already uses a simplistic RAG:

*Loading --> Indexing --> Storing --> Querying*

**Loading**: at the moment it's a manual load of TBD concepts and examples; the documents are handcrafted.

**Indexing**: it's also a manual process of labeling and cataloging Web5 concepts into their own files, each with a meta description. For example:

- dwn_write: Shows how to write to Decentralized Web Nodes (DWN). ([Source Code](https://github.com/TBD54566975/web5-chatgpt-plugin/blob/main/content/dwn_write.txt))
- api_did: Explains the did API for Web5. ([Source Code](https://github.com/TBD54566975/web5-chatgpt-plugin/blob/main/content/api_did.txt))

**Storing**: the documents are stored on the server disk as text files, which can be found [in this content folder](https://github.com/TBD54566975/web5-chatgpt-plugin/tree/main/content).

**Querying**: in the current implementation it's a two-step process:

1. Ask GPT-4 which documents stored on disk are related to the user query, based on file name + description. ([Source Code](https://github.com/TBD54566975/web5-chatgpt-plugin/blob/main/main.py#L103-L118))
2. Retrieve the content of those relevant files and add it as context to the prompt that asks the user's question to GPT-3. ([Source Code](https://github.com/TBD54566975/web5-chatgpt-plugin/blob/main/main.py#L146-L151))

It's an amazingly simple RAG that works really well! The only problem is that we need to manually curate, index, and store our knowledge.

### Data Sources Ingestion

Data sources are the most important aspect of our RAG because, no matter what: ***garbage in == garbage out***!

We need to define which data should be loaded into our AI assistant so it can answer questions about our projects and help our users. The obvious source is the documents and guides from [our developer portal](https://developer.tbd.website) and the API Reference Docs. When our [App Creator examples](https://hackmd.io/QW7FwsnMQbGMZxmqmcBJbg?view#Milestone-1%EF%B8%8F%E2%83%A3) are ready, we could ingest them as well!

To surface fresh, relevant information, we could also load discussions happening in GitHub Issues and PRs, as well as Discord threads. Finally, we could capture our broader ecosystem by loading our blog articles and YouTube video transcriptions, adding more value for our community members and partners.
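Putting the pipeline steps and these sources together, here is a minimal end-to-end sketch using LlamaIndex (API as of ~v0.9; names may differ across versions). The `./docs` and `./storage` paths are placeholders, assuming a local checkout of the developer-site markdown and an `OPENAI_API_KEY` in the environment:

```python
import os

from llama_index import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

if not os.path.exists("./storage"):
    # Loading: pull the developer-site markdown into Document objects.
    documents = SimpleDirectoryReader("./docs", recursive=True).load_data()
    # Indexing: chunk the documents and compute vector embeddings.
    index = VectorStoreIndex.from_documents(documents)
    # Storing: persist the index so later runs don't re-embed everything.
    index.storage_context.persist(persist_dir="./storage")
else:
    # Subsequent runs reload the persisted index instead of re-indexing.
    index = load_index_from_storage(
        StorageContext.from_defaults(persist_dir="./storage")
    )

# Querying: retrieve relevant chunks, then synthesize an answer with the LLM.
query_engine = index.as_query_engine()
print(query_engine.query("How do I write a record to a DWN with Web5?"))
```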
### What about Agents?

An agent is an automated decision-maker powered by an LLM that interacts with the world via a set of tools. Agents can take an arbitrary number of steps to complete a given task, dynamically deciding on the best course of action rather than following pre-determined steps. This gives them additional flexibility to tackle more complex tasks.

In our RAG pipeline they could be capable of:

- Performing automated search and retrieval over different types of data: unstructured, semi-structured, and structured.
- Calling any external service API in a structured fashion, processing the response, and storing it for later.

**An agent needs a reasoning loop and access to tools** (see the sketch at the end of this section).

Here is an example of the ReAct agent answering the question "Aside from the Apple Remote, what other device can control the program Apple Remote was originally designed to interact with?":

![image](https://hackmd.io/_uploads/Sk8At9h_6.png)

ReAct is a well-known agent pattern, and a practical implementation with LangChain can be found [here](https://www.promptingguide.ai/techniques/react).

Many libraries implement agent abstractions:

- [Microsoft AutoGen](https://microsoft.github.io/autogen/)
- [Langchain Agents](https://python.langchain.com/docs/modules/agents/)
- [LlamaIndex Data Agents](https://docs.llamaindex.ai/en/stable/module_guides/deploying/agents/root.html)
- [CrewAI](https://github.com/joaomdmoura/crewAI)

We don't need to incorporate agents right away, but they could be a great tool to help us increase accuracy in our retrieval engine. Projects like GPT Pilot are powered by agents and are an inspiration for our TBDGPT VS Code plugin!

***GPT Pilot is a true AI developer that writes code, debugs it, talks to you when it needs help, etc.***

![image](https://hackmd.io/_uploads/r19ACF3_T.png)

https://github.com/Pythagora-io/gpt-pilot?tab=readme-ov-file#-how-gpt-pilot-works
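To make the "reasoning loop + tools" idea concrete, here is a stripped-down sketch of a tool-calling loop. It is not a faithful ReAct prompt: the JSON protocol, the `search_docs` tool, and the model choice are all assumptions made for illustration.

```python
import json

from openai import OpenAI  # assumes the openai v1 client and an OPENAI_API_KEY

client = OpenAI()

# Hypothetical tools; real ones might call our RAG query API or GitHub.
TOOLS = {"search_docs": lambda q: f"(top doc chunks for: {q})"}

SYSTEM = (
    "Answer the user's question. To use a tool, reply ONLY with JSON like "
    '{"tool": "search_docs", "input": "..."}. Available tools: search_docs. '
    "When you know the answer, reply with 'ANSWER: <your answer>'."
)

def run_agent(question: str, max_steps: int = 5) -> str:
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": question}]
    for _ in range(max_steps):  # the reasoning loop
        reply = client.chat.completions.create(
            model="gpt-4", messages=messages
        ).choices[0].message.content
        if reply.startswith("ANSWER:"):
            return reply[len("ANSWER:"):].strip()  # the agent is done
        call = json.loads(reply)  # the model chose a tool...
        observation = TOOLS[call["tool"]](call["input"])  # ...so run it
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return "Gave up after too many steps."
```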
### Shouldn't we fine-tune an LLM?

Fine-tuning an LLM is not a simple task, and it has limitations compared to RAG. For example, dynamic data such as a GitHub issue can't simply be ingested into the training data: once it's baked into the LLM's knowledge, we can't update it without another training run and deployment.

You can see RAG as short-lived memory added to the LLM to answer a query. RAG is super simple to add to any LLM, and the LLM we use to synthesize query responses can be replaced. In fact, we can store and curate RAG query answers to train an LLM and follow a hybrid approach in the future.

RAG has its own problems, but it's natural to start with it because it's far easier and smaller in scope, and we can move to a hybrid approach later when it makes sense.

_Fine-tuning LLMs is out of scope for this proposal._

## Milestones

The solution and goals above are ambitious, but they are achievable if we take an incremental approach, breaking the full solution we envision into small feature releases to collect small wins (and to get quicker feedback by observing real usage).

_Lettered milestones could be executed in parallel._

### Milestone 0: Automated RAG

- Implement a data loader for our website docs
    - Stripe has implemented one [here](https://github.com/run-llama/llama-hub/tree/7650795417d321b8dd51a635257061fb351ba02c/llama_hub/stripe_docs)
- Index the docs into a vector store
- Create a cron job to execute this load/indexing process daily
- Deploy a new service with APIs to serve this automated RAG
- Adjust our current Ask TBD interface to talk to this new service, and add source citations to the responses
- A reference example was deployed at https://chat-tbd.tbddev.org/chat/playground and a few notes can be found in [this internal doc](https://docs.google.com/document/d/1qcs4rarIJAGDpQioPlupg3WHd07Pr1FRuPrXwRZFxZU/edit?disco=AAABDFrHebM)
    - Feel free to play with it, asking multiple and tricky questions about Web5, tbDEX, etc.
    - Any feedback is appreciated! 🙏
- A couple of answered-query examples:

![image](https://hackmd.io/_uploads/SkrZ3s3da.png)
![image](https://hackmd.io/_uploads/SymXho2d6.png)

---

![image](https://hackmd.io/_uploads/r1pQhsndp.png)
![image](https://hackmd.io/_uploads/HynInohu6.png)

### Milestone 1A: Monitoring/Observability

- Rate limiting
- Security checks, adversarial prompt detection
- Usage/cost tracking
- Analytics
- TBDGPT Admin Dashboard Panel

### Milestone 1B: Discord Bot

- Use our RAG to answer Discord inquiries in threads or in discussions

### Milestone 1C: Evaluation

- Automated evaluation pipeline
    - This is a process that should happen somewhere in our CI to check how effectively our pipeline is performing
    - It also gives us a baseline to tell whether the changes and tweaks we make to our RAG pipeline are improving results or making them worse
    - An example: we could use GPT-4 to generate questions and evaluate our RAG responses (see the sketch after this list). We usually evaluate two aspects:
        - **Response Evaluation:** Does the response match the retrieved context? Does it also match the query? Does it match the reference answer or guidelines?
        - **Retrieval Evaluation:** Are the retrieved sources relevant to the query?
    - Relying on fully automated evaluation by GPT-4 might be suboptimal; a hybrid approach pairing experts (us) with the model will probably generate better results
    - A couple of resources:
        - [Evaluating - LlamaIndex Docs](https://docs.llamaindex.ai/en/stable/module_guides/evaluating/root.html)
        - [Workshop: Evaluation-Driven Development (EDD)](https://www.youtube.com/watch?v=ua93WTjIN7s)
- User feedback
    - Come up with a system to collect user feedback
    - Also weigh maintainer feedback vs. user feedback
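Here is a minimal sketch of that GPT-4-as-judge idea using LlamaIndex's built-in evaluators (API as of ~v0.9; class names and signatures may differ in other versions, and the `./docs` path plus the sample query are placeholders):

```python
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.evaluation import FaithfulnessEvaluator, RelevancyEvaluator
from llama_index.llms import OpenAI

# GPT-4 acts as the judge; the pipeline under test can use a cheaper model.
judge = ServiceContext.from_defaults(llm=OpenAI(model="gpt-4"))

# Build a throwaway index over the docs we want to test against.
index = VectorStoreIndex.from_documents(SimpleDirectoryReader("./docs").load_data())
query = "How do I create a DID with Web5?"
response = index.as_query_engine().query(query)

# Response evaluation: is the answer faithful to the retrieved context?
faithfulness = FaithfulnessEvaluator(service_context=judge)
print("faithful:", faithfulness.evaluate_response(response=response).passing)

# Retrieval evaluation: are the retrieved sources relevant to the query?
relevancy = RelevancyEvaluator(service_context=judge)
print("relevant:", relevancy.evaluate_response(query=query, response=response).passing)
```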
### Milestone 2A: Adding more Data Sources

**Pre-curated:**

- SDKs source code
- App example source code
- YouTube transcriptions

**Requires curation:**

- GH Issues and PRs
- Discord threads

Pre-curated content is probably good enough to be ingested into our pipeline out of the box, because it is produced by our team and already meets a minimum quality bar (e.g. PR reviews for source code, video editing, etc.). Community discussions happen in Discord threads and GH Issues/PRs; these are definitely valuable to have in our knowledge base, but they certainly require some sort of curation process.

### Milestone 2B: Improving RAG

- Assess results by experimenting with chunking, retrievers, response synthesizers, agents, etc.
- Tag and label our data

### Milestone 3A: VS Code Extension

- IDE Copilot: similar to the GPT Pilot mentioned above
- App Creator features: the App Creator CLI could interact with the TBDGPT endpoints to scaffold an app

### Milestone 3B: GitHub Bot

- CI docs generator
- CI PR reviewer
- Triage/Responder: helps triage user requests

## References

- [LlamaIndex RAG Concepts](https://docs.llamaindex.ai/en/stable/getting_started/concepts.html)
- [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401)
- [ReAct: Synergizing Reasoning and Acting in Language Models](https://arxiv.org/abs/2210.03629)
- [AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation](https://arxiv.org/abs/2308.08155)