# RAD: Retrieval and Domain-Specific Finetuning

## Introduction

We present a straightforward approach to customizing small, open-source models using fine-tuning and Retrieval-Augmented Generation (RAG) that outperforms GPT-3.5 on specialized use cases. With it, we achieved superior Q&A results on the [technical documentation](https://nitro.jan.ai/docs) of a small [codebase](https://github.com/janhq/nitro).

In short, (1) extending a general foundation large language model (LLM) like [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.1) with strong math and coding capabilities, (2) training it on a high-quality, synthetic dataset generated from the intended corpus, and (3) adding RAG capabilities can lead to significant accuracy improvements.

## The Standard RAG Approach

RAG represents a significant advancement in enabling LLMs to access and incorporate proprietary, user-specific information in their responses. Yet, when basic RAG systems rely on OpenAI's API as their foundation, they face a critical risk: the potential exposure of sensitive internal data to external parties.

The performance limitations of foundational local LLMs, such as Llama-2 7B or Mistral 7B, are another significant concern for enterprises. A single inaccurate response can severely damage an enterprise's reputation. This scenario underscores the delicate balance between leveraging cutting-edge AI technologies and ensuring the confidentiality and integrity of proprietary information.

## Understanding RAD

At Jan, we acknowledge the challenges faced by current models and propose that a deeper understanding of specific domains could significantly enhance performance. Our hypothesis is that an LLM fine-tuned on domain-specific knowledge will yield better results, especially when integrated with RAG. Our approach, **RAD** (Retrieval and Domain-Specific Finetuning), aims to leverage the synergy between specialized expertise and advanced information retrieval techniques to improve the accuracy and relevance of model responses.

### Stating the problem

We started with our own use case. Jan is an open-source project enjoying strong growth, but at one point we began receiving a new support ticket every minute, which quickly overwhelmed our bootstrapped resources. So, we directed our efforts toward training a model to answer user questions based on our existing technical documentation.

### Using our own technical documentation

Specifically, we trained Stealth 7B, a finetuned version of Mistral 7B with better reasoning capability, on the [Nitro documentation](https://nitro.jan.ai/docs).

For context, Nitro is the default inference engine for Jan. It's an enterprise-ready server implementation of LlamaCPP, written in C++, with multimodal support, queues, and other production-level server capabilities. It made an interesting corpus because it was rife with post-2023 technical jargon, edge cases, and poor informational layout.

### Generating training data

The first step was to transform Nitro's unstructured documentation into a synthetic Q&A dataset designed for [instruction tuning](https://arxiv.org/pdf/2109.01652.pdf). The text was split into 300-token chunks with 30-token overlaps. This helped to avoid the [lost-in-the-middle](https://arxiv.org/abs/2307.03172) problem, where an LLM cannot use its context efficiently to answer a given question. The chunks were then given to GPT-4 (8k context length) to generate 3,800 Q&A pairs.
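For readers who want to reproduce this step, the pipeline can be sketched roughly as follows. This is a minimal illustration, not the exact script we used: the `tiktoken` tokenizer, the OpenAI Python client, the prompt wording, the number of pairs per chunk, and the `nitro_docs.md` input file are all assumptions, and real output parsing should be more defensive.

```python
# Minimal sketch of the synthetic-dataset pipeline described above.
# Assumptions (not from the original post): tiktoken for token counting,
# the OpenAI Python client (>=1.0) for GPT-4 calls, and a local
# `nitro_docs.md` file holding the flattened documentation.
import json

import tiktoken
from openai import OpenAI

CHUNK_TOKENS = 300   # chunk size used in the post
OVERLAP_TOKENS = 30  # overlap used in the post


def chunk_text(text: str, chunk_tokens: int, overlap_tokens: int) -> list[str]:
    """Split `text` into fixed-size token windows with overlap."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks, start = [], 0
    while start < len(tokens):
        window = tokens[start : start + chunk_tokens]
        chunks.append(enc.decode(window))
        start += chunk_tokens - overlap_tokens
    return chunks


def generate_qa_pairs(chunk: str, client: OpenAI, n_pairs: int = 3) -> list[dict]:
    """Ask GPT-4 to write Q&A pairs grounded in a single documentation chunk."""
    prompt = (
        f"Write {n_pairs} question-answer pairs that can be answered solely "
        f"from the documentation below. Return a JSON list of objects with "
        f"'question' and 'answer' keys.\n\n---\n{chunk}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    # A production script should validate / retry if the model returns
    # anything other than clean JSON.
    return json.loads(response.choices[0].message.content)


if __name__ == "__main__":
    client = OpenAI()  # expects OPENAI_API_KEY in the environment
    docs = open("nitro_docs.md").read()  # hypothetical flattened docs file
    dataset = []
    for chunk in chunk_text(docs, CHUNK_TOKENS, OVERLAP_TOKENS):
        dataset.extend(generate_qa_pairs(chunk, client))
    # Write one Q&A pair per line, ready to be converted into an
    # instruction-tuning format.
    with open("nitro_qa_pairs.jsonl", "w") as f:
        for pair in dataset:
            f.write(json.dumps(pair) + "\n")
```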
The resulting [training dataset](https://huggingface.co/datasets/jan-hq/nitro_binarized_v2) is available on Hugging Face.

### Domain-Specific Fine-tuning

Training was done with supervised finetuning (SFT) from [Hugging Face's alignment-handbook](https://github.com/huggingface/alignment-handbook), following the [Zephyr 7B Beta](https://github.com/huggingface/alignment-handbook/tree/main/recipes/zephyr-7b-beta) recipe. We used consumer-grade, dual Nvidia RTX 4090s for the training, and the end-to-end run took 18 minutes. We found the optimal LoRA hyperparameters for this task to be `r = 256` and `alpha = 512`. The final model is available [on Hugging Face](https://huggingface.co/jan-hq/nitro-v1.2-e3).

![image](https://hackmd.io/_uploads/SJyDTVk6p.png)

**Figure 1.** The model shows its understanding of the Nitro documentation.

## Improving results with RAG

As an additional step, we also added [Retrieval-Augmented Generation (RAG)](https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/) as an experiment parameter. A simple RAG setup was built with [LlamaIndex](https://www.llamaindex.ai/) and the [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) embedding model for efficient documentation retrieval and question answering. You can find the RAG implementation [here](https://github.com/janhq/open-foundry/blob/main/rag-is-not-enough/rag/nitro_rag.ipynb).

## Benchmarking the Results

We curated a new set of [50 multiple-choice questions](https://github.com/janhq/open-foundry/blob/main/rag-is-not-enough/rag/mcq_nitro.csv) (MCQ) based on the Nitro docs. The questions had varying levels of difficulty and included trick components that challenged the model's ability to discern misleading information.

![image](https://hackmd.io/_uploads/By9vaE1Ta.png)

**Figure 2.** Comparison between the finetuned model and OpenAI's GPT models.

**Results**

- GPT-3.5 with RAG: 56.7%
- GPT-4 with RAG: 64.3%
- Base 7B model ([Stealth 7B](https://huggingface.co/jan-hq/stealth-v1.3)) with RAG: 47.7%
- Finetuned 7B model (Nitro 7B) with RAG: 57.8%

This indicates that, with task-specific training, we can raise an open-source small language model to the level of GPT-3.5 on domain knowledge. Notably, the finetuned + RAG approach also demonstrated more consistency, as indicated by its lower standard deviation.

## Conclusion

We conclude that this combination of model merging + finetuning + RAG shows promise. The finding is relevant for teams and individuals that need specialized, technical small language models to run in resource-constrained or highly secured environments, where GPT may not be an option.

Anecdotally, we've had some success using this model in practice to onboard new team members to the Nitro codebase.

A full research report with more statistics can be found [here](https://github.com/janhq/open-foundry/blob/main/rag-is-not-enough/README.md).

## References

- [Catastrophic forgetting](https://arxiv.org/abs/2308.08747)
- [Math specialization](https://arxiv.org/abs/2308.09583)
- [Code specialization](https://arxiv.org/abs/2306.08568)
- [Search specialization](https://github.com/SciPhi-AI/agent-search)
- [Evol Instruct](https://github.com/nlpxucan/WizardLM)
- [Lost in the middle](https://arxiv.org/abs/2307.03172)
- [Instruction tuning](https://arxiv.org/pdf/2109.01652.pdf)