# How to fine-tune LLMs with documentation?
LLMs are more popular than ever, and people are using them for everyday tasks. With that usage comes a growing need for models that understand users' own data. There are two main approaches to meet this need: fine-tuning the model or Retrieval-Augmented Generation (RAG). In this blog we will investigate the first approach and see how it performs.
In detail, you will learn how to:
1. Create a dataset from unstructured data
2. Fine-tune a model on [documentation](https://nitro.jan.ai/)
3. Run the model using Jan
Before we get into the code, let's take a look at the technologies and methods we are going to use.
## What is LoRA?
[Low-Rank Adaptation](https://arxiv.org/abs/2106.09685) (LoRA) is a method that makes fine-tuning large language models more efficient. It freezes the model's original weights and represents the weight updates as the product of two much smaller, trainable low-rank matrices. Only these small matrices are updated during training, which drastically reduces the number of trainable parameters and makes tuning faster and far less memory-intensive. Essentially, LoRA lets you fine-tune LLMs with limited resources, with only a small trade-off in performance.
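To make this concrete, here is a minimal sketch of how LoRA adapters can be attached to a model with Hugging Face's `peft` library. The base model name and the hyperparameters below are illustrative, not the exact values used later in this post:
```
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative base model; any causal LM works the same way
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# Only the small rank-r matrices injected into the attention projections are
# trained; the original weights stay frozen
lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # which projections receive adapters
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```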
## What is Langchain?
[LangChain](https://www.langchain.com/) is an open-source framework designed to simplify the creation of applications powered by large language models (LLMs), such as chatbots and agents. As the name suggests, it lets developers chain language-model calls together so they can easily apply advanced prompting techniques and get the most out of the model. It also provides developers with a standardized interface and pre-built components, making advanced language understanding and generation more accessible. By abstracting the complexities of LLM integration, LangChain enables rapid development of intelligent, context-aware applications. Its collaborative ecosystem encourages innovation, leveraging the community's collective expertise to expand the possibilities of AI-driven solutions.
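For example, here is a small illustrative chain that fills a prompt template and sends it to a chat model (the prompt text and model choice are placeholders):
```
from langchain.prompts import PromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain

# A reusable prompt template: the {context} slot is filled in at run time
prompt = PromptTemplate(
    input_variables=["context"],
    template="Summarize the following documentation section:\n\n{context}",
)

# Chain the template to a chat model (assumes OPENAI_API_KEY is set in the environment)
llm = ChatOpenAI(model_name="gpt-4", temperature=0.5)
chain = LLMChain(llm=llm, prompt=prompt)

print(chain.run(context="Nitro is a lightweight inference server for local LLMs."))
```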
## What is Flash Attention?
[Flash Attention](https://github.com/Dao-AILab/flash-attention) is an algorithm that speeds up the core attention mechanism in Transformer language models by restructuring computations. It uses techniques like tiling and recomputation to reduce the high memory costs of attention, enabling models to process longer text sequences.
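In practice, frameworks expose this as a drop-in option. For example, recent versions of Hugging Face `transformers` can load a model with the FlashAttention-2 kernels, assuming the `flash-attn` package is installed and the GPU supports it (the model name below is only illustrative):
```
import torch
from transformers import AutoModelForCausalLM

# Ask transformers to use the FlashAttention-2 kernels for the attention layers;
# this requires the flash-attn package and fp16/bf16 weights
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```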
## How to fine-tune LLMs with documentation?
**1. Environment setup**
```
!pip install langchain tiktoken pandas datasets openai==0.28 llama-index chromadb sentence-transformers pydantic==1.10.11 llama-cpp-python --quiet
```
To access the Hugging Face Hub and use the OpenAI API, we need to log in to both accounts. We can do this by running the following.
Import the libraries:
```
from huggingface_hub import login
import openai
```
Set up the credential keys (read from local files):
```
# OPENAI_KEY_FILE and HUGGINGFACE_KEY_FILE are paths to local text files holding your keys;
# open_file (defined in the next step) simply reads a file's contents
openai.api_key = open_file(OPENAI_KEY_FILE)
login(token=open_file(HUGGINGFACE_KEY_FILE))
```
**2. Data generation**
In this example, we will use the [Nitro documentation](https://nitro.jan.ai/); for more information about Nitro, you can read the docs [here](https://nitro.jan.ai/docs).
The main idea is to split the documentation into small parts and then feed them into an LLM to generate QnA pairs.
We define some helper functions:
```
import json
import time

# Open and read a whole text file (used for the key files and the system prompt)
def open_file(filepath):
    with open(filepath, 'r', encoding='utf-8') as infile:
        return infile.read()

# Call the ChatGPT API, retrying on transient errors such as rate limits
def chatgpt_completion(messages, temp=0.5, model="gpt-4", max_tokens=4096):
    while True:
        try:
            response = openai.ChatCompletion.create(model=model, messages=messages,
                                                    temperature=temp, max_tokens=max_tokens)
            return response['choices'][0]['message']['content']
        except openai.error.OpenAIError:
            time.sleep(10)

# Split one Markdown file by headers, then into token-limited chunks
def process_markdown_file(file_path, markdown_splitter, text_splitter):
    with open(file_path, 'r') as file:
        markdown_document = file.read()
    md_header_splits = markdown_splitter.split_text(markdown_document)
    return [chunk for split in md_header_splits for chunk in text_splitter.split_documents([split])]

# Fallback parser: pull QA pairs out line by line when the response is not valid JSON
def extract_qa_pairs(text):
    qa_pairs = []
    current_pair = {}
    for line in text.split('\n'):
        line = line.strip()
        if line.startswith('"question": '):
            current_pair['question'] = line.split('"question": ')[1].strip(' ",')
        elif line.startswith('"answer": '):
            current_pair['answer'] = line.split('"answer": ')[1].strip(' ",')
            qa_pairs.append(current_pair)
            current_pair = {}
    return qa_pairs

# Prefer strict JSON parsing, fall back to the line-based parser above
def extract_qa_pairs_from_response(response):
    try:
        parsed_data = json.loads(response)
        return parsed_data['qa_pairs']
    except json.JSONDecodeError:
        return extract_qa_pairs(response)

# Turn a (question, answer) row into a chat-formatted `messages` column
def create_message(row):
    return [{"content": row['question'], "role": "user"},
            {"content": row['answer'], "role": "assistant"}]
```
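As a quick sanity check, here is how the parsing helper behaves on a well-formed response. The JSON below is a made-up example of the format the system prompt in the next step asks for:
```
sample_response = '''
{
  "qa_pairs": [
    {"question": "What is Nitro?", "answer": "Nitro is a lightweight inference server for running local LLMs."},
    {"question": "How do I start using Nitro?", "answer": "Install it with the install script for your platform, then start the server."}
  ]
}
'''

pairs = extract_qa_pairs_from_response(sample_response)
print(pairs[0]['question'])  # -> What is Nitro?
```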
We will use the [LangChain](https://www.langchain.com/) framework. Because most of our documentation is written in Markdown, we use the `MarkdownHeaderTextSplitter`, and to control how many tokens are fed into the LLM when generating QnA pairs we use the `TokenTextSplitter`.
```
from langchain.text_splitter import MarkdownHeaderTextSplitter, TokenTextSplitter

# Chunking settings: split on Markdown headers first, then cap each piece
# at 300 tokens with a 30-token overlap
HEADERS_TO_SPLIT_ON = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=HEADERS_TO_SPLIT_ON)
text_splitter = TokenTextSplitter(chunk_size=300, chunk_overlap=30)
```
In this example, we choose a chunk size of 300 tokens so that, when we generate QnA pairs, the prompt plus the generation stays within the model's context length (typically 4096 tokens). We also add an overlap so that each chunk carries some context from the previous one, preserving coherence between chunks.
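To illustrate what the two splitters produce, here is a small demo on a made-up Markdown snippet (the sample text is for demonstration only):
```
sample_md = """# Nitro
## Quickstart
Nitro is started from the command line and then serves a local API for chat completions.
## Installation
Install Nitro with the install script for your platform.
"""

header_splits = markdown_splitter.split_text(sample_md)
chunks = [c for s in header_splits for c in text_splitter.split_documents([s])]

for chunk in chunks:
    # Each chunk keeps the header it came from as metadata
    print(chunk.metadata, chunk.page_content[:60])
```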
We also need to set up the prompt the LLM will use to generate the data. Here we used GPT-4 to generate the QnA pairs, but a local LLM can also do this with a little tweaking of the system prompt.
```
You are a curious assistant. Your task is to make 10 pairs of questions and answers using the given context delimited by triple quotation marks. You are extremely critical and can ask questions at different difficulty levels. You will be more generic and unique and focus on the Nitro library (the given context is from Nitro library). And you can also answer with the code block also. Your `answer` must be detailed, comprehensive and step by step guide. Let's think step by step. It's really important to my project. Strictly follow the JSON format for output with 1 field `qa_pairs` and `question`, `answer`.
```
After setting everything up, let's generate the dataset:
```
import os
import pandas as pd

# ROOT_DIR points at the documentation folder, SYSTEM_FILE_PATH at the system prompt
# above, and CSV_FILE_PATH is where the generated dataset will be written
all_chunks = []
for subdir, dirs, files in os.walk(ROOT_DIR):
    for file in files:
        if file.endswith('.md'):
            file_path = os.path.join(subdir, file)
            file_chunks = process_markdown_file(file_path, markdown_splitter, text_splitter)
            all_chunks.extend(file_chunks)

# Run three passes over every chunk to collect more QnA pairs per chunk
master_df = pd.DataFrame(columns=['question', 'answer', 'raw'])
for _ in range(3):
    for chunk in all_chunks:
        conversation = [{'role': 'system', 'content': open_file(SYSTEM_FILE_PATH)},
                        {'role': 'user', 'content': str(chunk)}]
        response = chatgpt_completion(conversation)
        qa_pairs = extract_qa_pairs_from_response(response)
        qa_df = pd.DataFrame(qa_pairs)
        qa_df['raw'] = [chunk] * len(qa_df)
        master_df = pd.concat([master_df, qa_df], ignore_index=True)
master_df.to_csv(CSV_FILE_PATH, index=False, encoding='utf-8')

# Deduplicate and convert each row to chat-formatted messages
df = pd.read_csv(CSV_FILE_PATH)
df_deduplicated = df.drop_duplicates().copy()
df_deduplicated['messages'] = df_deduplicated.apply(create_message, axis=1)
```
The resulting Nitro QnA pairs [dataset](https://huggingface.co/datasets/jan-hq/nitro_binarized) is published on the Hugging Face Hub.
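It can be loaded directly with the `datasets` library:
```
from datasets import load_dataset

# Load the generated Nitro QnA dataset from the Hugging Face Hub
dataset = load_dataset("jan-hq/nitro_binarized")
print(dataset)
```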
We created a [data_gen.py](https://github.com/janhq/foundry/blob/main/experiments/stealth-on-nitro/data_gen.py) script that contains all the code above.
**3. Fine-tuning**
We use the [alignment-handbook](https://github.com/huggingface/alignment-handbook) from Hugging Face for the training code. It is a well-written library that explains in detail everything about fine-tuning models, and it provides implementations of cutting-edge techniques like LoRA/QLoRA and DeepSpeed ZeRO-3. For further information, please take a look at their [repository](https://github.com/huggingface/alignment-handbook).
For the training setup, we tried various LoRA settings to see how they affect model performance. We also share our [YAML file]() for training the model. Run the following command after installing the alignment-handbook:
```
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 scripts/run_sft.py recipes/nitro/sft/config_lora.yaml
```
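Because LoRA only trains adapter weights, you may want to merge them back into the base model before serving it. Here is a minimal sketch with `peft`; the model name and paths are placeholders, not the exact ones from our run:
```
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "mistralai/Mistral-7B-v0.1"   # placeholder base model
ADAPTER_DIR = "data/nitro-sft-lora"        # placeholder: output dir from run_sft.py
MERGED_DIR = "data/nitro-sft-merged"       # placeholder: where the merged model goes

# Load the frozen base model, apply the trained LoRA adapters, and fold them in
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
merged = PeftModel.from_pretrained(base, ADAPTER_DIR).merge_and_unload()

merged.save_pretrained(MERGED_DIR)
AutoTokenizer.from_pretrained(BASE_MODEL).save_pretrained(MERGED_DIR)
```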
**4. Test the model**
After training, we can use [Jan](https://jan.ai/) to test the model locally.

**5. Conclusions**
In this blog post, we learned how to fine-tune an open-source model using LoRA with Flash Attention on a local machine. We covered:
- Generating data by using LangChain to chunk the documentation and an LLM to create QnA pairs from unstructured data.
- Fine-tuning the model on the generated data using LoRA and Flash Attention.
- Testing the model using Jan.
Combining all of these steps, we can train a model on any documentation so that it understands the content.