# Nitro model

This is an experimental model for a POC: given only text-based documentation, we can convert it into trainable data and then finetune a model to understand that documentation.

## Method

We chose SFT for this task because it is simple to approach; DPO will be applied later. For SFT, we need a dataset containing as many QA pairs as possible for the model to learn from.

Example

![image](https://hackmd.io/_uploads/HJObKWhPp.png)

### 1. Data generation

The raw data is the [Nitro repo documentation](https://github.com/janhq/nitro/tree/main/docs/docs), which consists of `.md` files.

Example

![image](https://hackmd.io/_uploads/S1eEdbhPp.png)

So what is our strategy? The idea is similar to RAG, where we feed the model context, but here the context is used to generate QA pairs. We use [Langchain](https://www.langchain.com/) helper functions to split the text into chunks.

Example code

```python
import os

from langchain.text_splitter import TokenTextSplitter, MarkdownHeaderTextSplitter

# Define the root directory
root_dir = "/nitro/docs/docs"

# Define headers to split on for markdown documents
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

# Initialize markdown header text splitter
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# Initialize the text splitter for chunking
chunk_size = 300
chunk_overlap = 30
text_splitter = TokenTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

# Function to process and split a single markdown file
def process_markdown_file(file_path):
    with open(file_path, 'r') as file:
        markdown_document = file.read()
    md_header_splits = markdown_splitter.split_text(markdown_document)
    # Split each header section into chunks
    return [chunk for split in md_header_splits for chunk in text_splitter.split_documents([split])]

# List to hold all chunks from all files
all_chunks = []

# Traverse the directory and process each markdown file
for subdir, dirs, files in os.walk(root_dir):
    for file in files:
        if file.endswith('.md'):
            file_path = os.path.join(subdir, file)
            file_chunks = process_markdown_file(file_path)
            all_chunks.extend(file_chunks)

# all_chunks now contains all the chunks from all markdown files
print(f"Total chunks: {len(all_chunks)}")

# Print one chunk to check
all_chunks[2]
```

Output

```
Total chunks: 127
Document(page_content='To send a single query to your chosen LLM, follow these steps: \n<div class="code-snippet-left"> \n```bash title="Nitro"\ncurl http://localhost:3928/v1/chat/completions \\\n-H "Content-Type: application/json" \\\n-d \'{\n"model": "",\n"messages": [\n{\n"role": "user",\n"content": "Hello"\n},\n]\n}\'\n\n``` \n</div> \n<div class="code-snippet-right"> \n```bash title="OpenAI"\ncurl https://api.openai.com/v1/chat/completions \\\n-H "Content-Type: application/json" \\\n-H "Authorization: Bearer $OPENAI_API_KEY" \\\n-d \'{\n"model": "gpt-3.5-turbo",\n"messages": [\n{\n"role": "user",\n"content": "Hello"\n}\n]\n}\'\n``` \n</div> \nThis command sends a request to your local LLM, querying about the winner of the 2020 World Series.', metadata={'Header 3': 'Single Request Example'})
```

For the Nitro docs, we split the text using `tiktoken`: each chunk has 300 tokens with a 30-token overlap so that context is not lost between chunks.
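To sanity-check the chunking, you can count tokens per chunk directly with `tiktoken`. This is a minimal sketch, assuming `tiktoken` is installed; the `cl100k_base` encoding is chosen here only for illustration (the splitter above uses its own default encoding).

```python
import tiktoken

# Assumption: cl100k_base is used here only to get a rough token count per chunk;
# TokenTextSplitter has its own default encoding.
enc = tiktoken.get_encoding("cl100k_base")

token_counts = [len(enc.encode(chunk.page_content)) for chunk in all_chunks]
print(f"min: {min(token_counts)}, "
      f"max: {max(token_counts)}, "
      f"mean: {sum(token_counts) / len(token_counts):.1f}")
```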
In this example, we will use GPT-4 as the data generator; here is the code.

**Note:** I'm using openai version `0.28`. You will need to modify some of this code for newer versions, since OpenAI has changed its API.

```python
from time import sleep

import openai

def open_file(filepath):
    with open(filepath, 'r', encoding='utf-8') as infile:
        return infile.read()

openai.api_key = open_file('/code/rex/openai_api.txt')

def chatgpt4_completion(messages, temp=0.5, model="gpt-4", max_tokens=2048):
    try:
        response = openai.ChatCompletion.create(
            model=model,
            messages=messages,
            temperature=temp,
            max_tokens=max_tokens)
        return response['choices'][0]['message']['content']
    except (openai.error.APIError,
            openai.error.APIConnectionError,
            openai.error.RateLimitError,
            openai.error.Timeout):
        # Back off and retry on transient API errors
        sleep(30)
        return chatgpt4_completion(messages, temp=temp, model=model, max_tokens=max_tokens)
```

For the system prompt, I want the model to generate 10 QA pairs per request (to make the most of the OpenAI rate limit). Remember to ask it to output in JSON format, which helps a lot when parsing the content later.

System prompt

```
You are a curious assistant. Your task is to make 10 pairs of questions and answers using the given context delimited by triple quotation marks. You are extremely critical and can ask questions at different difficulty levels. You will be more generic and unique and focus on the Nitro library (the given context is from Nitro library). And you can also answer with the code block also. Your `answer` must be detailed, comprehensive and step by step guide. Let's think step by step. It's really important to my project. Strictly follow the JSON format for output with 1 field `qa_pairs` and `question`, `answer`.
```
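Before generating over all chunks, it is worth running a quick smoke test on a single chunk to see whether the model actually returns valid JSON. This is a minimal sketch that assumes the system prompt above is saved at `/system.txt` (the same path used in the generation loop below) and reuses `open_file`, `chatgpt4_completion`, and `all_chunks` from the snippets above.

```python
import json

# Send one chunk through the pipeline and check the response parses as JSON.
sample_conversation = [
    {'role': 'system', 'content': open_file('/system.txt')},
    {'role': 'user', 'content': str(all_chunks[2])},
]
raw_response = chatgpt4_completion(sample_conversation)

try:
    parsed = json.loads(raw_response)
    print(f"Got {len(parsed['qa_pairs'])} QA pairs")
    print(parsed['qa_pairs'][0])
except json.JSONDecodeError:
    # GPT-4 occasionally wraps the JSON in extra text; the helper functions
    # below handle that case with a manual fallback parser.
    print("Response was not valid JSON:")
    print(raw_response[:500])
```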
For parsing the content, use these helper functions:

```python
import json

# Fallback parser for responses that are not valid JSON
def extract_qa_pairs(text):
    """Extracts question-answer pairs from the given text."""
    qa_pairs = []
    current_pair = {}
    for line in text.split('\n'):
        stripped = line.strip()
        if stripped.startswith('"question":'):
            current_pair['question'] = stripped.split('"question":')[1].strip(' ",')
        elif stripped.startswith('"answer":'):
            current_pair['answer'] = stripped.split('"answer":')[1].strip(' ",')
            qa_pairs.append(current_pair)
            current_pair = {}
    return qa_pairs

# Function to extract QA pairs from the model response
def extract_qa_pairs_from_response(response):
    """Extracts question-answer pairs from the given response."""
    try:
        # Try to parse the response as JSON
        parsed_data = json.loads(response)
        return parsed_data['qa_pairs']
    except json.JSONDecodeError:
        # If JSON parsing fails, fall back to manual parsing
        return extract_qa_pairs(response)
```

Everything is ready, so let's generate the dataset:

```python
import pandas as pd

csv_file_path = '/code/rex/qa_pairs.csv'
master_df = pd.DataFrame(columns=['question', 'answer', 'raw'])  # Initialize master DataFrame

# Loop through the conversation chunks (three passes over the data)
conversation_count = 0
for _ in range(3):
    for chunk in all_chunks:
        conversation = [{'role': 'system', 'content': open_file('/system.txt')},
                        {'role': 'user', 'content': str(chunk)}]
        response_verification = chatgpt4_completion(conversation)

        # Extract QA pairs from the response
        qa_pairs = extract_qa_pairs_from_response(response_verification)

        # Create a DataFrame for these QA pairs
        qa_df = pd.DataFrame(qa_pairs)

        # Duplicate the chunk text to match the number of QA pairs
        qa_df['raw'] = [chunk] * len(qa_df)

        # Append to the master DataFrame
        master_df = pd.concat([master_df, qa_df], ignore_index=True)
        print(master_df.shape)
        conversation_count += 1

# Write the master DataFrame to CSV
master_df.to_csv(csv_file_path, index=False, encoding='utf-8')

# Output the total number of conversations processed and the CSV file path
conversation_count, csv_file_path
```

After that, we do a little bit of post-processing and then push the dataset to the Hub:

```python
import pandas as pd
from datasets import Dataset

# Dedup
df = pd.read_csv("/qa_pairs_nitro.csv")

# Number of rows before deduplication
rows_before = df.shape[0]

# Deduplicate the DataFrame
df_deduplicated = df.drop_duplicates()

# Number of rows after deduplication
rows_after = df_deduplicated.shape[0]

# Function to create the 'message' column
def create_message(row):
    return [
        {"content": row['question'], "role": "user"},
        {"content": row['answer'], "role": "assistant"}
    ]

# Applying the function to each row
df_deduplicated['message'] = df_deduplicated.apply(create_message, axis=1)

# Collect the chat-formatted messages
messages = df_deduplicated['message'].tolist()

# Convert the list to a Hugging Face Dataset
hf_dataset = Dataset.from_dict({'messages': messages})

# Split into train/test
split_dataset = hf_dataset.train_test_split(test_size=0.1)

# Push the dataset to your repository
repo_name = "jan-hq/nitro_binarized_v2"

# You should login via the CLI before running this script
split_dataset.push_to_hub(repo_name)
```
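To double-check the upload, you can load the dataset back from the Hub. A minimal sketch, assuming the push above succeeded and you are logged in:

```python
from datasets import load_dataset

# Load the pushed dataset back from the Hub and inspect one example
ds = load_dataset("jan-hq/nitro_binarized_v2")
print(ds)
print(ds["train"][0]["messages"])
```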
## Results

1. With the method above, we created ~3800 samples from the Nitro docs.

   Dataset link: https://huggingface.co/datasets/jan-hq/nitro_binarized_v2

2. For the finetuning process, we used the [`alignment handbook`](https://github.com/huggingface/alignment-handbook) with some modifications to the code.

   See the dashboard here: https://wandb.ai/janhq/nitro?workspace=user-jan-ai

**Table of results**

`r` and `alpha` are the configuration for LoRA training.

| Model | r | alpha | Loss | Time |
|-------|---|-------|------|------|
| Nitro V1 E1 | 16 | 32 | 1.185 | 3m |
| Nitro V1 E3 | 16 | 32 | 0.853 | 10m |
| Nitro V1.2 E3 Qlora | 256 | 512 | 0.3123 | 18m |

**Note:** the bigger `r` and `alpha` are, the more parameters of the model we train.

The training chart from the Nitro V1.2 E3 Qlora run:

![image](https://hackmd.io/_uploads/Sy02Yz2w6.png)
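To make the note on `r` and `alpha` concrete, here is a minimal sketch of how they map onto a PEFT `LoraConfig`. The `target_modules` and `lora_dropout` values are illustrative assumptions, not the exact alignment-handbook recipe used for the runs above.

```python
from peft import LoraConfig

# Illustrative LoRA configuration matching the Nitro V1.2 E3 Qlora row above.
# target_modules and lora_dropout are assumptions for this sketch, not the
# exact values used in the alignment-handbook runs.
peft_config = LoraConfig(
    r=256,            # rank of the low-rank update matrices
    lora_alpha=512,   # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

A higher rank gives the adapter matrices more trainable parameters, which is why the r=256 run trains far more weights than the r=16 runs.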