# Nitro model
This is an experimental model for a POC showing that, starting from text-based documentation only, we can convert it into trainable data and finetune a model to understand that documentation.
## Method
We chose SFT for this task because it is easy to approach; DPO will be applied later.
For SFT, we need a dataset containing as many QA pairs as possible for the model to learn from.
Example

### 1. Data generation
The raw data is the [Nitro repo documentation](https://github.com/janhq/nitro/tree/main/docs/docs).
It consists of `.md` files. Example:

So what is our strategy?
The idea is similar to RAG in that we feed the model context, but here the context is used to generate QA pairs.
We will use [Langchain](https://www.langchain.com/) helper functions to split the text into chunks.
Example code:
```python
from langchain.text_splitter import TokenTextSplitter, MarkdownHeaderTextSplitter
import os

# Define the root directory
root_dir = "/nitro/docs/docs"

# Define headers to split on for markdown documents
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

# Initialize markdown header text splitter
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# Initialize the text splitter for chunking
chunk_size = 300
chunk_overlap = 30
text_splitter = TokenTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap)

# Function to process and split a single markdown file
def process_markdown_file(file_path):
    with open(file_path, 'r') as file:
        markdown_document = file.read()
    md_header_splits = markdown_splitter.split_text(markdown_document)
    # Split each header section into chunks
    return [chunk for split in md_header_splits for chunk in text_splitter.split_documents([split])]

# List to hold all chunks from all files
all_chunks = []

# Traverse the directory and process each markdown file
for subdir, dirs, files in os.walk(root_dir):
    for file in files:
        if file.endswith('.md'):
            file_path = os.path.join(subdir, file)
            file_chunks = process_markdown_file(file_path)
            all_chunks.extend(file_chunks)

# all_chunks now contains all the chunks from all markdown files
print(f"Total chunks: {len(all_chunks)}")
# Print one chunk to check
all_chunks[2]
```
Output
```
Total chunks: 127
Document(page_content='To send a single query to your chosen LLM, follow these steps: \n<div class="code-snippet-left"> \n```bash title="Nitro"\ncurl http://localhost:3928/v1/chat/completions \\\n-H "Content-Type: application/json" \\\n-d \'{\n"model": "",\n"messages": [\n{\n"role": "user",\n"content": "Hello"\n},\n]\n}\'\n\n``` \n</div> \n<div class="code-snippet-right"> \n```bash title="OpenAI"\ncurl https://api.openai.com/v1/chat/completions \\\n-H "Content-Type: application/json" \\\n-H "Authorization: Bearer $OPENAI_API_KEY" \\\n-d \'{\n"model": "gpt-3.5-turbo",\n"messages": [\n{\n"role": "user",\n"content": "Hello"\n}\n]\n}\'\n``` \n</div> \nThis command sends a request to your local LLM, querying about the winner of the 2020 World Series.', metadata={'Header 3': 'Single Request Example'})
```
For the Nitro docs, we split with `tiktoken`-based tokenization so that each chunk has 300 tokens with a 30-token overlap, which helps avoid losing context between chunks.
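If you want to double-check the chunk sizes, a minimal sketch like the one below works; it assumes `TokenTextSplitter` is using its default `gpt2` tiktoken encoding (adjust if you passed a different `encoding_name`).

```python
# Sanity-check sketch (not from the original pipeline): count tokens per chunk.
# Assumes TokenTextSplitter's default "gpt2" tiktoken encoding.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
lengths = [len(enc.encode(chunk.page_content)) for chunk in all_chunks]
print(f"max tokens per chunk: {max(lengths)}, mean: {sum(lengths) / len(lengths):.1f}")
```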
In this example, we will use GPT-4 as the data generator; here is the code.
**Note:** I'm using openai version `0.28`; if you are on a newer version you will need to modify some of this code, since OpenAI has changed its API.
```python
import openai
from time import sleep

def open_file(filepath):
    with open(filepath, 'r', encoding='utf-8') as infile:
        return infile.read()

openai.api_key = open_file('/code/rex/openai_api.txt')

def chatgpt4_completion(messages, temp=0.5, model="gpt-4", max_tokens=2048):
    try:
        response = openai.ChatCompletion.create(model=model, messages=messages,
                                                temperature=temp, max_tokens=max_tokens)
        text = response['choices'][0]['message']['content']
        return text
    except (openai.error.APIError,
            openai.error.APIConnectionError,
            openai.error.RateLimitError,
            openai.error.Timeout):
        # Back off and retry on transient API errors
        sleep(30)
        return chatgpt4_completion(messages, temp=temp, model=model, max_tokens=max_tokens)
```
For the system prompt, I want it to generate 10 QA pairs per request (to make the most of the OpenAI rate limit). Remember to ask for output in JSON format, which will help a lot when parsing the content later.
System prompt:
```
You are a curious assistant. Your task is to make 10 pairs of questions and answers using the given context delimited by triple quotation marks. You are extremely critical and can ask questions at different difficulty levels. You will be more generic and unique and focus on the Nitro library (the given context is from Nitro library). And you can also answer with the code block also. Your `answer` must be detailed, comprehensive and step by step guide. Let's think step by step. It's really important to my project. Strictly follow the JSON format for output with 1 field `qa_pairs` and `question`, `answer`.
```
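For reference, a response that follows this prompt should look roughly like the following (the content is illustrative, not real GPT-4 output):
```
{
  "qa_pairs": [
    {
      "question": "How do I send a chat completion request to Nitro?",
      "answer": "Send a POST request to http://localhost:3928/v1/chat/completions with a JSON body containing `model` and `messages` ..."
    }
  ]
}
```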
To parse the content, use these helper functions.
```python
import json

# TEST FOR 1 SAMPLE
def extract_qa_pairs(text):
    """Extracts question-answer pairs from the given text."""
    qa_pairs = []
    current_pair = {}
    lines = text.split('\n')
    for line in lines:
        if line.startswith(' "question": '):
            current_pair['question'] = line.split('"question": ')[1].strip(' ",')
        elif line.startswith(' "answer": '):
            current_pair['answer'] = line.split('"answer": ')[1].strip(' ",')
            qa_pairs.append(current_pair)
            current_pair = {}
    return qa_pairs

# Function to extract QA pairs from the response_verification string
def extract_qa_pairs_from_response(response):
    """Extracts question-answer pairs from the given response."""
    try:
        # Try to parse the response as JSON
        parsed_data = json.loads(response)
        return parsed_data['qa_pairs']
    except json.JSONDecodeError:
        # If JSON parsing fails, fall back to manual line-by-line parsing
        return extract_qa_pairs(response)
```
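A quick sanity check (the sample string below is made up, not actual GPT-4 output):

```python
sample = '{"qa_pairs": [{"question": "What is Nitro?", "answer": "Nitro is a lightweight inference server."}]}'
print(extract_qa_pairs_from_response(sample))
# [{'question': 'What is Nitro?', 'answer': 'Nitro is a lightweight inference server.'}]
```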
Everything is ready, so let's generate the dataset.
```python
import pandas as pd

csv_file_path = '/code/rex/qa_pairs.csv'
master_df = pd.DataFrame(columns=['question', 'answer', 'raw'])  # Initialize master DataFrame

# Loop through the conversation chunks
conversation_count = 0
for _ in range(3):
    for chunk in all_chunks:
        conversation = [{'role': 'system', 'content': open_file('/system.txt')},
                        {'role': 'user', 'content': str(chunk)}]
        response_verification = chatgpt4_completion(conversation)
        # Extract QA pairs from the response
        qa_pairs = extract_qa_pairs_from_response(response_verification)
        # Create a DataFrame for these QA pairs
        qa_df = pd.DataFrame(qa_pairs)
        # Duplicate the chunk text to match the number of QA pairs
        qa_df['raw'] = [chunk] * len(qa_df)
        # Append to the master DataFrame
        master_df = pd.concat([master_df, qa_df], ignore_index=True)
        print(master_df.shape)
        conversation_count += 1

# Write the master DataFrame to CSV
master_df.to_csv(csv_file_path, index=False, encoding='utf-8')

# Output the total number of conversations processed and the CSV file path
conversation_count, csv_file_path
```
After that, we do a bit of post-processing and then push the dataset to the Hugging Face Hub.
```python
import pandas as pd
from datasets import Dataset

# Dedup
df = pd.read_csv("/qa_pairs_nitro.csv")

# Number of rows before deduplication
rows_before = df.shape[0]

# Deduplicate the DataFrame (copy to avoid SettingWithCopyWarning later)
df_deduplicated = df.drop_duplicates().copy()

# Number of rows after deduplication
rows_after = df_deduplicated.shape[0]

# Function to create the 'message' column in chat format
def create_message(row):
    return [
        {"content": row['question'], "role": "user"},
        {"content": row['answer'], "role": "assistant"}
    ]

# Applying the function to each row
df_deduplicated['message'] = df_deduplicated.apply(create_message, axis=1)

# Collect the chat-formatted messages
messages = df_deduplicated['message'].tolist()

# Convert the list to a Hugging Face Dataset
hf_dataset = Dataset.from_dict({'messages': messages})

# Split into train/test
split_dataset = hf_dataset.train_test_split(test_size=0.1)

# Push the dataset to your repository
# You should log in via the CLI (`huggingface-cli login`) before running this script
repo_name = "jan-hq/nitro_binarized_v2"
split_dataset.push_to_hub(repo_name)
```
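As a quick check (assuming the push succeeded and you are logged in to the Hub), you can load the dataset back:

```python
from datasets import load_dataset

ds = load_dataset("jan-hq/nitro_binarized_v2")
print(ds)                          # DatasetDict with "train" and "test" splits
print(ds["train"][0]["messages"])  # one user/assistant message pair
```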
## Results
1. With the method above, we created ~3,800 samples from the Nitro docs.
Dataset link: https://huggingface.co/datasets/jan-hq/nitro_binarized_v2
2. Finetuning process
We used the [`alignment-handbook`](https://github.com/huggingface/alignment-handbook) with some modifications to the code.
See the dashboard here: https://wandb.ai/janhq/nitro?workspace=user-jan-ai
**Table of results**
`r` and `alpha` are the LoRA training configuration.
| Model | r | alpha | Loss | Time |
|---|---|---|---|---|
| Nitro V1 E1 | 16 | 32 | 1.185 | 3m |
| Nitro V1 E3 | 16 | 32 | 0.853 | 10m |
| Nitro V1.2 E3 QLoRA | 256 | 512 | 0.3123 | 18m |
**Note:** the larger `r` and `alpha` are, the more parameters we train in the model.
The image below is from the Nitro V1.2 E3 QLoRA run:
