Author: FLOCKAH <style> .body-md { background-color: #112131; color: #ffffff; padding: 10px; border: 2px solid #000; border-radius: 15px; } .title { color: #ffffff; font-size: 32px; text-align: center; font-family: 'Lucida Sans', 'Lucida Sans Unicode'; background-image: linear-gradient(45deg, #007bff, #0056b3, #003c7a, #002451); background-size: 200% 200%; animation: gradientShift 4s ease infinite; text-shadow: 0px 0px 10px rgba(0, 123, 200, 0.7); border-radius: 5px; box-shadow: 0 10px 20px -10px #333; padding: 5px 10px; display: block; margin: 10px auto; position: relative; transition: transform 0.3s ease-out; animation: blurShift 5s infinite; } .title::after { content: ''; position: absolute; bottom: 0; left: 7vh; display: block; width: 80%; height: 3px; background-color: #0056b3; border-radius: 5px; animation: shake 8s infinite; } @keyframes blurShift { 0% { filter: blur(1px); } 25% { filter: blur(0.5px); } 50% { filter: blur(0px); } 75% { filter: blur(0.5px); } 100% { filter: blur(1px); } } @keyframes gradientShift { 0% { background-position: 0% 50%; } 50% { background-position: 100% 50%; } 100% { background-position: 0% 50%; } } @keyframes loading { 0% { width: 0; } 50% { width: 100%; } 100% { width: 0; } } @keyframes shake { 0%, 100% { transform: translateX(0); } 10%, 30%, 50%, 70%, 90% { transform: translateX(-10px); } 20%, 40%, 60%, 80% { transform: translateX(10px); } } .shake { animation: shake 0.82s cubic-bezier(.36,.07,.19,.97) both; transform: translate3d(0, 0, 0); backface-visibility: hidden; perspective: 1000px; } .title:hover { transform: translateY(-5px); cursor: pointer; } .mini-title { color: #ace5ee; font-size: 24px; text-align: center; font-family: 'Lucida Sans', 'Lucida Sans Unicode'; background-image: linear-gradient(45deg, #0099cc, #0077a2, #005b7a, #003f5a); background-size: 200% 200%; animation: gradientShift 4s ease infinite; text-shadow: 0px 0px 8px rgba(0, 156, 204, 0.7); border-radius: 5px; box-shadow: 0 8px 16px -8px #333; padding: 4px 10px; display: block; margin: 8px auto; position: relative; transition: transform 0.3s ease-out; } .mini-title::after { content: ''; position: absolute; bottom: 0; left: 0; width: 100%; height: 2px; background-color: #0077a2; border-radius: 5px; animation: loading 3s infinite; } .mini-title:hover { transform: translateY(-3px); cursor: pointer; } .t-blue { color: #31aaf9; font-size: 18px; font-family: Lucida Console; text-align: center; padding: 2px; background-color: rgba(10,10,10,0.); display: block; } .tp-mint { color: #6fcccc; font-size: 17px; text-align: justify; padding: 20px; font-weight: bold; } .c-bleu { color: #36b1d1; font-size: 16px; text-align: justify; margin-left: 5px; } .tabu { padding: 20px; } .grid-table { border-collapse: collapse; width: 100%; } .grid-table th, .grid-table td { border: 1px solid #ddd; padding: 8px; text-align: left; color: #000; } .grid-table th { background-color: #4CAF50; color: white; } .grid-table td { background-color: #f2f2f2; } .grid-table tr:nth-child(even){background-color: #f2f2f2;} .grid-table tr:hover, .grid-table th:hover, .grid-table td:hover {background-color: #ddd;} .small-gray-text { display: block; color: gray; font-size: 0.8em; margin-top: 5px; text-decoration: underline; margin-left: 10px; text-underline-offset: 4px; } .link-hf { color: #a5b533; } </style> <div class="body-md"> <div class="title">Your Own AI</div> <hr style="height: 2px; background-color: #116799; border: none;" /> <div class="t-blue">Introduction</div></br> <div class="tp-mint">1. 
How AI Generates Text from Text</div>
<div class="c-bleu">AI models generate text by predicting the probability of the next token in a sequence, given the input. They rely on statistical patterns learned from vast datasets to produce coherent, contextually relevant text.</div>
<div class="tp-mint">2. Tokens and Tokenization</div>
<div class="c-bleu">Tokens are the building blocks of text in NLP. Tokenization is the process of converting text into tokens, which helps the model capture the context and meaning of the text. The model's tokenizer splits text according to the vocabulary the model was trained with, while dataset tokenization structures the data for training purposes.</div>
<div class="tp-mint">3. Choosing a Model</div>
<div class="c-bleu">When selecting a model for AI development, consider the "weights", which are akin to the AI's knowledge base derived from training data. These weights, together with the model's architecture, determine the AI's capabilities. Larger models typically have more weights and therefore a greater capacity for complex tasks, but time has shown that architecture plays at least as important a role: many small models outperform bigger ones.</div></br>

**Here's a quick cheatsheet for model sizes and their resource requirements:**
<span class="small-gray-text">Assuming we only load the weights of the model:</span>

<table class="grid-table">
<tr>
<th>Model Size</th>
<th>Full Precision (32-bit)</th>
<th>Half Precision (16-bit)</th>
<th>8-bit Precision</th>
<th>4-bit Precision</th>
</tr>
<tr>
<td>13B model</td>
<td>~52 GB VRAM and 60 GB RAM</td>
<td>~26 GB VRAM</td>
<td>~13 GB VRAM</td>
<td>~6.5 GB VRAM, 30 GB RAM</td>
</tr>
<tr>
<td>7B model</td>
<td>~28 GB VRAM and 30 GB RAM</td>
<td>~14 GB VRAM</td>
<td>~7 GB VRAM</td>
<td>~3.5 GB VRAM, 15 GB RAM</td>
</tr>
<tr>
<td>3B model</td>
<td>~12 GB VRAM and 12 GB RAM</td>
<td>~6 GB VRAM</td>
<td>~3 GB VRAM</td>
<td>~1.5 GB VRAM, 6 GB RAM</td>
</tr>
</table>

<div class="c-bleu">Larger models consume far more GPU memory and tend to be more accurate, while smaller ones may give more varied results at a fraction of the cost. Choosing the right model means balancing model size, precision, architecture, and the quality of the training data.</br></div></br>
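<div class="c-bleu">As a rough rule of thumb behind the table above, the memory needed just to hold the weights is the parameter count times the bytes per parameter (4 for 32-bit, 2 for 16-bit, 1 for 8-bit, 0.5 for 4-bit). Here is a minimal sketch of that arithmetic; real usage is higher once activations, context length, and (for training) optimizer states are added:</div>

```python
def weight_memory_gb(params_billion: float, bits: int) -> float:
    """Rough size of the weights alone, in GB: parameters x bytes per parameter."""
    return params_billion * (bits / 8)

for bits in (32, 16, 8, 4):
    print(f"13B model @ {bits}-bit: ~{weight_memory_gb(13, bits):.1f} GB just for the weights")
# Activations, the KV cache, optimizer states (when training) and longer
# context lengths all come on top of this number.
```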
<div class="c-bleu">Best to start from here: <a class="link-hf" href="https://huggingface.co">Hugging Face</a> <span>🤗</span> </div>

<hr style="height: 2px; background-color: #116799; border: none;" />

<div class="t-blue">Your Own AI: Preparation</div>
<div class="tp-mint">1. Environment</br></div>
<div class="c-bleu">AI development often relies on cloud computing for access to high-performance GPUs.</div>
<div class="tabu"></br>

**Here are widely-used cloud services for deep learning:**

- Microsoft Azure
- Amazon AWS SageMaker
- Google Colaboratory
- RunPod
- Paperspace

</div>
<div class="c-bleu">Remember to select cloud instances with more GPU resources than the bare minimum, to accommodate dataset size and context length, both of which can significantly increase GPU usage.</div>
<div class="tp-mint">2. Architecture</div>
<div class="c-bleu">The leading training and quantization approaches as of 2023 include:

- LongLoRA - a long-context (100k+ tokens!) training method, helpful when we train the model for summarizing and writing tasks
- LoRA (Low-Rank Adaptation) - focuses on training small adapter matrices attached to targeted modules
- QLoRA - combines LoRA with a quantized, frozen base model for much lower GPU usage
- GPTQ - quantizes the model weights without any further training or tuning; it earns its place on this list because it is extremely helpful when we want to load a bigger model for inference right away, or when we choose to quantize ourselves instead of relying on QLoRA
- GGUF - enables running large models on CPU through quantization
</div>
<div class="tp-mint">3. Dataset</div>
<div class="c-bleu">Datasets are a crucial part of AI training; they should be well-structured and focused. They typically consist of entries with input and output fields:</div></br>

```python=
[
  {
    "input": "Need a quick Python script for checking if a word is a palindrome",
    "output": "Sure... {code}"
  }
]
# The field names themselves do not matter; they could be "sol_1" and "gol_2"
# There can also be several: "input", "instruction", "output", "context", "text"
# The key is to recognize which field is the 'User' (question) and which is the 'AI' (answer)
```

<div class="c-bleu">The dataset should be comprehensive for the intended use.</div></br>

**Including diverse topics requires ample data to ensure effective training across all areas**</br>

<div class="c-bleu">The most common question when looking for a dataset is `Does the data fit the task?`, and there is no universal answer, except that you can check it yourself - by running a short training.</div></br>

```python
# A short "dry run" of your training script
training_args = TrainingArguments(
    output_dir = ...,
    logging_dir = ...,
    evaluation_strategy = "steps",
    do_eval = True,
    eval_steps = 10,        # Run a validation check every 10 steps
    save_steps = 100,       # Checkpointing is not the point of this dry run
    max_steps = 100,        # Stop after 100 optimization steps
    logging_steps = 10,
    # num_train_epochs=12,  # Ignored while max_steps is set, keep it commented out
    # other arguments
)
```

<div class="c-bleu">This isn't a foolproof method, but it's generally a good idea to run a short training and see how the model performs based on the validation logs (or additional metrics) that come out.</br></br>
Data preparation is a large topic, which is why we will come back to it later...
</div>
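<div class="c-bleu">As a small preview of that preparation work, here is a minimal sketch of loading a local JSON dataset like the example above and merging its fields into a single `text` column with the `datasets` library. The file name, field names, and prompt format are placeholders - adjust them to your own data:</div>

```python
from datasets import load_dataset

# Hypothetical local file containing entries like {"input": ..., "output": ...}
raw = load_dataset("json", data_files="my_dataset.json", split="train")

# Merge the fields the model should learn from into one 'text' column
raw = raw.map(
    lambda ex: {"text": f"### User: {ex['input']}\n### Assistant: {ex['output']}"}
)

print(raw[0]["text"])  # Inspect one formatted training example
```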
<hr style="height: 2px; background-color: #116799; border: none;" />

<div class="t-blue">Your Own AI: Training</div>
<div class="tp-mint">1. Cloud Backup</div>
<div class="c-bleu">Backing up is essential, yet often overlooked. The rationale is simple - data in the cloud isn't permanent, and files are typically deleted once you disconnect from the instance, whether you intended that or not.</br></br>
Many providers offer cloud storage for backups, but some don't, or backups may not be included with your plan. Make sure you can transfer the training files to your preferred cloud hosting, ideally at any point during training.</div>
<div class="tp-mint">2. Training Script</div>
<div class="c-bleu">By now, you should have selected the model, dataset, training architecture, cloud service, and the task type for your model.</br></br>
Let's dive into programming without further ado:</div>
</br>

```bash
# Install the necessary modules
# Terminal:
pip install -q -U bitsandbytes auto-gptq sentencepiece transformers peft accelerate datasets trl einops GPUtil huggingface_hub tensorboard scipy protobuf
# Notebook:
!pip install -q -U bitsandbytes auto-gptq sentencepiece transformers peft accelerate datasets trl einops GPUtil huggingface_hub tensorboard scipy protobuf
```

If any of the modules cannot be imported using `import module_name`, remove `-q` from the `pip install` command above to disable quiet mode and see the full installation log.

```python=
# Import necessary modules
import GPUtil
import os
import time
import torch
import threading
import sys
from transformers import (
    AutoModelForCausalLM,   # If the model is not a Llama variant
    LlamaForCausalLM,       # Only if Llama
    AutoTokenizer,          # Downloads the model's tokenizer for training
    BitsAndBytesConfig,     # Quantization
    TrainingArguments,
    Trainer,                # If not quantized
    LlamaTokenizer,         # Only if Llama
)

# Name of the model to train, Llama-2 as the example
model_name = "meta-llama/Llama-2-13b-hf"

# In this example we will quantize the model to 4-bit:
compute_dtype = getattr(torch, "float16")
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # For 4-bit quantization
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
    # "This flag is used for nested quantization where the quantization constants
    #  from the first quantization are quantized again"
    # source: https://huggingface.co/docs/transformers/main_classes/quantization
)

# Now load the tokenizer
tokenizer = LlamaTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'right'  # 'left' may cause the model to generate nonsense output as of 03.11.2023

# Finally, download and load the model:
max_retries = 3   # Number of retries before giving up
retry_delay = 5   # Delay between retries in seconds

# Retry the model download on failure
for i in range(max_retries):
    try:
        # If you don't train Llama, use AutoModelForCausalLM instead
        model = LlamaForCausalLM.from_pretrained(
            model_name,
            quantization_config=bnb_config,
            device_map={"": 0},        # Place the whole model on cuda:0 for single-GPU training
            trust_remote_code=False,   # Only set True if the repository ships custom modeling code you trust
            # use_flash_attention_2=True,  # Optional on A100 and newer (Ampere) GPUs; can noticeably reduce GPU consumption, requires flash-attn
        )
        model.config.pretraining_tp = 2  # Slower but more exact linear-layer computation, matching Llama-2's tensor-parallel pretraining
        break  # Exit the loop once the model is successfully downloaded
    except Exception as e:
        print(f"An exception occurred: {e}")
        if i < max_retries - 1:
            print(f"Retrying in {retry_delay} seconds...")
            time.sleep(retry_delay)
        else:
            print("Max retries reached. Exiting.")
            raise  # Re-raise the last exception to exit the script

model.config.pad_token_id = tokenizer.eos_token_id  # Set the padding token id to the tokenizer's end-of-sequence token id
model.config.use_cache = False                      # Set True for inference
```
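<div class="c-bleu">Before moving on, it can be worth confirming how much memory the quantized weights actually occupy and where they ended up; a quick sanity-check sketch:</div>

```python
# Rough footprint of the loaded (quantized) weights, in GB
print(f"Weights footprint: {model.get_memory_footprint() / 1024**3:.2f} GB")

# Which device each part of the model was placed on by `device_map`
print(f"Device placement: {model.hf_device_map}")
```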
<div class="c-bleu">Once the model has downloaded and loaded successfully (which may take some time depending on its size), we can proceed with further processing. We'll begin by freezing its weights, examining the internal modules, and assessing the model's trainable parameters.</div></br>

```python=
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

peft_config = LoraConfig(
    lora_alpha=64,
    lora_dropout=0.01,
    r=16,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
)

model = get_peft_model(model, peft_config)

print(model)  # Prints all the modules; look for the Linear4bit layers and choose which to target (more is not always better)
model.print_trainable_parameters()  # Shows how few parameters LoRA actually trains
```

``Where:``
- ``r, the dimension of the low-rank matrices``
- ``lora_alpha, the scaling factor for the low-rank matrices``
- ``lora_dropout, the dropout probability of the LoRA layers``

<div class="small-gray-text">
<a class="link-hf" href="https://huggingface.co/docs/peft/v0.6.0/en/quicktour#peftconfig">Hugging Face PEFT Documentation<span>🤗</span></a>
</div>
<div class="c-bleu">
</br>
Let's proceed with a simple GPU memory tracker that reports how much memory the model is consuming in real time:
</div></br>

```python=
def monitor_gpu_usage():
    while True:
        gpu = GPUtil.getGPUs()[0]
        util = gpu.memoryUtil * 100
        free = gpu.memoryFree
        used = gpu.memoryUsed
        total = gpu.memoryTotal
        sys.stdout.write('\033[K')  # Clear to the end of line
        sys.stdout.write(f"\rGPU RAM Free: {free:.0f}MB | Used: {used:.0f}MB | Util {util:3.0f}% | Total {total:.0f}MB")
        sys.stdout.flush()
        time.sleep(5)

# Start the GPU monitoring thread
gpu_thread = threading.Thread(target=monitor_gpu_usage)
gpu_thread.daemon = True
gpu_thread.start()
```

</br>
<div class="c-bleu">
We also want to take care of loading the dataset ourselves:

- Lazy loading means data is loaded (supplied to the trainer) on the fly, as needed, rather than all at once at the beginning. This is beneficial when dealing with large datasets that do not fit into memory.
- custom_data_collator is a function that takes a batch of samples from the dataset and collates them into a single batch. This is a common requirement for batch processing in PyTorch, as samples need to be gathered into tensors of the same size.

```python=
from datasets import load_dataset
from torch.utils.data import Dataset, random_split

MAX_LENGTH = 512

def custom_data_collator(batch):
    input_ids = [item[0] for item in batch]
    attn_masks = [item[1] for item in batch]
    return {
        'input_ids': torch.stack(input_ids),
        'attention_mask': torch.stack(attn_masks),
        'labels': torch.stack(input_ids)  # Causal LM: the labels are the inputs themselves
    }

class OnTheFly(Dataset):  # Lazy loading: tokenizes each sample only when it is requested
    def __init__(self, txt_list, tokenizer):
        self.txt_list = txt_list
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.txt_list)

    def __getitem__(self, idx):
        txt = self.txt_list[idx]
        encodings_dict = self.tokenizer(txt, truncation=True, max_length=MAX_LENGTH, padding="max_length")
        input_ids = torch.tensor(encodings_dict['input_ids'])
        attn_masks = torch.tensor(encodings_dict['attention_mask'])
        return input_ids, attn_masks
```

To load the dataset we will use the standard load_dataset function from the datasets module; it loads the dataset into memory, while OnTheFly assumes the texts are already there and only tokenizes them as the trainer asks for them.
```python=
training_data = load_dataset("flytech/llama-python-codes-30k", split='train')

# If you use a different dataset, remember to map the combined fields into one column,
# ['text'] in this example:
# .map(lambda example: {'text': example['instruction'] + ' ' + example['input'] + ' ' + example['output']})

texts = training_data['text']  # The raw training strings handed to OnTheFly

# Initialize the dataset
dataset = OnTheFly(texts, tokenizer)

# Change this ratio as needed
train_ratio = 0.95
train_dataset, val_dataset = random_split(
    dataset,
    [int(train_ratio * len(dataset)), len(dataset) - int(train_ratio * len(dataset))]
)
# random_split shuffles the dataset and produces a training and a validation split;
# feel free to skip this and supply the trainer with a different split or dataset.
```
</div>
<div class="tp-mint">3. Training </div></br>
<div class="c-bleu"><b>Here we will:</b></br></br>

1. Use TensorBoard to display our saved logs with visually appealing metrics.
2. Set a batch size of 16, which will consume approximately 24 GB of VRAM with this model.
3. Train for 4 epochs, and possibly more if the model is still learning (as indicated by a decreasing training loss).
4. Enable Hugging Face checkpoints!
5. Use the adamw_bnb_8bit optimizer.

```python=
from trl import SFTTrainer
from transformers.integrations import TensorBoardCallback

training_arguments = TrainingArguments(
    output_dir="/content/Ruckus-13b-24",
    logging_dir="/content/Ruckus-13b-24",
    evaluation_strategy="steps",
    do_eval=True,                    # Perform evaluation (disable for no evaluation at all)
    save_total_limit=14,
    per_device_train_batch_size=16,  # a
    gradient_accumulation_steps=1,   # b
    per_device_eval_batch_size=16,
    num_train_epochs=4,
    optim="adamw_bnb_8bit",
    save_steps=200,                  # Doesn't have to match the eval and logging steps, but it's recommended
    logging_steps=200,               # Same as eval steps
    learning_rate=2e-4,              # Extremely important: defines how fast the model learns, but can hurt quality if set poorly
    eval_steps=200,                  # Evaluate on the validation dataset every 200 optimization steps
    eval_accumulation_steps=2,
    fp16=False,                      # Use fp16 on GPUs without bfloat16 support; prefer bf16 when available
    bf16=True,                       # Set to True only on A100 GPUs and newer (Ampere)
    #max_grad_norm=1.0,              # Gradient clipping
    #weight_decay=0.01,              # Weight decay, a regularizer (similar in spirit to LoRA dropout)
    #warmup_ratio=0.1,               # Use along with a linear or cosine scheduler
    lr_scheduler_type="constant",
    save_safetensors=True,
    push_to_hub=True,
    hub_model_id="Your-AI-Name",
    hub_token="hf_herusygkj1823saddngb118",  # A fake (but similar-looking) token; create a Hugging Face account and go to Settings -> Access Tokens
    hub_strategy="checkpoint",       # Enable checkpoints on the Hub
    remove_unused_columns=True,
    dataloader_num_workers=4         # Utilize CPU cores
)
# The total batch size is always per_device_train_batch_size × gradient_accumulation_steps (a*b)

tensorboard_callback = TensorBoardCallback()  # <- Callback for pretty metrics

trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    dataset_text_field="text",
    peft_config=peft_config,
    max_seq_length=512,              # <- Max length of a dataset entry (in tokens); anything longer is truncated
    tokenizer=tokenizer,
    data_collator=custom_data_collator,
    packing=False,
    # callbacks=[tensorboard_callback]
)
```
</br>
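<div class="c-bleu">A quick bit of arithmetic helps sanity-check the step-based settings above (eval_steps, save_steps, logging_steps). This sketch assumes the ~30k-example dataset and the 95% split used earlier; swap in your own numbers:</div>

```python
# Effective batch size and optimizer steps for the run configured above
examples = len(train_dataset)                    # ~28,500 with a 95% split of ~30k examples
per_device_batch = 16
grad_accum = 1
effective_batch = per_device_batch * grad_accum  # 16

steps_per_epoch = examples // effective_batch    # ~1,780
total_steps = steps_per_epoch * 4                # num_train_epochs=4 -> ~7,100 steps

print(f"{effective_batch=} {steps_per_epoch=} {total_steps=}")
print(f"Evaluations per epoch at eval_steps=200: ~{steps_per_epoch // 200}")
```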
Now for the best part: let's actually start the training by calling the train() method on our Supervised Fine-Tuning Trainer. There are more Trainer classes available; we are using the most common one for the sake of example:</br>
</br>

```python=
trainer.train()

# or, to resume from a checkpoint:
trainer.train(
    resume_from_checkpoint="/workspace/checkpoint-xxx"
)
```
</div>
<div class="tp-mint">After successfully training the model we can save its weights as an adapter, here's how:</div>

```python=
# First let's save the adapter weights:
output_dir = os.path.join("/workspace/MyAi-13b-4bit", "final_checkpoint")
trainer.model.save_pretrained(output_dir)
trainer.save_model("/workspace/MyAi-13b-4bit")

# safe_serialization saves the weights in the safetensors format
trainer.model.save_pretrained(output_dir, safe_serialization=True)

# torch.save(model.state_dict(), "/workspace/MyAi-13b-4bit/MyAi-13b-4bit.pth")  # Save the weights as a .pth file if you prefer
```

<div class="tp-mint">Now we are going to merge the trained adapter and the base model:</div>

```python=
# Free the GPU memory held by the training model
del model
torch.cuda.empty_cache()

# Reload the saved adapter together with its base model
model = AutoPeftModelForCausalLM.from_pretrained(output_dir, device_map="auto", torch_dtype=torch.bfloat16)

# Merge the adapter into the base weights and unload the adapter layers
model = model.merge_and_unload()

# Save the merged model
model.save_pretrained(output_dir)
```

<hr style="height: 2px; background-color: #116799; border: none;" />

<div class="c-bleu">Don't forget to upload the merged model to the Hugging Face Hub, so you can easily launch the inference:</div>
</br>

```python=
hub_name = 'MyAi-13b-4bit'

model.push_to_hub(hub_name)
tokenizer.push_to_hub(hub_name)
```
</br>

<div class="t-blue">Your Own AI: Inference</div>
</br>
<div class="tp-mint"><b>What is inference?</b></div>
<div class="c-bleu">Inference is the process of using a trained neural network to make predictions on new input data. This phase follows training, where the model has learned to recognize patterns or perform tasks. During inference the model's weights are frozen, and it is used to interpret unseen data. The ultimate aim is to generalize what was learned during training to real-world applications, such as language translation, image recognition, or, in our case, generating text.</div></br>
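<div class="c-bleu">Before building anything elaborate, the quickest way to try the model you just pushed is a one-off call through the transformers pipeline API. This is only a sketch - 'MyAi-13b-4bit' stands in for your own Hub repository id (usually username/MyAi-13b-4bit):</div>

```python
import torch
from transformers import pipeline

# Placeholder repo id - use the name you pushed to the Hub above
pipe = pipeline(
    "text-generation",
    model="MyAi-13b-4bit",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

result = pipe("Write a Python one-liner that reverses a string.", max_new_tokens=64)
print(result[0]["generated_text"])
```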
<div class="c-bleu">Inference can be performed in various ways depending on the application. Below we'll explore two methods - a Flask application tunneled through ngrok, and a simple `while True` loop that takes user input - to demonstrate how our trained model can be interacted with in real time.</div>

<div class="mini-title">Using Flask and ngrok for Inference</div>

```python
!pip install flask_ngrok

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, GenerationConfig
import torch
import GPUtil
import time
import threading
import sys
from flask import Flask, request
from flask_ngrok import run_with_ngrok

# Function to monitor GPU usage
def monitor_gpu_usage():
    while True:
        gpu = GPUtil.getGPUs()[0]
        util = gpu.memoryUtil * 100
        free = gpu.memoryFree
        used = gpu.memoryUsed
        total = gpu.memoryTotal
        sys.stdout.write('\033[K')  # Clear to the end of line
        sys.stdout.write(f"\rGPU RAM Free: {free:.0f}MB | Used: {used:.0f}MB | Util {util:3.0f}% | Total {total:.0f}MB")
        sys.stdout.flush()
        time.sleep(2)

# Start the GPU monitoring thread
gpu_thread = threading.Thread(target=monitor_gpu_usage)
gpu_thread.daemon = True
gpu_thread.start()

# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("flytech/Ruckus-PyAssi-13b")
device = "cuda:0"
model = AutoModelForCausalLM.from_pretrained(
    "flytech/Ruckus-PyAssi-13b",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16),
    device_map="auto",
)
model.config.use_cache = True

# Define the generation configuration
generation_config = GenerationConfig(
    temperature=0.95,
    top_p=0.92,
    # num_beams=4,
    # no_repeat_ngram_size=4,
    decoder_start_token_id=tokenizer.bos_token_id,
    do_sample=True,
    max_new_tokens=1024,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

# Create the Flask app and apply ngrok
app = Flask(__name__)
run_with_ngrok(app)

# Define the route for generation
@app.route("/generate", methods=["POST"])
def generate():
    data = request.get_json()
    prompt = data.get("prompt", "")
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    inputs = {key: tensor.to(device) for key, tensor in inputs.items()}
    outputs = model.generate(**inputs, generation_config=generation_config)
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return {"generated_text": generated_text}

if __name__ == '__main__':
    app.run()
```

<div class="c-bleu">In the snippet above, we use Flask to create a web server and ngrok to tunnel our local server to a public URL, enabling external access. The application monitors GPU usage in a separate thread, so we get real-time updates on the resources consumed by our AI model. The /generate endpoint takes input via a POST request and uses our model to generate text based on the prompt.</div></br>
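<div class="c-bleu">When the app starts, ngrok prints a public URL; the endpoint can then be exercised from anywhere, for example with the `requests` library. A small sketch - the URL below is a placeholder for the one ngrok prints:</div>

```python
import requests

# Replace with the public URL printed by ngrok when the app starts
url = "http://<your-ngrok-subdomain>.ngrok.io/generate"

response = requests.post(url, json={"prompt": "Write a haiku about GPUs."})
print(response.json()["generated_text"])
```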
<div class="mini-title">Simple Inference</div>

```python=
import gc, torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import GPUtil
import time
import threading
import sys

# Function to monitor GPU usage statistics
def monitor_gpu_usage():
    while True:
        gpu = GPUtil.getGPUs()[0]        # Get the first GPU
        util = gpu.memoryUtil * 100      # Calculate utilization percentage
        free = gpu.memoryFree            # Free memory in MB
        used = gpu.memoryUsed            # Used memory in MB
        total = gpu.memoryTotal          # Total memory in MB
        sys.stdout.write('\033[K')       # ANSI escape sequence to clear the line
        sys.stdout.write(f"\rGPU RAM Free: {free:.0f}MB | Used: {used:.0f}MB | Util {util:3.0f}% | Total {total:.0f}MB")
        sys.stdout.flush()               # Flush the standard output
        time.sleep(1)                    # Wait for 1 second before the next check

# Create and start the GPU monitoring thread
gpu_thread = threading.Thread(target=monitor_gpu_usage)
gpu_thread.daemon = True   # Set the thread as a daemon
gpu_thread.start()

# Load the tokenizer and model (from your own HF repository, or any public one)
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-7b1")
device = "cuda:0"  # Load on a single GPU
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-7b1",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,                       # Load the model in 4-bit
        bnb_4bit_compute_dtype=torch.float16     # Set the compute data type to float16
    ),
    device_map="auto",         # Automatic GPU placement
    trust_remote_code=True,    # Only needed if the repository ships custom modeling code
)
model.config.use_cache = True  # Enable for inference, as it was False during training

# Infinite loop to continuously process input prompts
while True:
    text = input("Enter your prompt: ")
    inputs = tokenizer(text, return_tensors="pt")                                   # Tokenize the input text
    inputs = {key: tensor.to(device) for key, tensor in inputs.items()}             # Move tensors to the defined device
    outputs = model.generate(
        **inputs,
        num_beams=4,                                   # Example beam search with 4 beams
        no_repeat_ngram_size=4,                        # Avoid repeating n-grams of size 4
        decoder_start_token_id=tokenizer.bos_token_id, # Start token for decoding
        do_sample=False,                               # Disable random sampling to always get the same output for a given prompt
        max_new_tokens=160,                            # Maximum number of new tokens to generate
        num_return_sequences=1,                        # Return only one sequence
        eos_token_id=tokenizer.eos_token_id,           # End-of-sequence token ID
        pad_token_id=tokenizer.pad_token_id,           # Padding token ID
    )
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)  # Decode the output tokens to a string
    print(generated_text)
    del inputs, outputs        # Delete variables to free memory
    torch.cuda.empty_cache()   # Empty the CUDA cache
    gc.collect()               # Run garbage collection
```
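<div class="c-bleu">One optional refinement to the loop above: transformers ships a TextStreamer that prints tokens as they are generated instead of waiting for the full output, which makes interactive use feel far more responsive. A sketch of the change, assuming the same `model`, `tokenizer`, and `inputs` as in the loop:</div>

```python
from transformers import TextStreamer

# Prints tokens to stdout as they are produced; skip echoing the prompt itself
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

outputs = model.generate(
    **inputs,
    streamer=streamer,          # Stream instead of waiting for the full sequence
    max_new_tokens=160,
    do_sample=False,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)
```
</div>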