Author: FLOCKAH <style> .body-md { background-color: #112131; color: #ffffff; padding: 10px; border: 2px solid #000; border-radius: 15px; } .title { color: #ffffff; font-size: 32px; text-align: center; font-family: 'Lucida Sans', 'Lucida Sans Unicode'; background-image: linear-gradient(45deg, #007bff, #0056b3, #003c7a, #002451); background-size: 200% 200%; animation: gradientShift 4s ease infinite; text-shadow: 0px 0px 10px rgba(0, 123, 200, 0.7); border-radius: 5px; box-shadow: 0 10px 20px -10px #333; padding: 5px 10px; display: block; margin: 10px auto; position: relative; transition: transform 0.3s ease-out; animation: blurShift 5s infinite; } .title::after { content: ''; position: absolute; bottom: 0; left: 7vh; display: block; width: 80%; height: 3px; background-color: #0056b3; border-radius: 5px; animation: shake 8s infinite; } @keyframes blurShift { 0% { filter: blur(1px); } 25% { filter: blur(0.5px); } 50% { filter: blur(0px); } 75% { filter: blur(0.5px); } 100% { filter: blur(1px); } } @keyframes gradientShift { 0% { background-position: 0% 50%; } 50% { background-position: 100% 50%; } 100% { background-position: 0% 50%; } } @keyframes loading { 0% { width: 0; } 50% { width: 100%; } 100% { width: 0; } } @keyframes shake { 0%, 100% { transform: translateX(0); } 10%, 30%, 50%, 70%, 90% { transform: translateX(-10px); } 20%, 40%, 60%, 80% { transform: translateX(10px); } } .shake { animation: shake 0.82s cubic-bezier(.36,.07,.19,.97) both; transform: translate3d(0, 0, 0); backface-visibility: hidden; perspective: 1000px; } .title:hover { transform: translateY(-5px); cursor: pointer; } .mini-title { color: #ace5ee; font-size: 24px; text-align: center; font-family: 'Lucida Sans', 'Lucida Sans Unicode'; background-image: linear-gradient(45deg, #0099cc, #0077a2, #005b7a, #003f5a); background-size: 200% 200%; animation: gradientShift 4s ease infinite; text-shadow: 0px 0px 8px rgba(0, 156, 204, 0.7); border-radius: 5px; box-shadow: 0 8px 16px -8px #333; padding: 4px 10px; display: block; margin: 8px auto; position: relative; transition: transform 0.3s ease-out; } .mini-title::after { content: ''; position: absolute; bottom: 0; left: 0; width: 100%; height: 2px; background-color: #0077a2; border-radius: 5px; animation: loading 3s infinite; } .mini-title:hover { transform: translateY(-3px); cursor: pointer; } .t-blue { color: #31aaf9; font-size: 18px; font-family: Lucida Console; text-align: center; padding: 2px; background-color: rgba(10,10,10,0.); display: block; } .tp-mint { color: #6fcccc; font-size: 17px; text-align: justify; padding: 20px; font-weight: bold; } .c-bleu { color: #36b1d1; font-size: 16px; text-align: justify; margin-left: 5px; } .tabu { padding: 20px; } .grid-table { border-collapse: collapse; width: 100%; } .grid-table th, .grid-table td { border: 1px solid #ddd; padding: 8px; text-align: left; color: #000; } .grid-table th { background-color: #4CAF50; color: white; } .grid-table td { background-color: #f2f2f2; } .grid-table tr:nth-child(even){background-color: #f2f2f2;} .grid-table tr:hover, .grid-table th:hover, .grid-table td:hover {background-color: #ddd;} .small-gray-text { display: block; color: gray; font-size: 0.8em; margin-top: 5px; text-decoration: underline; margin-left: 10px; text-underline-offset: 4px; } .link-hf { color: #a5b533; } </style> <div class="body-md"> <div class="title">Your Own AI</div> <hr style="height: 2px; background-color: #116799; border: none;" /> <div class="t-blue">Introduction</div></br> <div class="tp-mint">1. 
How AI Generates Text from Text</div>
<div class="c-bleu">AI models generate text by predicting the probability of the next token in a sequence, given the input. They rely on statistical patterns learned from vast datasets to produce coherent, contextually relevant text.</div>
<div class="tp-mint">2. Tokens and Tokenization</div>
<div class="c-bleu">Tokens are the building blocks of text in NLP. Tokenization is the process of converting text into tokens, which helps the model capture the context and meaning of the text. The model's tokenizer splits text according to the vocabulary the model was trained with, while dataset tokenization structures the data for training purposes.</div>
<div class="tp-mint">3. Choosing a Model</div>
<div class="c-bleu">When selecting a model for AI development, consider the "weights", which are akin to the AI's knowledge base derived from training data. These weights, together with the model's architecture, determine the AI's capabilities. Larger models typically have more weights and therefore a greater capacity for complex tasks, but time has shown that architecture plays at least as important a role: many small models outperform bigger ones.</div></br>

**Here's a quick cheatsheet for model sizes and their resource requirements:**
<span class="small-gray-text">Assuming we only load the weights of the model:</span>

<table class="grid-table">
<tr>
<th>Model Size</th>
<th>Full Precision (32-bit)</th>
<th>Half Precision (16-bit)</th>
<th>8-bit Precision</th>
<th>4-bit Precision</th>
</tr>
<tr>
<td>13B model</td>
<td>~52 GB VRAM and 60 GB RAM</td>
<td>~26 GB VRAM</td>
<td>~13 GB VRAM</td>
<td>~6.5 GB VRAM, 30 GB RAM</td>
</tr>
<tr>
<td>7B model</td>
<td>~28 GB VRAM and 30 GB RAM</td>
<td>~14 GB VRAM</td>
<td>~7 GB VRAM</td>
<td>~3.5 GB VRAM, 15 GB RAM</td>
</tr>
<tr>
<td>3B model</td>
<td>~12 GB VRAM and 12 GB RAM</td>
<td>~6 GB VRAM</td>
<td>~3 GB VRAM</td>
<td>~1.5 GB VRAM, 6 GB RAM</td>
</tr>
</table>

<div class="c-bleu">Larger models consume far more GPU memory and tend to be more accurate, while smaller ones may give more varied results at a fraction of the cost. Choosing the right model means balancing model size, precision, architecture, and the quality of the training data.</br></div></br>
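<div class="c-bleu">As a rough rule of thumb behind the table above, the memory needed just to hold the weights is the parameter count times the bytes per parameter (4 for 32-bit, 2 for 16-bit, 1 for 8-bit, 0.5 for 4-bit). Here is a minimal sketch of that arithmetic; real usage is higher once activations, context length, and (for training) optimizer states are added:</div>

```python
def weight_memory_gb(params_billion: float, bits: int) -> float:
    """Rough size of the weights alone, in GB: parameters x bytes per parameter."""
    return params_billion * (bits / 8)

for bits in (32, 16, 8, 4):
    print(f"13B model @ {bits}-bit: ~{weight_memory_gb(13, bits):.1f} GB just for the weights")
# Activations, the KV cache, optimizer states (when training) and longer
# context lengths all come on top of this number.
```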
<div class="c-bleu">Best to start from here: <a class="link-hf" href="https://huggingface.co">Hugging Face</a> <span>🤗</span> </div>

<hr style="height: 2px; background-color: #116799; border: none;" />

<div class="t-blue">Your Own AI: Preparation</div>
<div class="tp-mint">1. Environment</br></div>
<div class="c-bleu">AI development often relies on cloud computing for access to high-performance GPUs.</div>
<div class="tabu"></br>

**Here are widely-used cloud services for deep learning:**

- Microsoft Azure
- Amazon AWS SageMaker
- Google Colaboratory
- RunPod
- Paperspace

</div>
<div class="c-bleu">Remember to select cloud instances with more GPU resources than the bare minimum, to accommodate dataset size and context length, both of which can significantly increase GPU usage.</div>
<div class="tp-mint">2. Architecture</div>
<div class="c-bleu">The leading training and quantization approaches as of 2023 include:

- LongLoRA - a long-context (100k+ tokens!) training method, helpful when we train the model for summarizing and writing tasks
- LoRA (Low-Rank Adaptation) - focuses on training small adapter matrices attached to targeted modules
- QLoRA - combines LoRA with a quantized, frozen base model for much lower GPU usage
- GPTQ - quantizes the model weights without any further training or tuning; it earns its place on this list because it is extremely helpful when we want to load a bigger model for inference right away, or when we choose to quantize ourselves instead of relying on QLoRA
- GGUF - enables running large models on CPU through quantization
</div>
<div class="tp-mint">3. Dataset</div>
<div class="c-bleu">Datasets are a crucial part of AI training; they should be well-structured and focused. They typically consist of entries with input and output fields:</div></br>

```python=
[
  {
    "input": "Need a quick Python script for checking if a word is a palindrome",
    "output": "Sure... {code}"
  }
]
# The field names themselves do not matter; they could be "sol_1" and "gol_2"
# There can also be several: "input", "instruction", "output", "context", "text"
# The key is to recognize which field is the 'User' (question) and which is the 'AI' (answer)
```

<div class="c-bleu">The dataset should be comprehensive for the intended use.</div></br>

**Including diverse topics requires ample data to ensure effective training across all areas**</br>

<div class="c-bleu">The most common question when looking for a dataset is `Does the data fit the task?`, and there is no universal answer, except that you can check it yourself - by running a short training.</div></br>

```python
# A short "dry run" of your training script
training_args = TrainingArguments(
    output_dir = ...,
    logging_dir = ...,
    evaluation_strategy = "steps",
    do_eval = True,
    eval_steps = 10,        # Run a validation check every 10 steps
    save_steps = 100,       # Checkpointing is not the point of this dry run
    max_steps = 100,        # Stop after 100 optimization steps
    logging_steps = 10,
    # num_train_epochs=12,  # Ignored while max_steps is set, keep it commented out
    # other arguments
)
```

<div class="c-bleu">This isn't a foolproof method, but it's generally a good idea to run a short training and see how the model performs based on the validation logs (or additional metrics) that come out.</br></br>
Data preparation is a large topic, which is why we will come back to it later...
</div>
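<div class="c-bleu">As a small preview of that preparation work, here is a minimal sketch of loading a local JSON dataset like the example above and merging its fields into a single `text` column with the `datasets` library. The file name, field names, and prompt format are placeholders - adjust them to your own data:</div>

```python
from datasets import load_dataset

# Hypothetical local file containing entries like {"input": ..., "output": ...}
raw = load_dataset("json", data_files="my_dataset.json", split="train")

# Merge the fields the model should learn from into one 'text' column
raw = raw.map(
    lambda ex: {"text": f"### User: {ex['input']}\n### Assistant: {ex['output']}"}
)

print(raw[0]["text"])  # Inspect one formatted training example
```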
<hr style="height: 2px; background-color: #116799; border: none;" />

<div class="t-blue">Your Own AI: Training</div>
<div class="tp-mint">1. Cloud Backup</div>
<div class="c-bleu">Backing up is essential, yet often overlooked. The rationale is simple - data in the cloud isn't permanent, and files are typically deleted once you disconnect from the instance, whether you intended that or not.</br></br>
Many providers offer cloud storage for backups, but some don't, or backups may not be included with your plan. Make sure you can transfer the training files to your preferred cloud hosting, ideally at any point during training.</div>
<div class="tp-mint">2. Training Script</div>
<div class="c-bleu">By now, you should have selected the model, dataset, training architecture, cloud service, and the task type for your model.</br></br>
Let's dive into programming without further ado:</div>
</br>

```bash
# Install the necessary modules
# Terminal:
pip install -q -U bitsandbytes auto-gptq sentencepiece transformers peft accelerate datasets trl einops GPUtil huggingface_hub tensorboard scipy protobuf
# Notebook:
!pip install -q -U bitsandbytes auto-gptq sentencepiece transformers peft accelerate datasets trl einops GPUtil huggingface_hub tensorboard scipy protobuf
```

If any of the modules cannot be imported using `import module_name`, remove `-q` from the `pip install` command above to disable quiet mode and see the full installation log.

```python=
# Import necessary modules
import GPUtil
import os
import time
import torch
import threading
import sys
from transformers import (
    AutoModelForCausalLM,   # If the model is not a Llama variant
    LlamaForCausalLM,       # Only if Llama
    AutoTokenizer,          # Downloads the model's tokenizer for training
    BitsAndBytesConfig,     # Quantization
    TrainingArguments,
    Trainer,                # If not quantized
    LlamaTokenizer,         # Only if Llama
)

# Name of the model to train, Llama-2 as the example
model_name = "meta-llama/Llama-2-13b-hf"

# In this example we will quantize the model to 4-bit:
compute_dtype = getattr(torch, "float16")
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # For 4-bit quantization
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
    # "This flag is used for nested quantization where the quantization constants
    #  from the first quantization are quantized again"
    # source: https://huggingface.co/docs/transformers/main_classes/quantization
)

# Now load the tokenizer
tokenizer = LlamaTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'right'  # 'left' may cause the model to generate nonsense output as of 03.11.2023

# Finally, download and load the model:
max_retries = 3   # Number of retries before giving up
retry_delay = 5   # Delay between retries in seconds

# Retry the model download on failure
for i in range(max_retries):
    try:
        # If you don't train Llama, use AutoModelForCausalLM instead
        model = LlamaForCausalLM.from_pretrained(
            model_name,
            quantization_config=bnb_config,
            device_map={"": 0},        # Place the whole model on cuda:0 for single-GPU training
            trust_remote_code=False,   # Only set True if the repository ships custom modeling code you trust
            # use_flash_attention_2=True,  # Optional on A100 and newer (Ampere) GPUs; can noticeably reduce GPU consumption, requires flash-attn
        )
        model.config.pretraining_tp = 2  # Slower but more exact linear-layer computation, matching Llama-2's tensor-parallel pretraining
        break  # Exit the loop once the model is successfully downloaded
    except Exception as e:
        print(f"An exception occurred: {e}")
        if i < max_retries - 1:
            print(f"Retrying in {retry_delay} seconds...")
            time.sleep(retry_delay)
        else:
            print("Max retries reached. Exiting.")
            raise  # Re-raise the last exception to exit the script

model.config.pad_token_id = tokenizer.eos_token_id  # Set the padding token id to the tokenizer's end-of-sequence token id
model.config.use_cache = False                      # Set True for inference
```
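<div class="c-bleu">Before moving on, it can be worth confirming how much memory the quantized weights actually occupy and where they ended up; a quick sanity-check sketch:</div>

```python
# Rough footprint of the loaded (quantized) weights, in GB
print(f"Weights footprint: {model.get_memory_footprint() / 1024**3:.2f} GB")

# Which device each part of the model was placed on by `device_map`
print(f"Device placement: {model.hf_device_map}")
```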
<div class="c-bleu">Once the model has downloaded and loaded successfully (which may take some time depending on its size), we can proceed with further processing. We'll begin by freezing its weights, examining the internal modules, and assessing the model's trainable parameters.</div></br>

```python=
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

peft_config = LoraConfig(
    lora_alpha=64,
    lora_dropout=0.01,
    r=16,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
)

model = get_peft_model(model, peft_config)

print(model)  # Prints all the modules; look for the Linear4bit layers and choose which to target (more is not always better)
model.print_trainable_parameters()  # Shows how few parameters LoRA actually trains
```

``Where:``
- ``r, the dimension of the low-rank matrices``
- ``lora_alpha, the scaling factor for the low-rank matrices``
- ``lora_dropout, the dropout probability of the LoRA layers``

<div class="small-gray-text">
<a class="link-hf" href="https://huggingface.co/docs/peft/v0.6.0/en/quicktour#peftconfig">Hugging Face PEFT Documentation<span>🤗</span></a>
</div>
<div class="c-bleu">
</br>
Let's proceed with a simple GPU memory tracker that reports how much memory the model is consuming in real time:
</div></br>

```python=
def monitor_gpu_usage():
    while True:
        gpu = GPUtil.getGPUs()[0]
        util = gpu.memoryUtil * 100
        free = gpu.memoryFree
        used = gpu.memoryUsed
        total = gpu.memoryTotal
        sys.stdout.write('\033[K')  # Clear to the end of line
        sys.stdout.write(f"\rGPU RAM Free: {free:.0f}MB | Used: {used:.0f}MB | Util {util:3.0f}% | Total {total:.0f}MB")
        sys.stdout.flush()
        time.sleep(5)

# Start the GPU monitoring thread
gpu_thread = threading.Thread(target=monitor_gpu_usage)
gpu_thread.daemon = True
gpu_thread.start()
```

</br>
<div class="c-bleu">
We also want to take care of loading the dataset ourselves:

- Lazy loading means data is loaded (supplied to the trainer) on the fly, as needed, rather than all at once at the beginning. This is beneficial when dealing with large datasets that do not fit into memory.
- custom_data_collator is a function that takes a batch of samples from the dataset and collates them into a single batch. This is a common requirement for batch processing in PyTorch, as samples need to be gathered into tensors of the same size.

```python=
from datasets import load_dataset
from torch.utils.data import Dataset, random_split

MAX_LENGTH = 512

def custom_data_collator(batch):
    input_ids = [item[0] for item in batch]
    attn_masks = [item[1] for item in batch]
    return {
        'input_ids': torch.stack(input_ids),
        'attention_mask': torch.stack(attn_masks),
        'labels': torch.stack(input_ids)  # Causal LM: the labels are the inputs themselves
    }

class OnTheFly(Dataset):  # Lazy loading: tokenizes each sample only when it is requested
    def __init__(self, txt_list, tokenizer):
        self.txt_list = txt_list
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.txt_list)

    def __getitem__(self, idx):
        txt = self.txt_list[idx]
        encodings_dict = self.tokenizer(txt, truncation=True, max_length=MAX_LENGTH, padding="max_length")
        input_ids = torch.tensor(encodings_dict['input_ids'])
        attn_masks = torch.tensor(encodings_dict['attention_mask'])
        return input_ids, attn_masks
```

To load the dataset we will use the standard load_dataset function from the datasets module; it loads the dataset into memory, while OnTheFly assumes the texts are already there and only tokenizes them as the trainer asks for them.
```python=
training_data = load_dataset("flytech/llama-python-codes-30k", split='train')

# If you use a different dataset, remember to map the combined fields into one column,
# ['text'] in this example:
# .map(lambda example: {'text': example['instruction'] + ' ' + example['input'] + ' ' + example['output']})

texts = training_data['text']  # The raw training strings handed to OnTheFly

# Initialize the dataset
dataset = OnTheFly(texts, tokenizer)

# Change this ratio as needed
train_ratio = 0.95
train_dataset, val_dataset = random_split(
    dataset,
    [int(train_ratio * len(dataset)), len(dataset) - int(train_ratio * len(dataset))]
)
# random_split shuffles the dataset and produces a training and a validation split;
# feel free to skip this and supply the trainer with a different split or dataset.
```
</div>
<div class="tp-mint">3. Training </div></br>
<div class="c-bleu"><b>Here we will:</b></br></br>

1. Use TensorBoard to display our saved logs with visually appealing metrics.
2. Set a batch size of 16, which will consume approximately 24 GB of VRAM with this model.
3. Train for 4 epochs, and possibly more if the model is still learning (as indicated by a decreasing training loss).
4. Enable Hugging Face checkpoints!
5. Use the adamw_bnb_8bit optimizer.

```python=
from trl import SFTTrainer
from transformers.integrations import TensorBoardCallback

training_arguments = TrainingArguments(
    output_dir="/content/Ruckus-13b-24",
    logging_dir="/content/Ruckus-13b-24",
    evaluation_strategy="steps",
    do_eval=True,                    # Perform evaluation (disable for no evaluation at all)
    save_total_limit=14,
    per_device_train_batch_size=16,  # a
    gradient_accumulation_steps=1,   # b
    per_device_eval_batch_size=16,
    num_train_epochs=4,
    optim="adamw_bnb_8bit",
    save_steps=200,                  # Doesn't have to match the eval and logging steps, but it's recommended
    logging_steps=200,               # Same as eval steps
    learning_rate=2e-4,              # Extremely important: defines how fast the model learns, but can hurt quality if set poorly
    eval_steps=200,                  # Evaluate on the validation dataset every 200 optimization steps
    eval_accumulation_steps=2,
    fp16=False,                      # Use fp16 on GPUs without bfloat16 support; prefer bf16 when available
    bf16=True,                       # Set to True only on A100 GPUs and newer (Ampere)
    #max_grad_norm=1.0,              # Gradient clipping
    #weight_decay=0.01,              # Weight decay, a regularizer (similar in spirit to LoRA dropout)
    #warmup_ratio=0.1,               # Use along with a linear or cosine scheduler
    lr_scheduler_type="constant",
    save_safetensors=True,
    push_to_hub=True,
    hub_model_id="Your-AI-Name",
    hub_token="hf_herusygkj1823saddngb118",  # A fake (but similar-looking) token; create a Hugging Face account and go to Settings -> Access Tokens
    hub_strategy="checkpoint",       # Enable checkpoints on the Hub
    remove_unused_columns=True,
    dataloader_num_workers=4         # Utilize CPU cores
)
# The total batch size is always per_device_train_batch_size × gradient_accumulation_steps (a*b)

tensorboard_callback = TensorBoardCallback()  # <- Callback for pretty metrics

trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    dataset_text_field="text",
    peft_config=peft_config,
    max_seq_length=512,              # <- Max length of a dataset entry (in tokens); anything longer is truncated
    tokenizer=tokenizer,
    data_collator=custom_data_collator,
    packing=False,
    # callbacks=[tensorboard_callback]
)
```
</br>
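<div class="c-bleu">A quick bit of arithmetic helps sanity-check the step-based settings above (eval_steps, save_steps, logging_steps). This sketch assumes the ~30k-example dataset and the 95% split used earlier; swap in your own numbers:</div>

```python
# Effective batch size and optimizer steps for the run configured above
examples = len(train_dataset)                    # ~28,500 with a 95% split of ~30k examples
per_device_batch = 16
grad_accum = 1
effective_batch = per_device_batch * grad_accum  # 16

steps_per_epoch = examples // effective_batch    # ~1,780
total_steps = steps_per_epoch * 4                # num_train_epochs=4 -> ~7,100 steps

print(f"{effective_batch=} {steps_per_epoch=} {total_steps=}")
print(f"Evaluations per epoch at eval_steps=200: ~{steps_per_epoch // 200}")
```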
Now for the best part: let's actually start the training by calling the train() method on our Supervised Fine-Tuning Trainer. There are more Trainer classes available; we are using the most common one for the sake of example:</br>
</br>

```python=
trainer.train()

# or, to resume from a checkpoint:
trainer.train(
    resume_from_checkpoint="/workspace/checkpoint-xxx"
)
```
</div>
<div class="tp-mint">After successfully training the model we can save its weights as an adapter, here's how:</div>

```python=
# First let's save the adapter weights:
output_dir = os.path.join("/workspace/MyAi-13b-4bit", "final_checkpoint")
trainer.model.save_pretrained(output_dir)
trainer.save_model("/workspace/MyAi-13b-4bit")

# safe_serialization saves the weights in the safetensors format
trainer.model.save_pretrained(output_dir, safe_serialization=True)

# torch.save(model.state_dict(), "/workspace/MyAi-13b-4bit/MyAi-13b-4bit.pth")  # Save the weights as a .pth file if you prefer
```

<div class="tp-mint">Now we are going to merge the trained adapter and the base model:</div>

```python=
# Free the GPU memory held by the training model
del model
torch.cuda.empty_cache()

# Reload the saved adapter together with its base model
model = AutoPeftModelForCausalLM.from_pretrained(output_dir, device_map="auto", torch_dtype=torch.bfloat16)

# Merge the adapter into the base weights and unload the adapter layers
model = model.merge_and_unload()

# Save the merged model
model.save_pretrained(output_dir)
```

<hr style="height: 2px; background-color: #116799; border: none;" />

<div class="c-bleu">Don't forget to upload the merged model to the Hugging Face Hub, so you can easily launch the inference:</div>
</br>

```python=
hub_name = 'MyAi-13b-4bit'

model.push_to_hub(hub_name)
tokenizer.push_to_hub(hub_name)
```
</br>

<div class="t-blue">Your Own AI: Inference</div>
</br>
<div class="tp-mint"><b>What is inference?</b></div>
<div class="c-bleu">Inference is the process of using a trained neural network to make predictions on new input data. This phase follows training, where the model has learned to recognize patterns or perform tasks. During inference the model's weights are frozen, and it is used to interpret unseen data. The ultimate aim is to generalize what was learned during training to real-world applications, such as language translation, image recognition, or, in our case, generating text.</div></br>
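<div class="c-bleu">Before building anything elaborate, the quickest way to try the model you just pushed is a one-off call through the transformers pipeline API. This is only a sketch - 'MyAi-13b-4bit' stands in for your own Hub repository id (usually username/MyAi-13b-4bit):</div>

```python
import torch
from transformers import pipeline

# Placeholder repo id - use the name you pushed to the Hub above
pipe = pipeline(
    "text-generation",
    model="MyAi-13b-4bit",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

result = pipe("Write a Python one-liner that reverses a string.", max_new_tokens=64)
print(result[0]["generated_text"])
```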
<div class="c-bleu">Inference can be performed in various ways depending on the application. Below we'll explore two methods - a Flask application tunneled through ngrok, and a simple `while True` loop that takes user input - to demonstrate how our trained model can be interacted with in real time.</div>

<div class="mini-title">Using Flask and ngrok for Inference</div>

```python
!pip install flask_ngrok

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, GenerationConfig
import torch
import GPUtil
import time
import threading
import sys
from flask import Flask, request
from flask_ngrok import run_with_ngrok

# Function to monitor GPU usage
def monitor_gpu_usage():
    while True:
        gpu = GPUtil.getGPUs()[0]
        util = gpu.memoryUtil * 100
        free = gpu.memoryFree
        used = gpu.memoryUsed
        total = gpu.memoryTotal
        sys.stdout.write('\033[K')  # Clear to the end of line
        sys.stdout.write(f"\rGPU RAM Free: {free:.0f}MB | Used: {used:.0f}MB | Util {util:3.0f}% | Total {total:.0f}MB")
        sys.stdout.flush()
        time.sleep(2)

# Start the GPU monitoring thread
gpu_thread = threading.Thread(target=monitor_gpu_usage)
gpu_thread.daemon = True
gpu_thread.start()

# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("flytech/Ruckus-PyAssi-13b")
device = "cuda:0"
model = AutoModelForCausalLM.from_pretrained(
    "flytech/Ruckus-PyAssi-13b",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16),
    device_map="auto",
)
model.config.use_cache = True

# Define the generation configuration
generation_config = GenerationConfig(
    temperature=0.95,
    top_p=0.92,
    # num_beams=4,
    # no_repeat_ngram_size=4,
    decoder_start_token_id=tokenizer.bos_token_id,
    do_sample=True,
    max_new_tokens=1024,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

# Create the Flask app and apply ngrok
app = Flask(__name__)
run_with_ngrok(app)

# Define the route for generation
@app.route("/generate", methods=["POST"])
def generate():
    data = request.get_json()
    prompt = data.get("prompt", "")
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    inputs = {key: tensor.to(device) for key, tensor in inputs.items()}
    outputs = model.generate(**inputs, generation_config=generation_config)
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return {"generated_text": generated_text}

if __name__ == '__main__':
    app.run()
```

<div class="c-bleu">In the snippet above, we use Flask to create a web server and ngrok to tunnel our local server to a public URL, enabling external access. The application monitors GPU usage in a separate thread, so we get real-time updates on the resources consumed by our AI model. The /generate endpoint takes input via a POST request and uses our model to generate text based on the prompt.</div></br>
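<div class="c-bleu">When the app starts, ngrok prints a public URL; the endpoint can then be exercised from anywhere, for example with the `requests` library. A small sketch - the URL below is a placeholder for the one ngrok prints:</div>

```python
import requests

# Replace with the public URL printed by ngrok when the app starts
url = "http://<your-ngrok-subdomain>.ngrok.io/generate"

response = requests.post(url, json={"prompt": "Write a haiku about GPUs."})
print(response.json()["generated_text"])
```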
<div class="mini-title">Simple Inference</div>

```python=
import gc, torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import GPUtil
import time
import threading
import sys

# Function to monitor GPU usage statistics
def monitor_gpu_usage():
    while True:
        gpu = GPUtil.getGPUs()[0]        # Get the first GPU
        util = gpu.memoryUtil * 100      # Calculate utilization percentage
        free = gpu.memoryFree            # Free memory in MB
        used = gpu.memoryUsed            # Used memory in MB
        total = gpu.memoryTotal          # Total memory in MB
        sys.stdout.write('\033[K')       # ANSI escape sequence to clear the line
        sys.stdout.write(f"\rGPU RAM Free: {free:.0f}MB | Used: {used:.0f}MB | Util {util:3.0f}% | Total {total:.0f}MB")
        sys.stdout.flush()               # Flush the standard output
        time.sleep(1)                    # Wait for 1 second before the next check

# Create and start the GPU monitoring thread
gpu_thread = threading.Thread(target=monitor_gpu_usage)
gpu_thread.daemon = True   # Set the thread as a daemon
gpu_thread.start()

# Load the tokenizer and model (from your own HF repository, or any public one)
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-7b1")
device = "cuda:0"  # Load on a single GPU
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-7b1",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,                       # Load the model in 4-bit
        bnb_4bit_compute_dtype=torch.float16     # Set the compute data type to float16
    ),
    device_map="auto",         # Automatic GPU placement
    trust_remote_code=True,    # Only needed if the repository ships custom modeling code
)
model.config.use_cache = True  # Enable for inference, as it was False during training

# Infinite loop to continuously process input prompts
while True:
    text = input("Enter your prompt: ")
    inputs = tokenizer(text, return_tensors="pt")                                   # Tokenize the input text
    inputs = {key: tensor.to(device) for key, tensor in inputs.items()}             # Move tensors to the defined device
    outputs = model.generate(
        **inputs,
        num_beams=4,                                   # Example beam search with 4 beams
        no_repeat_ngram_size=4,                        # Avoid repeating n-grams of size 4
        decoder_start_token_id=tokenizer.bos_token_id, # Start token for decoding
        do_sample=False,                               # Disable random sampling to always get the same output for a given prompt
        max_new_tokens=160,                            # Maximum number of new tokens to generate
        num_return_sequences=1,                        # Return only one sequence
        eos_token_id=tokenizer.eos_token_id,           # End-of-sequence token ID
        pad_token_id=tokenizer.pad_token_id,           # Padding token ID
    )
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)  # Decode the output tokens to a string
    print(generated_text)
    del inputs, outputs        # Delete variables to free memory
    torch.cuda.empty_cache()   # Empty the CUDA cache
    gc.collect()               # Run garbage collection
```
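<div class="c-bleu">One optional refinement to the loop above: transformers ships a TextStreamer that prints tokens as they are generated instead of waiting for the full output, which makes interactive use feel far more responsive. A sketch of the change, assuming the same `model`, `tokenizer`, and `inputs` as in the loop:</div>

```python
from transformers import TextStreamer

# Prints tokens to stdout as they are produced; skip echoing the prompt itself
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

outputs = model.generate(
    **inputs,
    streamer=streamer,          # Stream instead of waiting for the full sequence
    max_new_tokens=160,
    do_sample=False,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)
```
</div>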