    Author: FLOCKAH <style> .body-md { background-color: #112131; color: #ffffff; padding: 10px; border: 2px solid #000; border-radius: 15px; } .title { color: #ffffff; font-size: 32px; text-align: center; font-family: 'Lucida Sans', 'Lucida Sans Unicode'; background-image: linear-gradient(45deg, #007bff, #0056b3, #003c7a, #002451); background-size: 200% 200%; animation: gradientShift 4s ease infinite; text-shadow: 0px 0px 10px rgba(0, 123, 200, 0.7); border-radius: 5px; box-shadow: 0 10px 20px -10px #333; padding: 5px 10px; display: block; margin: 10px auto; position: relative; transition: transform 0.3s ease-out; animation: blurShift 5s infinite; } .title::after { content: ''; position: absolute; bottom: 0; left: 7vh; display: block; width: 80%; height: 3px; background-color: #0056b3; border-radius: 5px; animation: shake 8s infinite; } @keyframes blurShift { 0% { filter: blur(1px); } 25% { filter: blur(0.5px); } 50% { filter: blur(0px); } 75% { filter: blur(0.5px); } 100% { filter: blur(1px); } } @keyframes gradientShift { 0% { background-position: 0% 50%; } 50% { background-position: 100% 50%; } 100% { background-position: 0% 50%; } } @keyframes loading { 0% { width: 0; } 50% { width: 100%; } 100% { width: 0; } } @keyframes shake { 0%, 100% { transform: translateX(0); } 10%, 30%, 50%, 70%, 90% { transform: translateX(-10px); } 20%, 40%, 60%, 80% { transform: translateX(10px); } } .shake { animation: shake 0.82s cubic-bezier(.36,.07,.19,.97) both; transform: translate3d(0, 0, 0); backface-visibility: hidden; perspective: 1000px; } .title:hover { transform: translateY(-5px); cursor: pointer; } .mini-title { color: #ace5ee; font-size: 24px; text-align: center; font-family: 'Lucida Sans', 'Lucida Sans Unicode'; background-image: linear-gradient(45deg, #0099cc, #0077a2, #005b7a, #003f5a); background-size: 200% 200%; animation: gradientShift 4s ease infinite; text-shadow: 0px 0px 8px rgba(0, 156, 204, 0.7); border-radius: 5px; box-shadow: 0 8px 16px -8px #333; 
padding: 4px 10px; display: block; margin: 8px auto; position: relative; transition: transform 0.3s ease-out; }
.mini-title::after { content: ''; position: absolute; bottom: 0; left: 0; width: 100%; height: 2px; background-color: #0077a2; border-radius: 5px; animation: loading 3s infinite; }
.mini-title:hover { transform: translateY(-3px); cursor: pointer; }
.t-blue { color: #31aaf9; font-size: 18px; font-family: Lucida Console; text-align: center; padding: 2px; background-color: rgba(10,10,10,0); display: block; }
.tp-mint { color: #6fcccc; font-size: 17px; text-align: justify; padding: 20px; font-weight: bold; }
.c-bleu { color: #36b1d1; font-size: 16px; text-align: justify; margin-left: 5px; }
.tabu { padding: 20px; }
.grid-table { border-collapse: collapse; width: 100%; }
.grid-table th, .grid-table td { border: 1px solid #ddd; padding: 8px; text-align: left; color: #000; }
.grid-table th { background-color: #4CAF50; color: white; }
.grid-table td { background-color: #f2f2f2; }
.grid-table tr:nth-child(even){background-color: #f2f2f2;}
.grid-table tr:hover, .grid-table th:hover, .grid-table td:hover {background-color: #ddd;}
.small-gray-text { display: block; color: gray; font-size: 0.8em; margin-top: 5px; text-decoration: underline; margin-left: 10px; text-underline-offset: 4px; }
.link-hf { color: #a5b533; }
</style>
<div class="body-md">
<div class="title">Your Own AI</div>
<hr style="height: 2px; background-color: #116799; border: none;" />
<div class="t-blue">Introduction</div></br>
<div class="tp-mint">1. How AI Generates Text from Text</div>
<div class="c-bleu">AI models generate text by predicting the most probable next words given the input so far. They rely on statistical patterns learned from vast datasets to produce coherent and contextually relevant text.</div>
<div class="tp-mint">2. Tokens and Tokenization</div>
<div class="c-bleu">Tokens are the building blocks of text in NLP.
Tokenization is the process of converting text into tokens, the units a model actually reads. The model's tokenizer splits text using the vocabulary the model was trained with, while dataset tokenization applies that tokenizer to the data to structure it for training.</div>
<div class="tp-mint">3. Choosing a Model</div>
<div class="c-bleu">When selecting a model for AI development, consider the "weights", which are akin to the AI's knowledge base and are derived from training data. These weights, together with the model's architecture, determine the AI's capabilities. Larger models typically have more weights, which gives them a greater capacity to handle complex tasks. Experience has shown, however, that architecture often plays the more important role, as many small models outperform bigger ones.</div></br>

**Here's a quick cheatsheet for model sizes and their resource requirements:**

<span class="small-gray-text">Assuming we only load the weights of the model:</span>

<table class="grid-table">
<tr> <th>Model Size</th> <th>Full Precision (32-bit)</th> <th>Half Precision (16-bit)</th> <th>Quarter Precision (8-bit)</th> <th>4-Bit Precision</th> </tr>
<tr> <td>Model 13b</td> <td>13 × 4 = 52 GB VRAM, 60 GB RAM</td> <td>13 × 2 = 26 GB VRAM</td> <td>13 GB VRAM</td> <td>13 / 2 = 6.5 GB VRAM, 30 GB RAM</td> </tr>
<tr> <td>Model 7b</td> <td>7 × 4 = 28 GB VRAM, 30 GB RAM</td> <td>7 × 2 = 14 GB VRAM</td> <td>7 GB VRAM</td> <td>7 / 2 = 3.5 GB VRAM, 15 GB RAM</td> </tr>
<tr> <td>Model 3b</td> <td>3 × 4 = 12 GB VRAM, 12 GB RAM</td> <td>3 × 2 = 6 GB VRAM</td> <td>3 GB VRAM</td> <td>3 / 2 = 1.5 GB VRAM, 6 GB RAM</td> </tr>
</table>
<div class="c-bleu">Larger models tend to use far more GPU resources and are usually more accurate, while smaller ones consume less but may produce more varied results.
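
The VRAM column of the cheatsheet follows one rule of thumb: number of parameters times bytes per parameter. A minimal sketch of that arithmetic is below; the function name and the precision labels are illustrative, and the result covers the weights only (activations, optimizer state and context length come on top).

```python
# Rough weight-only memory estimate: parameters x bytes per parameter.
# One billion parameters at 1 byte each is about 1 GB (10^9 bytes).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Return the approximate GB needed to hold the weights alone."""
    return params_billion * BYTES_PER_PARAM[precision]


for size in (13, 7, 3):
    row = {name: weight_memory_gb(size, name) for name in BYTES_PER_PARAM}
    print(f"{size}b:", row)
```
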
The key to choosing the right model is balancing model size, precision, architecture, and the quality of the training data.</br></div></br>
<div class="c-bleu">Best to start from here: <a class="link-hf" href="https://huggingface.co">Hugging Face</a> <span>🤗</span> </div>
<hr style="height: 2px; background-color: #116799; border: none;" />
<div class="t-blue">Your Own AI: Preparation</div>
<div class="tp-mint">1. Environment</br></div>
<div class="c-bleu">AI development often relies on cloud computing for access to high-performance GPUs.</div>
<div class="tabu"></br>

**Here are widely-used cloud services for deep learning:**

- Microsoft Azure
- Amazon AWS SageMaker
- Google Colaboratory
- RunPod
- Paperspace

</div>
<div class="c-bleu">Remember to select cloud instances with more GPU memory than the minimum requirement, because data size and context length can significantly increase GPU usage.</div>
<div class="tp-mint">2. Architecture</div>
<div class="c-bleu">The leading training and quantization methods as of 2023 include:

- LongLoRA - A training method for long context windows (100k+ tokens). Helpful when we train the model for summarizing and writing tasks
- LoRA (Low-Rank Adaptation) - Trains small low-rank adapter matrices on selected modules instead of the full weights
- QLoRA - Applies LoRA on top of a frozen, 4-bit quantized base model to reduce GPU memory usage
- GPTQ - Quantizes the model weights without further training or tuning. It is on this list because it is extremely helpful when we want to load a bigger model for inference without further ado, or when we choose to quantize the model ourselves instead of relying on QLoRA
- GGUF - A file format for quantized models that makes it practical to run large models on a CPU

</div>
<div class="tp-mint">3. Dataset</div>
<div class="c-bleu">A crucial part of AI training, datasets should be well-structured and focused.
They typically consist of entries with input and output fields:</div></br>

```python!=
[
    {
        "input": "Need a quick Python script for checking if a word is a Palindrome",
        "output": "Sure… {code}"
    }
]
# Names of entries do not matter, they can be named "sol_1" and "gol_2"
# There can also be several fields: "input", "instruction", "output", "context", "text"
# The key is to recognize which field is 'User' (Question) and which is 'AI' (Answer)
```

<div class="c-bleu">The dataset should be comprehensive for the intended use.</div></br>

**Including diverse topics requires ample data to ensure effective training across all areas**</br>

<div class="c-bleu"> The most common question that arises when looking for a dataset is `Does the data fit the task?` There is no good general answer to it, but you can check for yourself by running a short training.</div></br>

```python
# Here is your training script
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir = ...,
    logging_dir = ...,
    evaluation_strategy = "steps",
    do_eval = True,
    eval_steps = 10,      # Validate every 10 steps during this short run
    save_steps = 100,     # Checkpoint no more than once in this short run
    max_steps = 100,      # Stop after 100 optimization steps
    logging_steps = 10,
    # num_train_epochs=12, # Keep commented out, max_steps controls the run length
    # other arguments
)
```

<div class="c-bleu">This isn't a foolproof method, but it's generally a good idea to run a short training and see how the model performs based on the validation logs (or additional metrics) it produces.</br></br> Data preparation is a large topic, which is why we will get back to it later... </div>
<hr style="height: 2px; background-color: #116799; border: none;" />
<div class="t-blue">Your Own AI: Training</div>
<div class="tp-mint">1. Cloud Backup</div>
<div class="c-bleu">Backing up is essential, yet often overlooked. The rationale for backups is clear - data in the cloud isn't permanent.
Files typically get deleted after you disconnect from the instance, whether intentionally or not.</br></br> Many companies provide cloud storage for backups, but some do not, or backups may not be included with your purchase. Make sure you can transfer the training files to your preferred cloud hosting, ideally at any point during training.</div>
<div class="tp-mint">2. Training Script</div>
<div class="c-bleu">By now, you should have selected the model, dataset, training architecture, cloud service, and the task type for your model.</br></br> Let's dive into programming without further ado:</div> </br>

```bash!
# Install the necessary modules
# Terminal:
pip install -q -U bitsandbytes auto-gptq sentencepiece transformers peft accelerate datasets trl einops GPUtil huggingface_hub tensorboard scipy protobuf
# Notebook:
!pip install -q -U bitsandbytes auto-gptq sentencepiece transformers peft accelerate datasets trl einops GPUtil huggingface_hub tensorboard scipy protobuf
```

If any of the modules cannot be imported with `import module_name`, remove `-q` from the `pip install` command above to turn off quiet mode and see the installer's output.
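
Before running the training script it can help to confirm which modules are actually importable. The snippet below is a minimal sketch using only the standard library; the `REQUIRED` list and the `missing_modules` helper are illustrative names, and the list uses import names, which can differ from the pip package names (for example, `auto-gptq` is imported as `auto_gptq`).

```python
import importlib.util

# Import names for the packages installed above (protobuf is left out because
# it is imported under a dotted name).
REQUIRED = [
    "bitsandbytes", "auto_gptq", "sentencepiece", "transformers", "peft",
    "accelerate", "datasets", "trl", "einops", "GPUtil", "huggingface_hub",
    "tensorboard", "scipy",
]


def missing_modules(names):
    """Return the names that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Missing:", missing_modules(REQUIRED) or "none")
```

Anything reported as missing is a candidate for reinstalling without `-q`.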
```python=
# Import necessary modules
import GPUtil
import os
import time
import torch
import threading
import sys
from transformers import (
    AutoModelForCausalLM,  # If other than llama
    LlamaForCausalLM,      # Only if llama
    AutoTokenizer,         # Will download ModelTokenizer for training
    BitsAndBytesConfig,    # Quantization
    TrainingArguments,
    Trainer,               # If not quantized
    LlamaTokenizer,        # Only if llama
)

# Name of the model for training, Llama-2 as example
model_name = "meta-llama/Llama-2-13b-hf"

# In this example we will quantize the model to 4-bit precision:
compute_dtype = getattr(torch, "float16")
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # For 4-bit quantization
    bnb_4bit_compute_dtype=compute_dtype,
    # "This flag is used for nested quantization where the quantization constants
    # from the first quantization are quantized again"
    # source: https://huggingface.co/docs/transformers/main_classes/quantization
    bnb_4bit_use_double_quant=False,
)

# Now load the tokenizer
tokenizer = LlamaTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'right'  # 'left' may cause model to generate nonsense output as of 03.11.2023

# Finally, download and load the model:
max_retries = 3  # Number of retries before giving up
retry_delay = 5  # Delay between retries in seconds

# Retry downloading of the model on fail
for i in range(max_retries):
    try:
        # If you don't train llama then AutoModelForCausalLM
        model = LlamaForCausalLM.from_pretrained(
            model_name,
            quantization_config=bnb_config,
            device_map={"": 0},       # Load the whole model on cuda index 0 for single-GPU training
            trust_remote_code=False,  # Set True only for repos whose custom model code you trust
        )
        # These settings need the loaded model, so they come after from_pretrained
        model.config.pad_token_id = tokenizer.eos_token_id  # Pad with the end-of-sentence token id
        model.config.use_cache = False   # Set True for inference
        model.config.pretraining_tp = 1  # Standard linear-layer path; higher values replay the pretraining tensor-parallel split and are slower
        break  # Exit the loop if the model is successfully downloaded
    except Exception as e:
        print(f"An exception occurred: {e}")
        if i < max_retries - 1:
            print(f"Retrying in {retry_delay} seconds...")
            time.sleep(retry_delay)
        else:
            print("Max retries reached. Exiting.")
            raise  # Re-raise the last exception to exit the script
```

</br>
<div class="c-bleu">Once the model is downloaded successfully (which may take some time depending on its size), we can proceed with further processing. We'll begin by freezing its weights, examining the internal modules, and assessing the model's trainable parameters.</div></br>

```python=
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
    lora_alpha=64,
    lora_dropout=0.01,
    r=16,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
)
model = get_peft_model(model, peft_config)
print(model)  # Prints all the modules; find the Linear4bit ones and target them (more != better)
model.print_trainable_parameters()  # Shows how many parameters will actually be trained
```

Where:

- `r` is the dimension of the low-rank matrices
- `lora_alpha` is the scaling factor for the low-rank matrices
- `lora_dropout` is the dropout probability of the LoRA layers

<div class="small-gray-text"> <a class="link-hf" href="https://huggingface.co/docs/peft/v0.6.0/en/quicktour#peftconfig">Hugging Face PEFT Documentation<span>🤗</span></a> </div>
<div class="c-bleu"> </br> Let's proceed with a simple implementation of a GPU memory tracker that will tell us in real time how much memory the model is consuming. </div></br>

```python=
def monitor_gpu_usage():
    while True:
        gpu = GPUtil.getGPUs()[0]
        util = gpu.memoryUtil * 100
        free = gpu.memoryFree
        used = gpu.memoryUsed
        total = gpu.memoryTotal
        sys.stdout.write('\033[K')  # Clear to the end of line
        sys.stdout.write(f"\rGPU RAM Free: {free:.0f}MB | Used: {used:.0f}MB | Util {util:3.0f}% | Total {total:.0f}MB")
        sys.stdout.flush()
        time.sleep(5)

# Start GPU monitoring thread
gpu_thread = threading.Thread(target=monitor_gpu_usage)
gpu_thread.daemon = True
gpu_thread.start()
```

</br>
<div class="c-bleu"> We also want to take care of loading the dataset ourselves.

- Lazy loading means data is loaded (supplied to the trainer) on the fly as needed rather than all at once at the beginning. This is beneficial when dealing with large datasets that do not fit into memory
- custom_data_collator is a function that takes a list of samples from the dataset and collates them into a single batch. This is a common requirement for batch processing in PyTorch, as data needs to be gathered into a batch and converted to tensors of the same size

```python=
from datasets import load_dataset
from torch.utils.data import Dataset, random_split

MAX_LENGTH = 512

def custom_data_collator(batch):
    # Each item is the (input_ids, attention_mask) pair returned by OnTheFly
    input_ids = [item[0] for item in batch]
    attn_masks = [item[1] for item in batch]
    return {
        'input_ids': torch.stack(input_ids),
        'attention_mask': torch.stack(attn_masks),
        'labels': torch.stack(input_ids)  # Causal LM: labels are the inputs themselves
    }

class OnTheFly(Dataset):  # Lazy tokenization: each sample is encoded only when requested
    def __init__(self, txt_list, tokenizer):
        self.txt_list = txt_list
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.txt_list)

    def __getitem__(self, idx):
        txt = self.txt_list[idx]
        encodings_dict = self.tokenizer(txt, truncation=True, max_length=MAX_LENGTH, padding="max_length")
        input_ids = torch.tensor(encodings_dict['input_ids'])
        attn_masks = torch.tensor(encodings_dict['attention_mask'])
        return input_ids, attn_masks
```

To load the dataset we will use the standard load_dataset function from the datasets module. It loads the dataset into memory, while OnTheFly assumes the data is already there and hands it to the trainer one sample at a time.
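
To make the lazy-loading idea concrete without a GPU or a real tokenizer, here is a small sketch in plain Python. `LazyTokenized` mirrors the shape of the OnTheFly class, and `toy_tokenize` is a hypothetical stand-in for a real tokenizer (it only splits on whitespace, truncates and pads); neither name comes from any library.

```python
class LazyTokenized:
    """Tokenizes a sample only when it is indexed."""

    def __init__(self, texts, tokenize):
        self.texts = texts
        self.tokenize = tokenize
        self.calls = 0  # How many samples have been tokenized so far

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        self.calls += 1
        return self.tokenize(self.texts[idx])


def toy_tokenize(text, max_length=6, pad="<pad>"):
    """Hypothetical stand-in for a tokenizer: split, truncate, then pad."""
    tokens = text.split()[:max_length]
    return tokens + [pad] * (max_length - len(tokens))


data = LazyTokenized(
    ["print hello world", "check if a word is a palindrome"], toy_tokenize
)
print(len(data), data.calls)  # Length is known, nothing has been tokenized yet
print(data[1])                # This sample is tokenized on access
print(data.calls)
```

The counter shows that creating the dataset costs nothing up front; work happens only when the trainer asks for a sample.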
```python=
training_data = load_dataset("flytech/llama-python-codes-30k", split='train')
# In case you use a different dataset, remember to map the combined fields into one column, ['text'] in this example:
# .map(lambda example: {'text': example['instruction'] + ' ' + example['input'] + ' ' + example['output']})['text']
texts = training_data['text']  # OnTheFly expects a list of strings

# Initialize the dataset
dataset = OnTheFly(texts, tokenizer)

# Change these ratios as needed
train_ratio = 0.95
train_size = int(train_ratio * len(dataset))
train_dataset, val_dataset = random_split(dataset, [train_size, len(dataset) - train_size])
# random_split randomly divides the dataset into a training and a validation set.
# Feel free to skip this and supply the trainer with a different split or dataset.
```

</div>
<div class="tp-mint">3. Training </div></br>
<div class="c-bleu"><b>Here we will:</b></br></br>

1. Use TensorBoard to display our saved logs with visually appealing metrics.
2. Set a batch size of 16, which will consume approximately 24 GB of VRAM with the model.
3. Train for 4 epochs, and possibly more if the model is still learning (as indicated by a decreasing training loss).
4. Enable Hugging Face checkpoints!
5. Use the adamw_bnb_8bit optimizer.
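
It is useful to know roughly how many optimization steps and evaluations a run will produce before starting it. The sketch below is an approximation for a single-GPU run; `training_schedule` is an illustrative helper, the sample count is a hypothetical 95% of a 30,000-entry dataset, and the evaluation interval of 200 steps matches the training arguments used in this section.

```python
import math


def training_schedule(num_samples, batch_size, grad_accum, epochs, eval_steps):
    """Approximate optimizer steps and evaluation count for a single-GPU run."""
    effective_batch = batch_size * grad_accum
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "evaluations": total_steps // eval_steps,
    }


# Hypothetical size: 95% of a 30,000-entry dataset.
schedule = training_schedule(
    num_samples=28_500, batch_size=16, grad_accum=1, epochs=4, eval_steps=200
)
print(schedule)
```

If the number of evaluations looks too low to judge progress, lower the evaluation interval rather than the batch size.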
```python=
from transformers.integrations import TensorBoardCallback
from trl import SFTTrainer

training_arguments = TrainingArguments(
    output_dir="/content/Ruckus-13b-24",
    logging_dir="/content/Ruckus-13b-24",
    evaluation_strategy="steps",
    do_eval=True,  # Perform evaluation (disable for no evaluation at all)
    save_total_limit=14,
    per_device_train_batch_size=16,  # a
    gradient_accumulation_steps=1,   # b
    per_device_eval_batch_size=16,
    num_train_epochs=4,
    optim="adamw_bnb_8bit",
    save_steps=200,     # Doesn't have to be the same as eval and logging steps, but it's recommended
    logging_steps=200,  # Same as eval steps
    learning_rate=2e-4, # Extremely important: defines how fast the model learns, and can destabilize training if set badly
    eval_steps=200,     # Evaluate on the validation dataset every 200 optimization steps
    eval_accumulation_steps=2,
    fp16=False,  # Set to True (and bf16 to False) on GPUs without bf16 support
    bf16=True,   # Set to True only on A100 GPUs and newer (Ampere)
    #max_grad_norm=1.0,  # gradient clipping
    #weight_decay=0.01,  # weight decay, a regularizer like LoRA dropout
    #warmup_ratio=0.1,   # use along with scheduler type linear or cosine
    lr_scheduler_type="constant",
    save_safetensors=True,
    push_to_hub=True,
    hub_model_id="Your-AI-Name",
    hub_token="hf_herusygkj1823saddngb118",  # Fake one, but similar. Create a Hugging Face account and go to Settings->Access tokens
    hub_strategy="checkpoint",  # Enable checkpoints
    remove_unused_columns=True,
    dataloader_num_workers=4    # Utilize CPU cores
)
# The total batch size per device is per_device_train_batch_size × gradient_accumulation_steps (a*b)

tensorboard_callback = TensorBoardCallback()  # <- Callback for pretty metrics

trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    dataset_text_field="text",
    peft_config=peft_config,
    max_seq_length=512,  # <- Max length of a dataset entry; longer entries are truncated to this length (in tokens)
    tokenizer=tokenizer,
    data_collator=custom_data_collator,
    packing=False,
    # callbacks=[tensorboard_callback]
)
```

</br>
Now the best part. Let's start the training by calling the train() method on our Supervised Fine-Tuning Trainer. More Trainer classes are available; we are using the most common one for the sake of example:</br> </br>

```python=
trainer.train()
# or, to resume from a checkpoint, call this instead:
# trainer.train(resume_from_checkpoint="/workspace/checkpoint-xxx")
```

</div>
<div class="tp-mint">After successfully training the model we can save its weights as an adapter, here's how:</div>

```python=
# First let's save the model's weights:
output_dir = os.path.join("/workspace/MyAi-13b-4bit", "final_checkpoint")
trainer.save_model("/workspace/MyAi-13b-4bit")
# safe_serialization saves the adapter in safetensors format
trainer.model.save_pretrained(output_dir, safe_serialization=True)
# torch.save(model.state_dict(), "/workspace/MyAi-13b-4bit/MyAi-13b-4bit.pth")  # Save the model weights in .pth if you like to
```

<div class="tp-mint">Now we are going to merge trained adapter and model: </div>

```python=
import gc

# Release the training objects so their GPU memory can be reclaimed
del model, trainer
gc.collect()
torch.cuda.empty_cache()

# Reload the base model together with the trained adapter
model = AutoPeftModelForCausalLM.from_pretrained(output_dir, device_map="auto", torch_dtype=torch.bfloat16)

# Merge the adapter into the base weights and unload the PEFT wrapper
model = model.merge_and_unload()

# Save the merged model
model.save_pretrained(output_dir)
```

<hr style="height: 2px; background-color: #116799; border: none;" />
<div class="c-bleu"> Don't forget to upload the merged model to the Hugging Face Hub, so you can easily launch the inference:</div> </br>

```python=
hub_name = 'MyAi-13b-4bit'
model.push_to_hub(hub_name)
tokenizer.push_to_hub(hub_name)
```

</br>
<div class="t-blue">Your Own AI: Inference</div> </br>
<b><div class="tp-mint">What is inference?</div></b>
<div class="c-bleu">Inference refers to the process of using a trained neural network model to make predictions based on new input data. This phase follows the training, where the model has learned to recognize patterns or perform tasks. During inference, the model's weights are frozen, and it is used to interpret unseen data. The ultimate aim is to generalize the learning from the training phase to real-world applications, such as language translation, image recognition, or in our case, generating text.</div></br>
<div class="c-bleu">Inference can be performed in various ways depending on the application. Below we'll explore two methods (running a Flask application with ngrok, and a while True loop that takes user input) to demonstrate how our trained model can be interacted with in real time.</div>

### <div class="mini-title">Using Flask and ngrok for Inference</div>

```python
# Notebook: !pip install flask_ngrok
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
import torch
import GPUtil
import time
import threading
import sys
from flask import Flask, request
from flask_ngrok import run_with_ngrok

# Function to monitor GPU usage
def monitor_gpu_usage():
    while True:
        gpu = GPUtil.getGPUs()[0]
        util = gpu.memoryUtil * 100
        free = gpu.memoryFree
        used = gpu.memoryUsed
        total = gpu.memoryTotal
        sys.stdout.write('\033[K')  # Clear to the end of line
        sys.stdout.write(f"\rGPU RAM Free: {free:.0f}MB | Used: {used:.0f}MB | Util {util:3.0f}% | Total {total:.0f}MB")
        sys.stdout.flush()
        time.sleep(2)

# Start the GPU monitoring thread
gpu_thread = threading.Thread(target=monitor_gpu_usage)
gpu_thread.daemon = True
gpu_thread.start()

# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("flytech/Ruckus-PyAssi-13b")
device = "cuda:0"
model = AutoModelForCausalLM.from_pretrained(
    "flytech/Ruckus-PyAssi-13b",
    load_in_4bit=True,
    device_map="auto",
    bnb_4bit_compute_dtype=torch.float16,
)
model.config.use_cache = True

# Define the generation configuration
generation_config = GenerationConfig(
    temperature=0.95,
    top_p=0.92,
    # num_beams=4,
    # no_repeat_ngram_size=4,
    decoder_start_token_id=tokenizer.bos_token_id,
    do_sample=True,
    max_new_tokens=1024,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

# Create Flask app and apply ngrok
app = Flask(__name__)
run_with_ngrok(app)

# Define route for generation
@app.route("/generate", methods=["POST"])
def generate():
    data = request.get_json()
    prompt = data.get("prompt", "")
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    inputs = {key: tensor.to(device) for key, tensor in inputs.items()}
    # Pass the config object itself rather than unpacking it into keyword arguments
    outputs = model.generate(**inputs, generation_config=generation_config)
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return {"generated_text": generated_text}

if __name__ == '__main__':
    app.run()
```

<div class="c-bleu">In the above snippet, we use Flask to create a web server and ngrok to tunnel our local server to a public URL, enabling external access. The application monitors GPU usage in a separate thread, so we have real-time updates on the resources consumed by our AI model.
The /generate endpoint takes input via a POST request and uses our model to generate text based on the prompt.</div></br>
<div class="mini-title">Simple Inference</div>

```python=
import gc, torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import GPUtil
import time
import threading
import sys

# Function to monitor GPU usage statistics
def monitor_gpu_usage():
    while True:
        gpu = GPUtil.getGPUs()[0]    # Get the first GPU
        util = gpu.memoryUtil * 100  # Calculate utilization percentage
        free = gpu.memoryFree        # Free memory in MB
        used = gpu.memoryUsed        # Used memory in MB
        total = gpu.memoryTotal      # Total memory in MB
        sys.stdout.write('\033[K')   # ANSI escape sequence to clear line
        sys.stdout.write(f"\rGPU RAM Free: {free:.0f}MB | Used: {used:.0f}MB | Util {util:3.0f}% | Total {total:.0f}MB")
        sys.stdout.flush()           # Flush the standard output
        time.sleep(1)                # Wait for 1 second before the next check

# Create and start the GPU monitoring thread
gpu_thread = threading.Thread(target=monitor_gpu_usage)
gpu_thread.daemon = True  # Set the thread as a daemon
gpu_thread.start()

# Load tokenizer and model (replace the name with your own HF repository)
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-7b1")
device = "cuda:0"  # Load on single GPU
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-7b1",
    load_in_4bit=True,       # Load the model in 4-bit
    device_map="auto",       # Automatic GPU placement
    trust_remote_code=True,  # Allows custom model code from the repo; enable only for repos you trust
    bnb_4bit_compute_dtype=torch.float16  # Set compute data type to float16
)
model.config.use_cache = True  # Enable for inference, as it was False in training

# Infinite loop to continuously process input prompts
while True:
    text = input("Enter your prompt: ")
    inputs = tokenizer(text, return_tensors="pt")  # Tokenize the input text
    inputs = {key: tensor.to(device) for key, tensor in inputs.items()}  # Move tensors to the defined device
    outputs = model.generate(
        **inputs,
        num_beams=4,             # Example beam search with 4 beams
        no_repeat_ngram_size=4,  # Avoid repeating ngrams of size 4
        decoder_start_token_id=tokenizer.bos_token_id,  # Start token for decoding
        do_sample=False,         # Disable random sampling to always get the same output for a given prompt
        max_new_tokens=160,      # Maximum number of new tokens to generate
        num_return_sequences=1,  # Return only one sequence
        eos_token_id=tokenizer.eos_token_id,  # End of sequence token ID
        pad_token_id=tokenizer.pad_token_id,  # Padding token ID
    )
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)  # Decode the output tokens to string
    print(generated_text)
    del inputs, outputs       # Delete variables to free memory
    torch.cuda.empty_cache()  # Empty the CUDA cache
    gc.collect()              # Run garbage collection
```

</div>
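
For the Flask variant, a client only needs to POST a JSON body with a `prompt` field to `/generate` and read `generated_text` from the reply. Below is a minimal standard-library sketch of such a client; the helper names and the `http://localhost:5000` address are assumptions for illustration, and in practice you would use the public URL that ngrok prints when the app starts.

```python
import json
import urllib.request


def build_generate_request(base_url: str, prompt: str) -> urllib.request.Request:
    """Build a POST request carrying {"prompt": ...} as JSON for the /generate route."""
    payload = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the "generated_text" field of the reply."""
    req = build_generate_request(base_url, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["generated_text"]


# Hypothetical address; building the request does not contact the server.
request = build_generate_request("http://localhost:5000", "Write a palindrome checker")
print(request.get_method(), request.full_url)
```

Calling `generate(url, prompt)` performs the actual network round trip, so it only works while the Flask app is running.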
