# **What is TensorRT-LLM?**

What it is and how it is different should be answered in the first two sentences. Many people in the local LLM space are only familiar with GGUF and aren't aware that other engines exist.

- There is another engine, TensorRT-LLM, built by Nvidia, and it is the fastest way to run models on Nvidia GPUs.
- TensorRT-LLM is more difficult to work with and requires Nvidia GPUs with enough VRAM. For popular 7b models like Mistral, that means a 4090 or equivalent (laptop GPUs won't work).
- In exchange, TensorRT-LLM is faster: the model is compiled ahead of time for the specific GPU, with optimizations such as fused kernels, in-flight batching, and a paged KV cache (see the usage sketch at the end of these notes).
- One analogy I would like to float: TensorRT-LLM is like C++ or Rust (compiled ahead of time for specific hardware), while GGUF is like Java and the JVM (portable, runs almost anywhere).

**1. Model Size and Memory**

- The key obstacle for TensorRT-LLM is the size of its compiled models, which it calls TensorRT-LLM engines.
- Explain what goes into a TensorRT-LLM engine and why its size is larger.
- Explain that the whole engine needs to fit into VRAM, which means that realistically only larger GPUs can hold it (see the sizing sketch at the end of these notes):
  - Laptop GPUs = 1.1b models = compare GGUF size vs. TensorRT-LLM engine size
  - 4090s or 3090s = 7b models = fill in sizes
  - H100s = this is what TensorRT-LLM is built for, and where it really shines

**2. Performance**

Compare the performance of TensorRT-LLM vs. GGUF:

- TinyLlama-1.1b at fp16, which fits into a 4050 laptop GPU = ~2 GB in VRAM.
- For the GGUF side, there is no "q16" quant; use an fp16 GGUF if one exists, otherwise q8 = ~1 GB in RAM.
- Show the difference in tokens/sec for (a) TensorRT-LLM, (b) GGUF on GPU, (c) GGUF on CPU (see the benchmark sketch at the end of these notes).
- Show the difference in inference quality on the same hardware (e.g. q8 vs. fp16). I have included the screenshots for the "quality of thought" differences above.

# **What do we think TensorRT-LLM is useful for?**

1. Laptop inference
   - If you have a laptop GPU, it is possible to run small models at fp16.
   - GGUF q4 quants are still better for 7b models, many of which don't fit into laptop VRAM at fp16.
   - In the future, it may be possible to train and serve small models this way.
2. Desktop inference
   - 4090s and 3090s can already hold 7b engines, so they are the realistic entry point today.
   - Nvidia will likely ship consumer GPUs with more VRAM in the future (link out to 5090 rumors).
   - We're also likely to see TensorRT-LLM engine sizes shrink.
3. H100s
   - This is where TensorRT-LLM really shines and delivers the fastest inference.
   - Nvidia is likely to keep optimizing TensorRT-LLM for its own hardware stack.

Reference for styling: https://developer.nvidia.com/blog/nvidia-tensorrt-llm-supercharges-large-language-model-inference-on-nvidia-h100-gpus/
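To flesh out the "what it is" section, here is a minimal sketch of running a model through TensorRT-LLM's high-level Python `LLM` API. The model name, prompt, and sampling parameters are placeholders, and the API surface differs between TensorRT-LLM versions, so treat this as an illustration of the workflow rather than a definitive recipe: the checkpoint is compiled into a TensorRT engine targeting the local GPU, which is where the extra build step and larger on-disk artifact come from.

```python
# Sketch: running TinyLlama through TensorRT-LLM's high-level Python API.
# Needs an Nvidia GPU with enough VRAM to hold the fp16 engine (~2 GB here).
from tensorrt_llm import LLM, SamplingParams

# Constructing the LLM compiles the Hugging Face checkpoint into a TensorRT
# engine for this specific GPU, which is the expensive ahead-of-time step.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

prompts = ["Explain what a TensorRT engine is in one sentence."]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```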
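The VRAM numbers in section 1 mostly follow from simple arithmetic: parameter count times bytes per weight. The helper below is a back-of-the-envelope estimate of weight memory only; KV cache, activations, and runtime buffers come on top, and the model list is just the examples used in these notes.

```python
# Back-of-the-envelope weight memory: parameters x bytes per weight.
# This counts weights only; KV cache, activations, and runtime buffers
# add to the total, so treat these as lower bounds.
def weight_gb(params_billion: float, bits_per_weight: int) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for name, params, bits in [
    ("TinyLlama 1.1b @ fp16", 1.1, 16),
    ("TinyLlama 1.1b @ q8", 1.1, 8),
    ("Mistral 7b @ fp16", 7.0, 16),
    ("Mistral 7b @ q4", 7.0, 4),
]:
    print(f"{name}: ~{weight_gb(params, bits):.1f} GB")
```

This is why a 7b model at fp16 (~14 GB of weights) is out of reach for a typical 8 GB laptop GPU but comfortable on a 24 GB 4090, while a q4 GGUF of the same model fits on the laptop.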
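For the tokens/sec comparison in section 2, a rough harness like the one below covers the GGUF side using llama-cpp-python; the GGUF file path, prompt, and token count are placeholders, and the GPU run only offloads layers if llama-cpp-python was built with CUDA support. The TensorRT-LLM number would come from timing the `generate()` call in the first sketch the same way.

```python
# Rough tokens/sec harness for the GGUF side of the comparison, using
# llama-cpp-python; n_gpu_layers=-1 offloads everything to the GPU, 0 stays on CPU.
import time

from llama_cpp import Llama

MODEL_PATH = "tinyllama-1.1b-chat-v1.0.Q8_0.gguf"  # placeholder path

def tokens_per_second(n_gpu_layers: int, prompt: str = "Tell me about GPUs.") -> float:
    llm = Llama(model_path=MODEL_PATH, n_gpu_layers=n_gpu_layers,
                n_ctx=2048, verbose=False)
    start = time.perf_counter()
    out = llm(prompt, max_tokens=128)
    elapsed = time.perf_counter() - start
    # The completion dict reports how many tokens were actually generated.
    return out["usage"]["completion_tokens"] / elapsed

print(f"GGUF on GPU: {tokens_per_second(-1):.1f} tok/s")
print(f"GGUF on CPU: {tokens_per_second(0):.1f} tok/s")
```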