Over the last 7 months, our team has focused on adapting Large Language Models (LLMs) to a wide range of CPUs and GPUs. Jan is powered by [Nitro](https://nitro.jan.ai/), an open-source inference engine that runs across operating systems and hardware.
Today, we're introducing Jan Server, an enterprise-focused offering built for versatility and high performance. Jan Server supports diverse AI applications, especially LLMs, by integrating NVIDIA TensorRT-LLM and leveraging NVIDIA's GPU infrastructure.
## Introduction to TensorRT-LLM
TensorRT-LLM is a cutting-edge tool designed to streamline the deployment of LLMs on NVIDIA GPUs, offering:
- **Python API for Ease of Use**: Simplifies the process of defining LLMs and constructing TensorRT engines with advanced optimizations for efficient inference (see the sketch after this list).
- **Optimization for NVIDIA GPUs**: Incorporates [NVIDIA TensorRT's compiler](https://github.com/NVIDIA/TensorRT), specialized kernels, and multi-GPU communication primitives to enhance performance on Tensor Core GPUs.
- **Performance Enhancements**: Achieves notable improvements in both initialization speed (time to first token) and runtime efficiency (time per output token).
- **Advanced Quantization Support**: Supports INT4/INT8 weight quantization alongside FP16 and [BF16](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format) activations, and implements the [SmoothQuant](https://arxiv.org/abs/2211.10438) technique for an optimal balance of performance and accuracy.
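To give a feel for the Python API mentioned above, here is a minimal sketch, assuming a recent TensorRT-LLM release that ships the high-level `LLM` API (older releases build engines explicitly through the builder and runtime). The model name and sampling values are illustrative only.

```python
from tensorrt_llm import LLM, SamplingParams

# Builds (or loads) a TensorRT engine for the given model and runs
# optimized inference on it. The model id here is illustrative.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

for output in llm.generate(["What does in-flight batching do?"], params):
    print(output.outputs[0].text)
```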
## Key features
### In-flight Batching
Facilitates dynamic request handling via the Batch Manager, improving GPU utilization by scheduling new requests into freed slots instead of waiting for every slot in the batch to become idle. More details can be found in the [Batch Manager documentation](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/batch_manager.md).

Fig 1. In-Flight Batching Demonstration: The batch manager initiates processing of subsequent sequences (Seq 4 and 5) in available slots without delay, instead of pausing until all slots become idle, which might be caused by the extended duration of Seq 2.
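To make the idea behind Fig 1 concrete, here is a toy Python sketch of the scheduling policy only, not the Batch Manager API itself: a freed slot is handed to the next queued request immediately, so short sequences (Seq 4 and 5) start while a long one (Seq 2) is still decoding. The slot count and step counts are made up.

```python
from collections import deque

MAX_SLOTS = 3  # hypothetical number of concurrent sequences the engine can hold

def run_engine(requests):
    """requests: list of (name, remaining_decode_steps)."""
    queue, slots, step = deque(requests), {}, 0
    while queue or slots:
        # Fill any free slot immediately (the "in-flight" part).
        for slot_id in range(MAX_SLOTS):
            if slot_id not in slots and queue:
                name, steps = queue.popleft()
                slots[slot_id] = [name, steps]
                print(f"step {step}: slot {slot_id} starts {name}")
        # One decoding iteration over all active sequences.
        for slot_id, entry in list(slots.items()):
            entry[1] -= 1
            if entry[1] == 0:
                print(f"step {step}: slot {slot_id} finished {entry[0]}")
                del slots[slot_id]  # slot is recycled on the next step
        step += 1

run_engine([("Seq 1", 2), ("Seq 2", 6), ("Seq 3", 3), ("Seq 4", 2), ("Seq 5", 2)])
```

Running the sketch shows Seq 4 and Seq 5 claiming slots as soon as Seq 1 and Seq 3 finish, long before Seq 2 completes.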
### Optimized Attention
TensorRT-LLM enhances GPT-like model performance through advanced attention mechanisms:
- **Multi-head Attention (MHA)**: Follows the "Attention Is All You Need" paper, involving batched matrix multiplication, softmax, and another batched matrix multiplication.
- **[Multi-query Attention (MQA)](https://arxiv.org/abs/1911.02150)** and **[Group-query Attention (GQA)](https://arxiv.org/abs/2307.09288)**: These variants of MHA use fewer key/value (K/V) heads compared to the number of query heads, optimizing computational efficiency.
- **Paged KV Cache**: Segments the KV cache into blocks managed by a cache manager, enhancing memory efficiency by dynamically allocating and recycling cache blocks as requests are processed. It works like virtual memory, adjusting space as needed and cutting memory waste to roughly 4% from the usual 60-80% (a toy sketch of the idea follows this list).
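The sketch below illustrates the paged-KV-cache idea only; it is not TensorRT-LLM's actual cache manager. Fixed-size cache blocks are handed out on demand and recycled when a sequence finishes, so short and long sequences share one pool instead of each reserving worst-case contiguous space. Block and pool sizes are made up.

```python
BLOCK_TOKENS = 16        # tokens stored per cache block (illustrative value)
TOTAL_BLOCKS = 8         # size of the shared block pool (illustrative value)

free_blocks = list(range(TOTAL_BLOCKS))   # pool of physical block ids
block_table = {}                          # sequence id -> list of block ids

def append_token(seq_id, tokens_so_far):
    """Grab a new block only when the current one is full."""
    if tokens_so_far % BLOCK_TOKENS == 0:           # current block exhausted
        block_table.setdefault(seq_id, []).append(free_blocks.pop())

def finish(seq_id):
    """Recycle every block of a finished sequence back into the pool."""
    free_blocks.extend(block_table.pop(seq_id, []))

for t in range(40):            # long sequence: 40 tokens -> 3 blocks
    append_token("long", t)
for t in range(5):             # short sequence: 5 tokens -> 1 block
    append_token("short", t)
print(block_table)             # e.g. {'long': [7, 6, 5], 'short': [4]}
finish("short")                # its block returns to the pool immediately
print(free_blocks)
```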
### Quantization Support
TensorRT-LLM implements several quantization strategies to balance performance and accuracy:
- **INT8 SmoothQuant (W8A8)** maintains network accuracy using INT8 for both activations and weights, offering up to a 1.56x speedup and 2x memory reduction with minimal accuracy loss. [SmoothQuant details](https://arxiv.org/abs/2211.10438). A short sketch of the scaling idea appears below.
- [**AWQ**](https://arxiv.org/abs/2306.00978) (W4A16) is a [post-training quantization](https://www.tensorflow.org/lite/performance/post_training_quantization) method for 4-bit weights that aims to minimize the error introduced by quantization. During inference, weights are dynamically dequantized to FP16, giving strong GPU inference performance with low memory overhead. This makes it particularly effective where memory is limited and computational efficiency is paramount. It is an alternative to [GPTQ](https://arxiv.org/abs/2210.17323).
More details are available in the [support matrix](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/precision.md#support-matrix).
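The SmoothQuant trick referenced above can be illustrated in a few lines of NumPy: a per-channel scale migrates activation outliers into the weights, leaving the layer's output mathematically unchanged while making both tensors friendlier to INT8. The shapes, values, and `alpha` below are illustrative, not TensorRT-LLM defaults.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))
X[:, 2] *= 50.0                 # one channel with large activation outliers
W = rng.normal(size=(6, 3))

alpha = 0.5                     # migration strength, as in the SmoothQuant paper
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

X_smooth = X / s                # activations divided per input channel
W_smooth = W * s[:, None]       # weights absorb the scale

# The linear layer's output is unchanged...
assert np.allclose(X @ W, X_smooth @ W_smooth)
# ...but the outlier channel is now far smaller, so an INT8 activation
# quantizer loses much less precision.
print(np.abs(X).max(axis=0).round(1))
print(np.abs(X_smooth).max(axis=0).round(1))
```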
### Parallelism and Decoding Strategies
TensorRT-LLM supports multiple parallelism strategies for scaling model inference:
- **Tensor Parallelism:** Splits model tensors across GPUs so each device computes a shard of every layer (see the sketch after this list).

- **Pipeline Parallelism:** Divides model layers or stages across GPUs, optimizing workload distribution.
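The NumPy sketch below illustrates the tensor-parallelism bullet: a weight matrix is split column-wise across two simulated GPUs, each computes its shard of the output, and concatenating the shards reproduces the single-device result (in a real deployment the gather happens over NCCL/NVLink). Shapes and the two-way split are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))            # a batch of hidden states
W = rng.normal(size=(8, 6))            # full weight matrix of a linear layer

W0, W1 = np.split(W, 2, axis=1)        # column shards held by GPU 0 and GPU 1
y0 = x @ W0                            # partial result computed on GPU 0
y1 = x @ W1                            # partial result computed on GPU 1

y = np.concatenate([y0, y1], axis=1)   # gather the shards
assert np.allclose(y, x @ W)           # identical to the single-GPU result
```

Pipeline parallelism, by contrast, keeps each layer whole and assigns contiguous groups of layers to different GPUs, streaming batches through them stage by stage.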

### Decoding Strategies
**Rotary Positional Embedding (RoPE)**: adds relative position information to the model's self-attention by rotating query/key features according to token position. It adapts to any sequence length, lets inter-token dependence decay with distance, and makes it possible to extend a model's context window, for example from 4k to 8k or 16k tokens, with minimal performance loss.
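A minimal NumPy sketch of RoPE, assuming the common "rotate-half" formulation used by LLaMA-style models rather than TensorRT-LLM's fused kernel: each pair of feature dimensions is rotated by an angle proportional to the token position, so relative positions show up directly in the query-key dot products.

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary embedding to x of shape (seq_len, head_dim), head_dim even."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair rotation frequencies
    angles = np.outer(np.arange(seq_len), freqs)     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

q = np.random.default_rng(0).normal(size=(5, 8))     # 5 positions, head_dim 8
print(rope(q).shape)                                  # (5, 8)
```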
### Model Support
TensorRT-LLM is compatible with a wide array of models, from decoder-only and encoder-only architectures to multi-modal frameworks:
- **Decoder-Only Models:** LLaMA family, GPT series, Mistral, etc.
- **Encoder-Only Models:** BERT, BART, etc.
- **Encoder-Decoder Models:** T5
- **Multi-Modal Models:** BLIP2, LLaVA-v1.5-7B, Nougat series.
---
# TODO
## Using with Jan
### Chat
- How easy is it to use TensorRT-LLM with Jan?
Idea: Easily switch the backend from Nitro to TensorRT-LLM...
- Performance?
Idea: Show some numbers
**Image using with Jan server**
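One possible way to show this (a hypothetical sketch, since the integration is still TODO): Jan's local server speaks an OpenAI-compatible API, so swapping the backend from Nitro to TensorRT-LLM should leave client code untouched. The port, endpoint, and model id below are assumptions.

```python
import requests

# Hypothetical request against a local Jan server; endpoint, port, and
# model id are assumptions for illustration, not confirmed defaults.
resp = requests.post(
    "http://localhost:1337/v1/chat/completions",
    json={
        "model": "mistral-ins-7b-q4",
        "messages": [{"role": "user", "content": "Hello from Jan + TensorRT-LLM"}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```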
### RAG
- How easy is it to use RAG with TensorRT-LLM in Jan?
Idea: A few clicks to perform RAG in Jan with TensorRT-LLM
Ref: https://github.com/NVIDIA/trt-llm-rag-windows#readme
---
References:
- https://developer.nvidia.com/blog/optimizing-inference-on-llms-with-tensorrt-llm-now-publicly-available/
- https://nvidia.github.io/TensorRT-LLM/gpt_attention.html
- https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/precision.md
- https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/nemo_megatron/parallelisms.html#tensor-parallelism
- https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance.md