# Jan <> Nvidia on TensorRT-LLM

## Questions

### Asks

- Sample code or guides for using TensorRT-LLM purely in C++
- Leverage Triton Inference Server
  - Don't re-invent the wheel
    - OS kernel
    - Drivers (e.g. cuDNN)
- Look into the NeMo-Megatron codebase
- TensorRT = quantization & compilation, distillation of models
- NeMo = training, tuning, alignment
- Triton = serving, deployment, dynamic loading

### "Seamless Install"

- We want a fully automated install that removes the currently tedious manual Windows setup (e.g. CUDA Toolkit, Microsoft MPI)
  - Is there a way to automate the installation of these dependencies on Windows?
- We looked at Chat RTX: it appears to use an Anaconda-based installer and ships the Windows-specific `pip` packages and OS dependencies in a `.tar.gz`
- We want to explore the pure C++ route, as we are mainly a C++ shop (see the single-GPU sketch in the appendix below)

### Multiple GPU Support

- The biggest complaint among our desktop TensorRT-LLM users is that the converted models are too big for their GPUs' VRAM
- Additionally, the current tutorials only cover single-GPU inference
- Is it possible to run multi-GPU setups on Windows? We think it may be possible through Microsoft MPI (see the MPI sketch in the appendix below)

### Precompiled TensorRT-LLM Models?

- We intend to ship a Model Converter as part of our TensorRT-LLM Extension
- Would Nvidia be interested in working together to "precompile" TensorRT-LLM binaries for common hardware types? (see the architecture-detection sketch in the appendix below)
- In our experience, users just want to download a binary and run it (vs. importing PyTorch and running a conversion)

### "Universal" Binaries?

- Can the TensorRT-LLM C++ runtime be compiled into a single binary (except, perhaps, the CUDA runtime)?
  - See PowerShell script
- Is there a way to compile MPI into that single binary, so users don't need a separate install?
  - See PowerShell script
- Are compiled binaries GPU-specific (e.g. 4090s vs A6000s)?
  - Hardware-intensive
  - Engines need to be built for specific hardware
  - TBD (redacted)

### Python vs C++

- Will the C++ runtime implement the `kv-cache` in the future?
  - Might be in Triton; will not be in TensorRT-LLM
  - But this is in Python?
- How will the roadmaps of the TensorRT-LLM Python and C++ runtimes differ going forward?
  - A lot of functionality exists in Python, but there are no docs on how to achieve it with the C++ runtime
  - Python is needed for the engine-building phase

## Discussion Notes

- Nvidia will release a PowerShell script to help with installation
  - Will let Jan know when it is merged to main
  - Can we check out the WIP PR?
- Consumer GPU limitations
  - Recommend professional cards (RTX series)
  - Consumer cards: a "stop-start" journey
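## Appendix: Illustrative C++ Sketches

### C++ runtime, single GPU

A minimal sketch of the "pure C++ route" raised under Asks: load a prebuilt engine and generate tokens through the TensorRT-LLM C++ executor API. The names used here (`Executor`, `Request`, `awaitResponses`) follow the upstream `tensorrt_llm/executor/executor.h` header, but the exact signatures vary by release, and the engine path and token IDs are placeholders; treat this as an assumption-laden illustration, not a reference implementation.

```cpp
#include <iostream>

#include "tensorrt_llm/executor/executor.h"
#include "tensorrt_llm/plugins/api/tllmPlugin.h"

namespace tle = tensorrt_llm::executor;

int main()
{
    // Register the TensorRT-LLM plugins before deserializing any engine.
    initTrtLlmPlugins();

    // Engine directory produced offline by the (Python) engine-build step;
    // placeholder path.
    auto executor = tle::Executor("/path/to/engine_dir",
        tle::ModelType::kDECODER_ONLY, tle::ExecutorConfig());

    // Token IDs normally come from a tokenizer; hard-coded for brevity.
    tle::VecTokens inputTokens{1, 2, 3, 4};
    auto requestId = executor.enqueueRequest(
        tle::Request(inputTokens, /*maxNewTokens=*/16));

    // Block until the request completes, then print the generated token IDs
    // for the first (and, with default beam width, only) beam.
    for (auto const& response : executor.awaitResponses(requestId))
    {
        if (response.hasError())
        {
            std::cerr << response.getErrorMsg() << "\n";
            return 1;
        }
        for (auto token : response.getResult().outputTokenIds.at(0))
        {
            std::cout << token << " ";
        }
    }
    std::cout << "\n";
    return 0;
}
```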
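### Multi-GPU launch under MPI

For the multi-GPU question: TensorRT-LLM's documented pattern for tensor parallelism is one process per GPU launched under MPI, which on Windows would mean Microsoft MPI (`mpiexec -n 2 app.exe`). The sketch below shows only the MPI bootstrapping around the executor; the assumption, not confirmed in the meeting, is that an engine built with tensor-parallel size N runs unchanged under N ranks.

```cpp
#include <mpi.h>

#include <cstdio>

int main(int argc, char** argv)
{
    // One process per GPU; launched with e.g. `mpiexec -n 2 app.exe`
    // (MS-MPI on Windows, Open MPI / MPICH elsewhere).
    MPI_Init(&argc, &argv);

    int rank = 0;
    int worldSize = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &worldSize);

    std::printf("tensor-parallel rank %d of %d starting\n", rank, worldSize);

    // Assumption: every rank constructs the executor against the same engine
    // directory (built with a matching tensor-parallel size), and rank 0 is
    // the one that enqueues requests and collects responses.
    // ... tle::Executor construction as in the single-GPU sketch ...

    MPI_Finalize();
    return 0;
}
```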
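### Picking a precompiled engine per GPU architecture

For the precompiled-binaries ask and the "are compiled binaries GPU-specific?" question: TensorRT engines are tied to the compute capability they were built for (e.g. sm_89 on a 4090 vs sm_86 on an RTX A6000), so a downloader would need to select the artifact matching the local GPU. This sketch uses the standard CUDA runtime API to detect the architecture; the engine naming scheme is hypothetical.

```cpp
#include <cuda_runtime_api.h>

#include <iostream>
#include <string>

int main()
{
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, /*device=*/0) != cudaSuccess)
    {
        std::cerr << "no CUDA-capable device found\n";
        return 1;
    }

    // TensorRT engines are built per compute capability, so an RTX 4090
    // (sm_89) and an RTX A6000 (sm_86) need separate precompiled artifacts.
    std::string arch =
        "sm" + std::to_string(prop.major) + std::to_string(prop.minor);

    // Hypothetical artifact layout for a precompiled-engine catalog.
    std::string engineDir = "engines/model-7b-" + arch;

    std::cout << "detected " << prop.name << " (" << arch << "); "
              << "would download " << engineDir << "\n";
    return 0;
}
```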