# Jan <> Nvidia on TensorRT-LLM
## Questions
### Asks
- Sample code or guides for TensorRT-LLM purely in C++ (see the sketch after this list)
- Leverage Triton Inference Server
- Don't re-invent the wheel:
  - OS kernel
  - Drivers
  - e.g. cuDNN
- Look into the Nemo-Megatron codebase
- TensorRT = quantization & compilation, model distillation
- Nemo = training, tuning, alignment
- Triton = serving, deployment, dynamic loading
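
Toward the first ask, a minimal sketch of what "purely in C++" could look like: deserializing a prebuilt engine with only the TensorRT C++ API, no Python at runtime. This is our sketch using the plain TensorRT API as a stand-in (we haven't confirmed the TensorRT-LLM C++ entry points); the `llama.engine` path is a placeholder, and we assume TensorRT 8+, where `delete` on runtime objects is supported.

```cpp
// Minimal sketch: load a prebuilt engine from C++ only, no Python at runtime.
// Uses the plain TensorRT API as a stand-in; "llama.engine" is a placeholder.
#include <NvInfer.h>

#include <fstream>
#include <iostream>
#include <memory>
#include <vector>

// TensorRT requires an ILogger implementation; this one prints warnings and errors.
class Logger : public nvinfer1::ILogger
{
    void log(Severity severity, char const* msg) noexcept override
    {
        if (severity <= Severity::kWARNING)
            std::cerr << msg << '\n';
    }
};

int main()
{
    Logger logger;

    // Read the serialized engine produced offline by the (Python) build step.
    std::ifstream file("llama.engine", std::ios::binary | std::ios::ate);
    if (!file)
    {
        std::cerr << "engine file not found\n";
        return 1;
    }
    std::vector<char> blob(static_cast<size_t>(file.tellg()));
    file.seekg(0);
    file.read(blob.data(), static_cast<std::streamsize>(blob.size()));

    // Deserialize into an executable engine; inference would run from here.
    std::unique_ptr<nvinfer1::IRuntime> runtime{nvinfer1::createInferRuntime(logger)};
    std::unique_ptr<nvinfer1::ICudaEngine> engine{
        runtime->deserializeCudaEngine(blob.data(), blob.size())};
    if (!engine)
    {
        std::cerr << "deserialization failed\n";
        return 1;
    }
    std::cout << "engine loaded\n";
    return 0;
}
```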
### "Seamless Install"
- We want to build a fully automated installer that removes the currently tedious Windows setup (e.g. CUDA Toolkit, Microsoft MPI)
- Is there a way to automate the installation of dependencies for Windows?
- We looked at Chat RTX; it appears to use an Anaconda-based installer and to bundle the Windows-related `pip` packages and OS dependencies into the `.tar.gz`
- We wanted to explore the pure C++ route, as we are mainly a C++ shop (see the preflight sketch after this list)
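
One direction the C++ route could take (our sketch, not an approach discussed on the call): ship a small preflight binary that verifies the NVIDIA driver and CUDA runtime before the installer unpacks anything, assuming the check is linked against `cudart`.

```cpp
// Sketch of an installer preflight check (assumes it is linked against cudart):
// verify the NVIDIA driver and CUDA runtime exist before unpacking dependencies.
#include <cuda_runtime_api.h>

#include <cstdio>

int main()
{
    int driverVersion = 0;
    int runtimeVersion = 0;

    // Reports 0 (or an error) when no NVIDIA driver is installed, so the
    // installer can pull the driver / CUDA Toolkit instead of assuming it.
    if (cudaDriverGetVersion(&driverVersion) != cudaSuccess || driverVersion == 0)
    {
        std::printf("No CUDA driver found; install the NVIDIA driver first.\n");
        return 1;
    }
    cudaRuntimeGetVersion(&runtimeVersion);
    std::printf("Driver supports CUDA %d; runtime version %d.\n",
                driverVersion, runtimeVersion);
    return 0;
}
```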
### Multiple GPU support
- The biggest complaint among our desktop TensorRT-LLM users is that the converted models are too big for their RAM
- Additionally, the current tutorials only cover single-GPU inference
- Is it possible to run multi-GPU setups on Windows? (we think it's possible through Microsoft MPI? see the sketch after this list)
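
For context, the usual pattern in MPI-based multi-GPU runtimes is one process per GPU, with each rank loading its own shard of a tensor-parallel engine; on Windows this would presumably be launched with Microsoft MPI's `mpiexec`. A minimal sketch of the rank-to-GPU binding (our illustration of the general pattern, not confirmed TensorRT-LLM behavior):

```cpp
// Sketch of the one-process-per-GPU MPI pattern, launched on Windows with
// Microsoft MPI, e.g. `mpiexec -n 2 app.exe` for a 2-GPU tensor-parallel run.
#include <cuda_runtime_api.h>
#include <mpi.h>

#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    int worldSize = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &worldSize);

    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    // Bind each rank to its own GPU; a tensor-parallel engine is built for a
    // matching world size, so rank i would load and run shard i of the model.
    cudaSetDevice(rank % deviceCount);
    std::printf("rank %d/%d bound to GPU %d\n", rank, worldSize, rank % deviceCount);

    // ... each rank deserializes its engine shard and runs inference here ...

    MPI_Finalize();
    return 0;
}
```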
### Precompiled TensorRT-LLM Models?
- We intend to ship a Model Converter as part of our TensorRT-LLM Extension
- Would Nvidia be interested in working together to "precompile" TensorRT-LLM binaries for common hardware types?
- From our experience, users just want to download a binary and run it (vs. importing PyTorch and running a conversion); see the lookup sketch after this list
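
To make the desired UX concrete, a hypothetical sketch of the lookup a "download and run" flow would need: match the detected GPU against a manifest of prebuilt engine artifacts. The model names and URLs below are placeholders, not real artifacts.

```cpp
// Hypothetical sketch of a "download a binary and run it" flow: match the
// detected GPU against a manifest of prebuilt engines. Names/URLs are placeholders.
#include <cuda_runtime_api.h>

#include <cstdio>
#include <map>
#include <string>

int main()
{
    // Placeholder manifest; a real one would be fetched from a model hub.
    std::map<std::string, std::string> const manifest{
        {"NVIDIA GeForce RTX 4090", "https://example.com/llama-7b-sm89.engine"},
        {"NVIDIA RTX A6000", "https://example.com/llama-7b-sm86.engine"},
    };

    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
    {
        std::printf("No CUDA device detected.\n");
        return 1;
    }

    auto const it = manifest.find(prop.name);
    if (it == manifest.end())
    {
        // No prebuilt artifact: fall back to a local conversion step.
        std::printf("No prebuilt engine for %s; converting locally.\n", prop.name);
        return 1;
    }
    std::printf("Download %s for %s.\n", it->second.c_str(), prop.name);
    return 0;
}
```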
### "Universal" Binaries?
- Can the TensorRT-LLM C++ runtime be compiled into a single binary? (except perhaps for the CUDA runtime)
  - See PowerShell script
- Is there a way to compile MPI into the single binary? (so users don't need a separate install)
  - See PowerShell script
- Are compiled binaries GPU-specific? (e.g. 4090s vs A6000s; see the sketch after this list)
  - Hardware-specific: engines need to be built for specific hardware
  - TBD (redacted)
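
Background on why the answer is hardware-specific: TensorRT selects and tunes kernels for a GPU's compute capability (e.g. sm_89 on a 4090 vs sm_86 on an A6000), so an engine built for one SM version cannot generally be assumed to load on another. A small sketch that reads what the local GPU reports (our illustration, not from the call):

```cpp
// Sketch: read the compute capability (SM version) the local GPU reports.
// TensorRT engines embed kernels tuned for a specific SM version, which is
// why an engine built for, say, sm_89 cannot be assumed to load on sm_86.
#include <cuda_runtime_api.h>

#include <cstdio>

int main()
{
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
    {
        std::printf("No CUDA device detected.\n");
        return 1;
    }
    std::printf("%s: compute capability %d.%d (sm_%d%d)\n",
                prop.name, prop.major, prop.minor, prop.major, prop.minor);
    return 0;
}
```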
### Python vs C++
- Will the C++ runtime implement the `kv-cache` in the future?
  - Might be in Triton; it will not be in TensorRT-LLM
  - But this is in Python?
- How will the roadmaps for TensorRT-LLM Python and C++ differ in the future?
  - A lot of functionality is Python-only, with no docs on how the C++ runtime can achieve it
  - Python is needed for the engine-building phase
## Discussion Notes
- Nvidia will release a PowerShell script to help with installation
  - Will let Jan know when it is merged to main
  - Can we check out the WIP PR?
- Consumer GPU limitations
  - Recommend professional cards (RTX series)
  - Consumer cards: a "stop-start" journey