# Triton Documentation

Triton is an inference server that enables teams to deploy AI models from multiple deep learning and machine learning frameworks, including TensorRT, TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more. Triton supports inference across cloud, data center, edge, and embedded devices on NVIDIA GPUs, x86 and ARM CPUs, or AWS Inferentia. Triton Inference Server delivers optimized performance for many query types, including real-time, batched, ensemble, and audio/video streaming workloads. Triton Inference Server is part of NVIDIA AI Enterprise, a software platform that accelerates the data science pipeline and streamlines the development and deployment of production AI.

This is a tree summary of all Triton documents; the [main documentation](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/getting_started/quickstart.html) is here.

## Triton Deployment Documentation

### Triton Tutorials

<details>
<summary>Details</summary>

These tutorials provide a quick-start path for beginners through many examples and do not go into much technical detail.

- [Conceptual Guides](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide):
  - [Part 1: Model deployment](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_1-model_deployment#model-configuration)
  - [Part 2: Improving resource utilization](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_2-improving_resource_utilization)
  - [Part 3: Optimizing Triton configuration](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_3-optimizing_triton_configuration)
  - [Part 4: Inference acceleration](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_4-inference_acceleration)
  - [Part 5: Model ensembles](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_5-Model_Ensembles)
  - [Part 6: Building complex pipelines](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_5-Model_Ensembles)
  - [Part 7: Iterative scheduling](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_7-iterative_scheduling)

</details>

<details>
<summary>Discussion</summary>

- [Model max_batch_size](https://github.com/triton-inference-server/server/issues/3527)

</details>

### Triton Inference Server

<details>
<summary>User guide</summary>

- [Architecture](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md)
- [Decoupled models](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/decoupled_models.md)
- [Model configuration](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md)
- [Model management](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_management.md)
- [Model repository](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_repository.md)
- [Ragged batching](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/ragged_batching.md)
- [Optimization](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/optimization.md)
- [Perf analyzer](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/perf_analyzer.md)
- [Performance tuning](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/performance_tuning.md)

</details>

<details>
<summary>Customization guide</summary>

- [Inference protocols](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/inference_protocols.md)
- [Repository Agent](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/repository_agents.md)

</details>
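The model management and inference protocol guides above cover explicit model control, which the quick start below relies on (`--model-control-mode=explicit`). As a minimal sketch of the model-repository extension of the HTTP/REST API, assuming the default port 8000 and the `densenet_onnx` model used in the quick start, loading and unloading a model from Python might look like this:

```python!
import requests

TRITON_URL = "http://localhost:8000"  # assumed default HTTP port from the quick start
MODEL = "densenet_onnx"               # model name used in the quick start below

# List the models in the repository and their current load state
print(requests.post(f"{TRITON_URL}/v2/repository/index").json())

# Explicitly load, then unload, the model (the server must be started with
# --model-control-mode=explicit for these calls to be accepted)
requests.post(f"{TRITON_URL}/v2/repository/models/{MODEL}/load").raise_for_status()
requests.post(f"{TRITON_URL}/v2/repository/models/{MODEL}/unload").raise_for_status()
```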
### Triton Client

<details>
<summary>Perf Analyzer</summary>

- [Triton Performance Analyzer](https://github.com/triton-inference-server/client/blob/main/src/c++/perf_analyzer/README.md)

</details>

<details>
<summary>GRPC Protocol</summary>

- [Ensemble image client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/ensemble_image_client.py)
- [GRPC client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/grpc_client.py)
- [GRPC byte content client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/grpc_explicit_byte_content_client.py)
- [GRPC explicit int8 content client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/grpc_explicit_int8_content_client.py)
- [GRPC explicit int content client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/grpc_explicit_int_content_client.py)
- [GRPC image client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/grpc_image_client.py)
- [Image client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/image_client.py)
- [Memory growth test](https://github.com/triton-inference-server/client/blob/main/src/python/examples/memory_growth_test.py)
- [Reuse infer objects client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/reuse_infer_objects_client.py)
- [Simple GRPC AIO infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_aio_infer_client.py)
- [Simple GRPC AIO sequence stream](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_aio_sequence_stream_infer_client.py)
- [Simple GRPC async infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_async_infer_client.py)
- [Simple GRPC cudashm client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_cudashm_client.py)
- [Simple GRPC custom args client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_custom_args_client.py)
- [Simple GRPC custom repeat](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_custom_repeat.py)
- [Simple GRPC health metadata](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_health_metadata.py)
- [Simple GRPC infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_infer_client.py)
- [Simple GRPC keepalive client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_keepalive_client.py)
- [Simple GRPC model control](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_model_control.py)
- [GRPC sequence stream infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_sequence_stream_infer_client.py)
- [GRPC sequence sync infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_sequence_sync_infer_client.py)
- [Simple GRPC shm client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_shm_client.py)
- [Simple GRPC shm string client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_shm_string_client.py)
- [Simple GRPC string infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_string_infer_client.py)

</details>

<details>
<summary>HTTP Protocol</summary>

- [Simple HTTP AIO infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_aio_infer_client.py)
- [Simple HTTP async infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_async_infer_client.py)
- [Simple HTTP cudashm client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_cudashm_client.py)
- [Simple HTTP health metadata](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_health_metadata.py)
- [Simple HTTP infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_infer_client.py)
- [Simple HTTP model control](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_model_control.py)
- [Simple HTTP sequence sync infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_sequence_sync_infer_client.py)
- [Simple HTTP shm client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_shm_client.py)
- [Simple HTTP shm string client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_shm_string_client.py)
- [Simple HTTP string infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_string_infer_client.py)

</details>

### Model Analyzer

<details>
<summary>Details</summary>

- [Install](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/install.md)
- [CLI](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/cli.md)
- [Config](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/config.md#objective)
- [Config search](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/config_search.md#quick-search-mode)
- [Ensemble quick start](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/ensemble_quick_start.md)
- [Kubernetes deploy](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/kubernetes_deploy.md)
- [Launch modes](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/launch_modes.md)
- [Metrics](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/metrics.md)
- [Reports](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/report.md)
- [Multi-model quick start](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/mm_quick_start.md)
- [BLS model quick start](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/bls_quick_start.md)
- [Checkpointing in Model Analyzer](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/checkpoints.md)

</details>

### Model Navigator

An inference toolkit designed for optimizing and deploying deep learning models with a focus on NVIDIA GPUs. The Triton Model Navigator streamlines the process of moving models and pipelines implemented in PyTorch, TensorFlow, and/or ONNX to TensorRT.
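As a rough sketch of that workflow, based on the "Optimize Torch linear model" example linked below (the model, the dataloader, and the exact argument names are illustrative and may vary between Model Navigator releases), optimizing a small PyTorch model might look like:

```python!
import torch
import model_navigator as nav

# A toy PyTorch model and a list of sample inputs used as the dataloader
# (both are illustrative placeholders, not part of this document's examples)
model = torch.nn.Linear(5, 7).eval()
dataloader = [torch.randn(1, 5) for _ in range(10)]

# Run the supported export/conversion paths (e.g. ONNX, TensorRT) and keep the
# resulting optimized package for later deployment on Triton
package = nav.torch.optimize(model=model, dataloader=dataloader)
```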
<details>
<summary>Details</summary>

- [Optimize Torch linear model](https://github.com/triton-inference-server/model_navigator/tree/main/examples/01_optimize_torch_linear_model)
- [Optimize and verify model](https://github.com/triton-inference-server/model_navigator/tree/main/examples/02_optimize_and_verify_model)
- [Python profile function](https://github.com/triton-inference-server/model_navigator/tree/main/examples/21_nav_profile_python_function)

</details>

### Backends

A Triton backend is the implementation that executes a model. A backend can be a wrapper around a deep learning framework such as PyTorch, TensorFlow, TensorRT, or ONNX Runtime, or it can be custom C/C++ logic performing any operation (for example, image pre-processing).

<details>
<summary>Python Backend</summary>

- [Business logic scripting](https://github.com/triton-inference-server/python_backend?tab=readme-ov-file#business-logic-scripting)
- [Add sub](https://github.com/triton-inference-server/python_backend/tree/main/examples/add_sub)
- [BLS decoupled](https://github.com/triton-inference-server/python_backend/tree/main/examples/bls_decoupled)
- [Custom metrics](https://github.com/triton-inference-server/python_backend/tree/main/examples/custom_metrics)

A minimal `model.py` sketch appears after this backends section.

</details>

<details>
<summary>ONNX Runtime Backend</summary>

- [ONNX Runtime with TensorRT EP](https://github.com/triton-inference-server/onnxruntime_backend#onnx-runtime-with-tensorrt-optimization)
- [ONNX Runtime with CUDA EP](https://github.com/triton-inference-server/onnxruntime_backend?tab=readme-ov-file#onnx-runtime-with-cuda-execution-provider-optimization)
- [ONNX Runtime with OpenVINO](https://github.com/triton-inference-server/onnxruntime_backend?tab=readme-ov-file#onnx-runtime-with-openvino-optimization)
- [Other optimization options with ONNX Runtime](https://github.com/triton-inference-server/onnxruntime_backend?tab=readme-ov-file#other-optimization-options-with-onnx-runtime)

</details>

<details>
<summary>TensorRT Backend</summary>

- [Intro notebooks](https://github.com/NVIDIA/TensorRT/tree/main/quickstart/IntroNotebooks)
- [Semantic segmentation](https://github.com/NVIDIA/TensorRT/tree/main/quickstart/SemanticSegmentation)
- [Deploy to Triton](https://github.com/NVIDIA/TensorRT/tree/main/quickstart/deploy_to_triton)
- [Quantization tutorial](https://github.com/NVIDIA/TensorRT/blob/main/quickstart/quantization_tutorial/qat-ptq-workflow.ipynb)
- [Torch-TensorRT with Triton](https://pytorch.org/TensorRT/tutorials/serving_torch_tensorrt_with_triton.html)

</details>

<details>
<summary>DALI Backend</summary>

- [Training to inference](https://github.com/triton-inference-server/dali_backend/blob/main/docs/tutorials/training_to_inference.md)

Examples:

- [DALI plugin](https://github.com/triton-inference-server/dali_backend/tree/main/docs/examples/dali_plugin)
- [EfficientNet](https://github.com/triton-inference-server/dali_backend/tree/main/docs/examples/efficientnet)
- [Inception ensemble](https://github.com/triton-inference-server/dali_backend/tree/main/docs/examples/inception_ensemble)
- [Perf Analyzer](https://github.com/triton-inference-server/dali_backend/tree/main/docs/examples/perf_analyzer)
- [ResNet50 TRT](https://github.com/triton-inference-server/dali_backend/tree/main/docs/examples/resnet50_trt)

</details>

<details>
<summary>PyTorch Backend</summary>

- [Docs](https://github.com/triton-inference-server/pytorch_backend)

</details>

<details>
<summary>PaddlePaddle Backend</summary>

- [Quick start](https://github.com/triton-inference-server/paddlepaddle_backend)
- [Model configuration](https://github.com/triton-inference-server/paddlepaddle_backend/blob/main/docs/model_configuration.md)
- [Examples](https://github.com/triton-inference-server/paddlepaddle_backend/tree/main/examples)

</details>

<details>
<summary>FasterTransformer Backend</summary>

- [FasterTransformer](https://github.com/NVIDIA/FasterTransformer/)
- [FasterTransformer backend](https://github.com/triton-inference-server/fastertransformer_backend)

</details>
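The Python backend listed above executes models written directly in Python. Modeled loosely on the linked add_sub example, a minimal `model.py` might look like the sketch below (tensor names are illustrative, and `triton_python_backend_utils` is only importable inside the Triton Python backend runtime, not from pip):

```python!
import numpy as np
# Provided by the Triton Python backend at runtime; not a pip package
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Minimal add/sub model: two inputs in, their sum and difference out."""

    def execute(self, requests):
        responses = []
        for request in requests:
            # Read the request tensors as numpy arrays
            in0 = pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()
            in1 = pb_utils.get_input_tensor_by_name(request, "INPUT1").as_numpy()

            # Build the output tensors and wrap them in a response
            out0 = pb_utils.Tensor("OUTPUT0", (in0 + in1).astype(np.float32))
            out1 = pb_utils.Tensor("OUTPUT1", (in0 - in1).astype(np.float32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out0, out1]))
        return responses
```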
### PyTriton

A Flask/FastAPI-like framework designed to streamline the use of NVIDIA's Triton Inference Server within Python environments. PyTriton makes it easy to serve machine learning models directly from Python.

<details>
<summary>Quick Start</summary>

- [Add sub notebook](https://github.com/triton-inference-server/pytriton/tree/main/examples/add_sub_notebook)
- [Hugging Face ResNet PyTorch](https://github.com/triton-inference-server/pytriton/tree/main/examples/huggingface_resnet_pytorch)

</details>

# Quick start

### Setup docker images

Pull the NVIDIA Triton server image:

```bash!
docker pull nvcr.io/nvidia/tritonserver:24.06-py3
```

Pull the NVIDIA Triton client image for inference:

```bash!
docker pull nvcr.io/nvidia/tritonserver:24.06-py3-sdk
```

### Triton server

Create and run a container for the Triton server:

```bash!
docker run --gpus=1 --rm -p8000:8000 -p8001:8001 -p8002:8002 -v d/Documents/GitHub/MY-REPO/triton/model_repository:/models nvcr.io/nvidia/tritonserver:24.06-py3 tritonserver --model-repository=/models
```

Using a docker-compose file is recommended:

```yaml!
services:
  triton-server:
    image: nvcr.io/nvidia/tritonserver:24.06-py3
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: tritonserver --model-repository=/models --model-control-mode=explicit --load-model=densenet_onnx
    ports:
      - "8000:8000"
      - "8001:8001"
      - "8002:8002"
    volumes:
      - ../model_repository:/models
    environment:
      - NVIDIA_VISIBLE_DEVICES=1
```

Then open a shell in the container (or attach a shell in VS Code) to follow the logs:

```bash!
docker-compose ps
docker-compose exec triton-server bash
```

### Triton client

Create and run a container for the Triton client:

```bash!
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:24.06-py3-sdk
```

Then, inside the container, run the prebuilt image_client example:

```bash!
/workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg
```

However, we can also write our own client over the `http` or `grpc` protocol:

```python!
import numpy as np
import requests

# Define the server URL
url = "http://localhost:8000/v2/models/densenet_onnx/infer"

# Create input data (example: an array of zeros)
input_data = np.zeros((3, 224, 224), dtype=np.float32)

# Prepare the request in JSON format
inputs = [
    {
        "name": "data_0",
        "shape": list(input_data.shape),
        "datatype": "FP32",
        "data": input_data.tolist()
    }
]
outputs = [
    {
        "name": "fc6_1"
    }
]
request_payload = {
    "inputs": inputs,
    "outputs": outputs
}

# Send the request to the Triton server
response = requests.post(url, json=request_payload)

# Check the response status
if response.status_code == 200:
    response_json = response.json()
    print(response_json.keys())
    output_data = np.array(response_json["outputs"][0]["data"]).reshape(
        response_json["outputs"][0]["shape"]
    )
    print("Output Data: ", output_data)
else:
    print("Request failed with status code: ", response.status_code)
    print("Response: ", response.text)
```
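A gRPC equivalent can be written with the `tritonclient` package (already present in the SDK container, or installable via `pip install tritonclient[grpc]`). This is a minimal sketch assuming the same densenet_onnx model, the same tensor names as the HTTP example above, and the default gRPC port 8001:

```python!
import numpy as np
import tritonclient.grpc as grpcclient

# Connect to the Triton gRPC endpoint (port 8001 in the quick-start container)
client = grpcclient.InferenceServerClient(url="localhost:8001")

# Build the input tensor; name, shape, and datatype follow the HTTP example above
input_data = np.zeros((3, 224, 224), dtype=np.float32)
infer_input = grpcclient.InferInput("data_0", list(input_data.shape), "FP32")
infer_input.set_data_from_numpy(input_data)

# Request the same output tensor as the HTTP example
infer_output = grpcclient.InferRequestedOutput("fc6_1")

# Run inference and read the result back as a numpy array
result = client.infer(model_name="densenet_onnx", inputs=[infer_input], outputs=[infer_output])
print("Output shape:", result.as_numpy("fc6_1").shape)
```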
Then run this docker-compose.yml in the client directory:

```yaml!
services:
  triton-client:
    image: nvcr.io/nvidia/tritonserver:24.06-py3-sdk
    network_mode: host
    tty: true
    stdin_open: true
    restart: unless-stopped
    volumes:
      - ../:/workspace/inference/
```

### Model analyzer

Create the output directory first to avoid an error:

```bash!
mkdir output_model/output
```

Run the Triton server with the docker-compose file above, then run the SDK container so it can connect to the running server:

```bash!
docker run -it --gpus all -v /var/run/docker.sock:/var/run/docker.sock -v d/Documents/GitHub/MY-REPO/triton/model_analyzer:/workspace/model_analyzer --net=host nvcr.io/nvidia/tritonserver:24.06-py3-sdk
```

Now, inside the container, run Triton Model Analyzer:

```bash!
model-analyzer profile \
    --model-repository /workspace/model_analyzer/ \
    --profile-models densenet_onnx --triton-launch-mode=remote \
    --output-model-repository-path /workspace/model_analyzer/model_output/output \
    --export-path /workspace/model_analyzer/profile_results \
    --override-output-model-repository
```

If you just want a quick test with a limited search space, add these flags:

```bash!
--run-config-search-max-concurrency 2 --run-config-search-max-model-batch-size 2 --run-config-search-max-instance-count 2
```

## Other saved snippets

**Configuration of a medical segmentation model**

```config!
name: "UNet_ImageCHD_128"
backend: "onnxruntime"
dynamic_batching { }
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ -1, 1, 128, 128 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ -1, 8, 128, 128 ]
  }
]
instance_group [
  {
    kind: KIND_GPU
  }
]
optimization {
  execution_accelerators {
    gpu_execution_accelerator: [
      {
        name: "tensorrt"
        parameters { key: "precision_mode" value: "FP16" }
        parameters { key: "max_workspace_size_bytes" value: "4294967296" }
        parameters { key: "trt_engine_cache_enable" value: "true" }
        parameters { key: "trt_engine_cache_path" value: "/models/UNet_ImageCHD_128/1" }
      }
    ]
  }
}
```
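To relate this configuration to an actual request, here is a minimal sketch using the `tritonclient` HTTP API. The input/output names, dtypes, and dims come from the config above; the server URL and the zero-valued batch are assumptions:

```python!
import numpy as np
import tritonclient.http as httpclient

# Quick-start default HTTP endpoint (assumption)
client = httpclient.InferenceServerClient(url="localhost:8000")

# A batch of two 1x128x128 images; names and shapes follow the config above
batch = np.zeros((2, 1, 128, 128), dtype=np.float32)
infer_input = httpclient.InferInput("input", list(batch.shape), "FP32")
infer_input.set_data_from_numpy(batch)

result = client.infer(
    model_name="UNet_ImageCHD_128",
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("output")],
)

# Expect shape (2, 8, 128, 128) given the output dims in the config
print(result.as_numpy("output").shape)
```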
**Perf analyzer**

```bash!
perf_analyzer -m text_reg_batch -b 2 --shape input.1:1,32,100 --concurrency-range 2:16:2 --percentile=95
```

**Perf with Dynamic Batching, Instance group 2, TensorRT acceleration on GPU RTX 3060**

```bash!
*** Measurement Settings ***
  Batch size: 2
  Service Kind: TRITON
  Using "time_windows" mode for stabilization
  Stabilizing using p95 latency
  Measurement window: 5000 msec
  Latency limit: 0 msec
  Concurrency limit: 16 concurrent requests
  Using synchronous calls for inference

Request concurrency: 2
  Client:
    Request count: 1191
    Throughput: 132.291 infer/sec
    p50 latency: 30573 usec
    p90 latency: 35938 usec
    p95 latency: 37684 usec
    p99 latency: 40669 usec
    Avg HTTP time: 30191 usec (send/recv 143 usec + response wait 30048 usec)
  Server:
    Inference count: 2384
    Execution count: 1131
    Successful request count: 1192
    Avg request latency: 29519 usec (overhead 78 usec + queue 228 usec + compute input 101 usec + compute infer 29073 usec + compute output 38 usec)

Request concurrency: 4
  Client:
    Request count: 1308
    Throughput: 145.275 infer/sec
    p50 latency: 52808 usec
    p90 latency: 67876 usec
    p95 latency: 73463 usec
    p99 latency: 83525 usec
    Avg HTTP time: 64641 usec (send/recv 158 usec + response wait 64483 usec)
  Server:
    Inference count: 2616
    Execution count: 947
    Successful request count: 1308
    Avg request latency: 63791 usec (overhead 115 usec + queue 15192 usec + compute input 210 usec + compute infer 48223 usec + compute output 50 usec)

Request concurrency: 6
  Client:
    Request count: 1682
    Throughput: 186.767 infer/sec
    p50 latency: 63495 usec
    p90 latency: 78766 usec
    p95 latency: 82845 usec
    p99 latency: 91286 usec
    Avg HTTP time: 64136 usec (send/recv 165 usec + response wait 63971 usec)
  Server:
    Inference count: 3364
    Execution count: 846
    Successful request count: 1682
    Avg request latency: 63154 usec (overhead 147 usec + queue 19362 usec + compute input 315 usec + compute infer 43270 usec + compute output 59 usec)

Request concurrency: 8
  Client:
    Request count: 1995
    Throughput: 221.592 infer/sec
    p50 latency: 71015 usec
    p90 latency: 87392 usec
    p95 latency: 93610 usec
    p99 latency: 104270 usec
    Avg HTTP time: 72134 usec (send/recv 153 usec + response wait 71981 usec)
  Server:
    Inference count: 3990
    Execution count: 747
    Successful request count: 1995
    Avg request latency: 71172 usec (overhead 179 usec + queue 21863 usec + compute input 364 usec + compute infer 48702 usec + compute output 63 usec)

Request concurrency: 10
  Client:
    Request count: 2237
    Throughput: 248.444 infer/sec
    p50 latency: 79531 usec
    p90 latency: 98043 usec
    p95 latency: 103645 usec
    p99 latency: 113315 usec
    Avg HTTP time: 80393 usec (send/recv 154 usec + response wait 80239 usec)
  Server:
    Inference count: 4474
    Execution count: 680
    Successful request count: 2237
    Avg request latency: 79428 usec (overhead 195 usec + queue 25674 usec + compute input 443 usec + compute infer 53053 usec + compute output 61 usec)

Request concurrency: 12
  Client:
    Request count: 2484
    Throughput: 275.839 infer/sec
    p50 latency: 85227 usec
    p90 latency: 102343 usec
    p95 latency: 108260 usec
    p99 latency: 119996 usec
    Avg HTTP time: 86938 usec (send/recv 145 usec + response wait 86793 usec)
  Server:
    Inference count: 4968
    Execution count: 624
    Successful request count: 2484
    Avg request latency: 85997 usec (overhead 228 usec + queue 28372 usec + compute input 458 usec + compute infer 56875 usec + compute output 63 usec)

Request concurrency: 14
  Client:
    Request count: 2472
    Throughput: 274.518 infer/sec
    p50 latency: 105608 usec
    p90 latency: 122302 usec
    p95 latency: 125339 usec
    p99 latency: 132344 usec
    Avg HTTP time: 101783 usec (send/recv 148 usec + response wait 101635 usec)
  Server:
    Inference count: 4944
    Execution count: 623
    Successful request count: 2472
    Avg request latency: 100887 usec (overhead 250 usec + queue 43205 usec + compute input 422 usec + compute infer 56942 usec + compute output 67 usec)
Request concurrency: 16
  Client:
    Request count: 2476
    Throughput: 274.966 infer/sec
    p50 latency: 115776 usec
    p90 latency: 127253 usec
    p95 latency: 130799 usec
    p99 latency: 137202 usec
    Avg HTTP time: 116403 usec (send/recv 145 usec + response wait 116258 usec)
  Server:
    Inference count: 4952
    Execution count: 619
    Successful request count: 2476
    Avg request latency: 115387 usec (overhead 205 usec + queue 57256 usec + compute input 430 usec + compute infer 57429 usec + compute output 66 usec)

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 2, throughput: 132.291 infer/sec, latency 37684 usec
Concurrency: 4, throughput: 145.275 infer/sec, latency 73463 usec
Concurrency: 6, throughput: 186.767 infer/sec, latency 82845 usec
Concurrency: 8, throughput: 221.592 infer/sec, latency 93610 usec
Concurrency: 10, throughput: 248.444 infer/sec, latency 103645 usec
Concurrency: 12, throughput: 275.839 infer/sec, latency 108260 usec
Concurrency: 14, throughput: 274.518 infer/sec, latency 125339 usec
Concurrency: 16, throughput: 274.966 infer/sec, latency 130799 usec
```