# Triton Documentation
Triton is an inference server that enables teams to deploy any AI model from multiple deep learning and machine learning frameworks, including TensorRT, TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more. Triton supports inference across cloud, data center, edge, and embedded devices on NVIDIA GPUs, x86 and ARM CPUs, or AWS Inferentia. Triton Inference Server delivers optimized performance for many query types, including real-time, batched, ensemble, and audio/video streaming. Triton Inference Server is part of NVIDIA AI Enterprise, a software platform that accelerates the data science pipeline and streamlines the development and deployment of production AI.
This is a tree summary of all Triton documents; the [main documentation](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/getting_started/quickstart.html) is here.
## Triton Deployment Documentation
### Triton Tutorials
<details>
<summary>Details</summary>
These tutorials provide a quick start for beginners through many examples and do not go into too much technical detail.
- [Conceptual Guides](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide):
- [Part 1: Model deployment](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_1-model_deployment#model-configuration)
- [Part 2: Improving resources utilization](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_2-improving_resource_utilization)
- [Part 3: Optimizing triton configuration](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_3-optimizing_triton_configuration)
- [Part 4: Inference acceleration](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_4-inference_acceleration)
- [Part 5: Model ensembles](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_5-Model_Ensembles)
- [Part 6: Building complex pipelines](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_6-building_complex_pipelines)
- [Part 7: Iterative scheduling](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_7-iterative_scheduling)
</details>
<details>
<summary>Discussion</summary>
[Model max_batch_size](https://github.com/triton-inference-server/server/issues/3527)
</details>
### Triton Inference Server
<details>
<summary>User guide</summary>
[Architecture](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md)
[Decoupled models](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/decoupled_models.md)
[Model configuration](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md)
[Model management](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_management.md)
[Model repository](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_repository.md)
[Ragged batching](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/ragged_batching.md)
[Optimization](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/optimization.md)
[Perf analyzer](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/perf_analyzer.md)
[Performance tuning](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/performance_tuning.md)
</details>
<details>
<summary>Customization guide</summary>
[Inference protocols](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/inference_protocols.md)
[Repository Agent](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/repository_agents.md)
</details>
### Triton Client
<details>
<summary>Perf Analyzer</summary>
[Triton Performance Analyzer](https://github.com/triton-inference-server/client/blob/main/src/c++/perf_analyzer/README.md)
</details>
<details>
<summary>GRPC Protocol</summary>
[Ensemble image client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/ensemble_image_client.py)
[GRPC client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/grpc_client.py)
[GRPC byte content client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/grpc_explicit_byte_content_client.py)
[GRPC explicit int8 content client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/grpc_explicit_int8_content_client.py)
[GRPC explicit int content client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/grpc_explicit_int_content_client.py)
[GPRC image client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/grpc_image_client.py)
[Image client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/image_client.py)
[Memory growth test](https://github.com/triton-inference-server/client/blob/main/src/python/examples/memory_growth_test.py)
[Reuse infer objects client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/reuse_infer_objects_client.py)
[Simple GRPC AIO infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_aio_infer_client.py)
[Simple GRPC AIO sequence stream](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_aio_sequence_stream_infer_client.py)
[Simple GRPC async infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_async_infer_client.py)
[Simple GRPC cudashm client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_cudashm_client.py)
[Simple GRPC custom args client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_custom_args_client.py)
[Simple GRPC custom repeat](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_custom_repeat.py)
[Simple GRPC health metadata](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_health_metadata.py)
[Simple GRPC infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_infer_client.py)
[Simple GRPC keepalive client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_keepalive_client.py)
[Simple GRPC model control](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_model_control.py)
[GRPC sequence stream infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_sequence_stream_infer_client.py)
[GRPC sequence sync infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_sequence_sync_infer_client.py)
[Simple GRPC shm client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_shm_client.py)
[Simple GRPC shm string client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_shm_string_client.py)
[Simple GRPC string infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_string_infer_client.py)
</details>
<details>
<summary>HTTP Protocol</summary>
[Simple HTTP aio infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_aio_infer_client.py)
[Simple HTTP async infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_async_infer_client.py)
[Simple HTTP cudashm client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_cudashm_client.py)
[Simple HTTP health metadata](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_health_metadata.py)
[Simple HTTP infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_infer_client.py)
[Simple HTTP model control](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_model_control.py)
[Simple HTTP sequence sync infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_sequence_sync_infer_client.py)
[Simple HTTP shm client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_shm_client.py)
[Simple HTTP shm string client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_shm_string_client.py)
[Simple HTTP string infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_string_infer_client.py)
</details>
### Model Analyzer
<details>
<summary>Details</summary>
[Install](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/install.md)
[CLI](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/cli.md)
[Config](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/config.md#objective)
[Config search](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/config_search.md#quick-search-mode)
[Ensemble quick start](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/ensemble_quick_start.md)
[Kubernetes Deploy](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/kubernetes_deploy.md)
[Launch mode](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/launch_modes.md)
[Metrics](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/metrics.md)
[Reports](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/report.md)
[Multi-model quick start](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/mm_quick_start.md)
[BLS model quick start](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/bls_quick_start.md)
[Checkpointing in Model Analyzer](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/checkpoints.md)
</details>
### Model Navigator
An inference toolkit designed for optimizing and deploying Deep Learning models with a focus on NVIDIA GPUs. The Triton Model Navigator streamlines the process of moving models and pipelines implemented in PyTorch, TensorFlow, and/or ONNX to TensorRT.
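A rough sketch of the workflow, in the spirit of the first example linked below (the exact `model_navigator` API can change between releases, so treat the call below as an assumption and check the linked example for the current form):
```python!
# Sketch only: assumes the nav.torch.optimize entry point used in the
# "optimize torch linear model" example linked below.
import torch
import model_navigator as nav

# Toy PyTorch model and a small dataloader of sample inputs.
model = torch.nn.Linear(5, 7).eval()
dataloader = [torch.randn(2, 5) for _ in range(10)]

# Export, convert (e.g. to ONNX/TensorRT), and verify the model variants.
package = nav.torch.optimize(model=model, dataloader=dataloader)
```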
<details>
<summary>Details</summary>
[Optimize Torch linear model](https://github.com/triton-inference-server/model_navigator/tree/main/examples/01_optimize_torch_linear_model)
[Optimize and verify model](https://github.com/triton-inference-server/model_navigator/tree/main/examples/02_optimize_and_verify_model)
[Python Profile function](https://github.com/triton-inference-server/model_navigator/tree/main/examples/21_nav_profile_python_function)
</details>
### Backends
A Triton backend is the implementation that executes a model. A backend can be a wrapper around a deep-learning framework such as PyTorch, TensorFlow, TensorRT, or ONNX Runtime, or it can be custom C/C++ logic performing any operation (for example, image pre-processing).
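For a concrete picture of what a backend looks like on the Python side, here is a minimal sketch of a Python-backend `model.py` (the tensor names `INPUT0`/`OUTPUT0` are illustrative and must match the model's `config.pbtxt`):
```python!
# Minimal Python backend sketch; tensor names are illustrative.
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args carries the model config, instance kind, model repository path, etc.
        self.output_name = "OUTPUT0"

    def execute(self, requests):
        # Triton may hand us a batch of requests; answer each one.
        responses = []
        for request in requests:
            input0 = pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()
            # Trivial "model": pass the input straight through.
            output0 = pb_utils.Tensor(self.output_name, input0)
            responses.append(pb_utils.InferenceResponse(output_tensors=[output0]))
        return responses

    def finalize(self):
        # Clean up resources if needed.
        pass
```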
<details>
<summary>Python Backend</summary>
[Business logic scripting](https://github.com/triton-inference-server/python_backend?tab=readme-ov-file#business-logic-scripting)
[Add sub](https://github.com/triton-inference-server/python_backend/tree/main/examples/add_sub)
[BLS decoupled](https://github.com/triton-inference-server/python_backend/tree/main/examples/bls_decoupled)
[Custom metrics](https://github.com/triton-inference-server/python_backend/tree/main/examples/custom_metrics)
</details>
<details>
<summary>ONNX Runtime Backend</summary>
[ONNX Runtime with TensorRT EP](https://github.com/triton-inference-server/onnxruntime_backend#onnx-runtime-with-tensorrt-optimization)
[ONNX Runtime with CUDA EP](https://github.com/triton-inference-server/onnxruntime_backend?tab=readme-ov-file#onnx-runtime-with-cuda-execution-provider-optimization)
[ONNX Runtime with OpenVINO](https://github.com/triton-inference-server/onnxruntime_backend?tab=readme-ov-file#onnx-runtime-with-openvino-optimization)
[Other optimization with ONNX](https://github.com/triton-inference-server/onnxruntime_backend?tab=readme-ov-file#other-optimization-options-with-onnx-runtime)
</details>
<details>
<summary>TensorRT Backend</summary>
[Intro to Notebooks](https://github.com/NVIDIA/TensorRT/tree/main/quickstart/IntroNotebooks)
[Semantic segmentation](https://github.com/NVIDIA/TensorRT/tree/main/quickstart/SemanticSegmentation)
[Deploy to Triton](https://github.com/NVIDIA/TensorRT/tree/main/quickstart/deploy_to_triton)
[Quantization tutorial](https://github.com/NVIDIA/TensorRT/blob/main/quickstart/quantization_tutorial/qat-ptq-workflow.ipynb)
[Torch-TensorRT with Triton](https://pytorch.org/TensorRT/tutorials/serving_torch_tensorrt_with_triton.html)
</details>
<details>
<summary>DALI Backend</summary>
[Training to inference](https://github.com/triton-inference-server/dali_backend/blob/main/docs/tutorials/training_to_inference.md)
Examples:
[Dali plugin](https://github.com/triton-inference-server/dali_backend/tree/main/docs/examples/dali_plugin)
[Efficient net](https://github.com/triton-inference-server/dali_backend/tree/main/docs/examples/efficientnet)
[Inception ensemble](https://github.com/triton-inference-server/dali_backend/tree/main/docs/examples/inception_ensemble)
[Perf Analyzer](https://github.com/triton-inference-server/dali_backend/tree/main/docs/examples/perf_analyzer)
[ResNet50 TRT](https://github.com/triton-inference-server/dali_backend/tree/main/docs/examples/resnet50_trt)
</details>
<details>
<summary>PyTorch Backend</summary>
[Docs](https://github.com/triton-inference-server/pytorch_backend)
</details>
<details>
<summary>PaddlePaddle Backend</summary>
[Quick start](https://github.com/triton-inference-server/paddlepaddle_backend)
[Model configuration](https://github.com/triton-inference-server/paddlepaddle_backend/blob/main/docs/model_configuration.md)
[Examples](https://github.com/triton-inference-server/paddlepaddle_backend/tree/main/examples)
</details>
<details>
<summary>FasterTransformer Backend</summary>
[FasterTransformer](https://github.com/NVIDIA/FasterTransformer/)
[FasterTransformer backend](https://github.com/triton-inference-server/fastertransformer_backend)
</details>
### PyTriton
A Flask/FastAPI-like framework designed to streamline the use of NVIDIA's Triton Inference Server within Python environments. PyTriton enables serving Machine Learning models with ease, supporting direct deployment from Python.
<details>
<summary>Quick Start</summary>
[Add sub notebook](https://github.com/triton-inference-server/pytriton/tree/main/examples/add_sub_notebook)
[Hugging face Resnet Pytorch](https://github.com/triton-inference-server/pytriton/tree/main/examples/huggingface_resnet_pytorch)
</details>
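As a rough sketch of the workflow, modelled on the add_sub example linked above (API details are best checked against that example), a plain Python callable is bound to Triton and served directly:
```python!
# Sketch based on the add_sub example above; details may differ by PyTriton version.
import numpy as np
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton


@batch
def add_sub(**inputs):
    a, b = inputs.values()  # batched numpy arrays, one per declared input
    return {"add": a + b, "sub": a - b}


with Triton() as triton:
    triton.bind(
        model_name="AddSub",
        infer_func=add_sub,
        inputs=[
            Tensor(name="a", dtype=np.float32, shape=(-1,)),
            Tensor(name="b", dtype=np.float32, shape=(-1,)),
        ],
        outputs=[
            Tensor(name="add", dtype=np.float32, shape=(-1,)),
            Tensor(name="sub", dtype=np.float32, shape=(-1,)),
        ],
        config=ModelConfig(max_batch_size=128),
    )
    triton.serve()  # blocks; exposes the usual Triton HTTP/gRPC endpoints
```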
# Quick start
### Set up Docker images
Pull the NVIDIA Triton server image:
```bash!
docker pull nvcr.io/nvidia/tritonserver:24.06-py3
```
Pull the NVIDIA Triton client (SDK) image for inference:
```bash!
docker pull nvcr.io/nvidia/tritonserver:24.06-py3-sdk
```
### Triton server
Create and run a container for the Triton server:
```bash!
docker run --gpus=1 --rm -p8000:8000 -p8001:8001 -p8002:8002 -v d/Documents/GitHub/MY-REPO/triton/model_repository:/models nvcr.io/nvidia/tritonserver:24.06-py3 tritonserver --model-repository=/models
```
Using a docker-compose file is recommended:
```yaml!
services:
  triton-server:
    image: nvcr.io/nvidia/tritonserver:24.06-py3
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: tritonserver --model-repository=/models --model-control-mode=explicit --load-model=densenet_onnx
    ports:
      - "8000:8000"
      - "8001:8001"
      - "8002:8002"
    volumes:
      - ../model_repository:/models
    environment:
      - NVIDIA_VISIBLE_DEVICES=1
```
Then enter a bash shell (or attach a shell in VS Code) to follow the logs:
```bash!
docker-compose ps
docker-compose exec triton-server bash
```
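Before moving on to the client, it is worth checking from the host that the server and the example model are ready; a small check against the standard KServe v2 HTTP endpoints (port 8000 as mapped above):
```python!
import requests

BASE = "http://localhost:8000"

# Server-level readiness: returns 200 once the server can accept requests.
print(requests.get(f"{BASE}/v2/health/ready").status_code)

# Model-level readiness and metadata for the densenet_onnx example model.
print(requests.get(f"{BASE}/v2/models/densenet_onnx/ready").status_code)
print(requests.get(f"{BASE}/v2/models/densenet_onnx").json())
```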
### Triton client
Create and run a container for the Triton client:
```bash!
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:24.06-py3-sdk
```
Then, inside the container, run the prebuilt image_client example:
```bash!
/workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg
```
However, we can also write our own client using the `http` or `grpc` protocol:
```python!
import numpy as np
import requests
import json

# Define the server URL
url = "http://localhost:8000/v2/models/densenet_onnx/infer"

# Create input data (example: an array of zeros)
input_data = np.zeros((3, 224, 224), dtype=np.float32)

# Prepare the data in JSON format
inputs = [
    {
        "name": "data_0",
        "shape": input_data.shape,
        "datatype": "FP32",
        "data": input_data.tolist()
    }
]
outputs = [
    {
        "name": "fc6_1"
    }
]
request_payload = {
    "inputs": inputs,
    "outputs": outputs
}

# Send the request to the Triton server
response = requests.post(url, json=request_payload)

# Check the response status
if response.status_code == 200:
    response_json = response.json()
    print(response_json.keys())
    output_data = np.array(response_json["outputs"][0]["data"]).reshape(response_json["outputs"][0]["shape"])
    print("Output Data: ", output_data)
else:
    print("Request failed with status code: ", response.status_code)
    print("Response: ", response.text)
```
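The SDK image also ships the `tritonclient` Python package, so the same request can be made over gRPC (port 8001) without hand-building the JSON payload; a sketch assuming the same `densenet_onnx` input/output names as above:
```python!
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Same dummy input as in the HTTP example above.
input_data = np.zeros((3, 224, 224), dtype=np.float32)

infer_input = grpcclient.InferInput("data_0", list(input_data.shape), "FP32")
infer_input.set_data_from_numpy(input_data)

result = client.infer(
    model_name="densenet_onnx",
    inputs=[infer_input],
    outputs=[grpcclient.InferRequestedOutput("fc6_1")],
)
print(result.as_numpy("fc6_1").shape)
```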
Then run this docker-compose.yml in the client directory:
```yaml!
services:
  triton-client:
    image: nvcr.io/nvidia/tritonserver:24.06-py3-sdk
    network_mode: host
    tty: true
    stdin_open: true
    restart: unless-stopped
    volumes:
      - ../:/workspace/inference/
```
### Model analyzer
Create the output directory first to avoid an error:
```bash!
mkdir output_model/output
```
Run the Triton server with the docker-compose file above. Then run the SDK container, which will connect to the running Triton server automatically:
```bash!
docker run -it --gpus all -v /var/run/docker.sock:/var/run/docker.sock -v d/Documents/GitHub/MY-REPO/triton/model_analyzer:/workspace/model_analyzer --net=host nvcr.io/nvidia/tritonserver:24.06-py3-sdk
```
Now, inside the container, run the Triton Model Analyzer:
```bash!
model-analyzer profile \
--model-repository /workspace/model_analyzer/ \
--profile-models densenet_onnx --triton-launch-mode=remote \
--output-model-repository-path /workspace/model_analyzer/model_output/output \
--export-path /workspace/model_analyzer/profile_results \
--override-output-model-repository
```
If you just want a quick test with a limited search space, add these flags:
```bash!
--run-config-search-max-concurrency 2
--run-config-search-max-model-batch-size 2
--run-config-search-max-instance-count 2
```
## Other saved snippets
**Configuration of medical segmentation models**
```config!
name: "UNet_ImageCHD_128"
backend: "onnxruntime"
dynamic_batching { }
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ -1, 1, 128, 128 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ -1, 8, 128, 128 ]
  }
]
instance_group [
  {
    kind: KIND_GPU
  }
]
optimization {
  execution_accelerators {
    gpu_execution_accelerator: [
      {
        name: "tensorrt"
        parameters {
          key: "precision_mode"
          value: "FP16"
        }
        parameters {
          key: "max_workspace_size_bytes"
          value: "4294967296"
        }
        parameters {
          key: "trt_engine_cache_enable"
          value: "true"
        }
        parameters {
          key: "trt_engine_cache_path"
          value: "/models/UNet_ImageCHD_128/1"
        }
      }
    ]
  }
}
```
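As a usage sketch for this configuration (the tensor names and shapes come from the config above; the server URL is assumed to be the default HTTP port):
```python!
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# A batch of 4 single-channel 128x128 slices, matching dims [-1, 1, 128, 128].
slices = np.random.rand(4, 1, 128, 128).astype(np.float32)

inp = httpclient.InferInput("input", list(slices.shape), "FP32")
inp.set_data_from_numpy(slices)

result = client.infer(
    model_name="UNet_ImageCHD_128",
    inputs=[inp],
    outputs=[httpclient.InferRequestedOutput("output")],
)

logits = result.as_numpy("output")   # (4, 8, 128, 128): presumably 8 segmentation classes
print(logits.argmax(axis=1).shape)   # per-pixel class map, (4, 128, 128)
```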
**Perf analyzer**
```bash!
perf_analyzer -m text_reg_batch -b 2 --shape input.1:1,32,100 --concurrency-range 2:16:2 --percentile=95
```
**Perf Analyzer results with dynamic batching, an instance group of 2, and TensorRT acceleration on an RTX 3060 GPU**
```bash!
*** Measurement Settings ***
  Batch size: 2
  Service Kind: TRITON
  Using "time_windows" mode for stabilization
  Stabilizing using p95 latency
  Measurement window: 5000 msec
  Latency limit: 0 msec
  Concurrency limit: 16 concurrent requests
  Using synchronous calls for inference

Request concurrency: 2
  Client:
    Request count: 1191
    Throughput: 132.291 infer/sec
    p50 latency: 30573 usec
    p90 latency: 35938 usec
    p95 latency: 37684 usec
    p99 latency: 40669 usec
    Avg HTTP time: 30191 usec (send/recv 143 usec + response wait 30048 usec)
  Server:
    Inference count: 2384
    Execution count: 1131
    Successful request count: 1192
    Avg request latency: 29519 usec (overhead 78 usec + queue 228 usec + compute input 101 usec + compute infer 29073 usec + compute output 38 usec)

Request concurrency: 4
  Client:
    Request count: 1308
    Throughput: 145.275 infer/sec
    p50 latency: 52808 usec
    p90 latency: 67876 usec
    p95 latency: 73463 usec
    p99 latency: 83525 usec
    Avg HTTP time: 64641 usec (send/recv 158 usec + response wait 64483 usec)
  Server:
    Inference count: 2616
    Execution count: 947
    Successful request count: 1308
    Avg request latency: 63791 usec (overhead 115 usec + queue 15192 usec + compute input 210 usec + compute infer 48223 usec + compute output 50 usec)

Request concurrency: 6
  Client:
    Request count: 1682
    Throughput: 186.767 infer/sec
    p50 latency: 63495 usec
    p90 latency: 78766 usec
    p95 latency: 82845 usec
    p99 latency: 91286 usec
    Avg HTTP time: 64136 usec (send/recv 165 usec + response wait 63971 usec)
  Server:
    Inference count: 3364
    Execution count: 846
    Successful request count: 1682
    Avg request latency: 63154 usec (overhead 147 usec + queue 19362 usec + compute input 315 usec + compute infer 43270 usec + compute output 59 usec)

Request concurrency: 8
  Client:
    Request count: 1995
    Throughput: 221.592 infer/sec
    p50 latency: 71015 usec
    p90 latency: 87392 usec
    p95 latency: 93610 usec
    p99 latency: 104270 usec
    Avg HTTP time: 72134 usec (send/recv 153 usec + response wait 71981 usec)
  Server:
    Inference count: 3990
    Execution count: 747
    Successful request count: 1995
    Avg request latency: 71172 usec (overhead 179 usec + queue 21863 usec + compute input 364 usec + compute infer 48702 usec + compute output 63 usec)

Request concurrency: 10
  Client:
    Request count: 2237
    Throughput: 248.444 infer/sec
    p50 latency: 79531 usec
    p90 latency: 98043 usec
    p95 latency: 103645 usec
    p99 latency: 113315 usec
    Avg HTTP time: 80393 usec (send/recv 154 usec + response wait 80239 usec)
  Server:
    Inference count: 4474
    Execution count: 680
    Successful request count: 2237
    Avg request latency: 79428 usec (overhead 195 usec + queue 25674 usec + compute input 443 usec + compute infer 53053 usec + compute output 61 usec)

Request concurrency: 12
  Client:
    Request count: 2484
    Throughput: 275.839 infer/sec
    p50 latency: 85227 usec
    p90 latency: 102343 usec
    p95 latency: 108260 usec
    p99 latency: 119996 usec
    Avg HTTP time: 86938 usec (send/recv 145 usec + response wait 86793 usec)
  Server:
    Inference count: 4968
    Execution count: 624
    Successful request count: 2484
    Avg request latency: 85997 usec (overhead 228 usec + queue 28372 usec + compute input 458 usec + compute infer 56875 usec + compute output 63 usec)

Request concurrency: 14
  Client:
    Request count: 2472
    Throughput: 274.518 infer/sec
    p50 latency: 105608 usec
    p90 latency: 122302 usec
    p95 latency: 125339 usec
    p99 latency: 132344 usec
    Avg HTTP time: 101783 usec (send/recv 148 usec + response wait 101635 usec)
  Server:
    Inference count: 4944
    Execution count: 623
    Successful request count: 2472
    Avg request latency: 100887 usec (overhead 250 usec + queue 43205 usec + compute input 422 usec + compute infer 56942 usec + compute output 67 usec)

Request concurrency: 16
  Client:
    Request count: 2476
    Throughput: 274.966 infer/sec
    p50 latency: 115776 usec
    p90 latency: 127253 usec
    p95 latency: 130799 usec
    p99 latency: 137202 usec
    Avg HTTP time: 116403 usec (send/recv 145 usec + response wait 116258 usec)
  Server:
    Inference count: 4952
    Execution count: 619
    Successful request count: 2476
    Avg request latency: 115387 usec (overhead 205 usec + queue 57256 usec + compute input 430 usec + compute infer 57429 usec + compute output 66 usec)

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 2, throughput: 132.291 infer/sec, latency 37684 usec
Concurrency: 4, throughput: 145.275 infer/sec, latency 73463 usec
Concurrency: 6, throughput: 186.767 infer/sec, latency 82845 usec
Concurrency: 8, throughput: 221.592 infer/sec, latency 93610 usec
Concurrency: 10, throughput: 248.444 infer/sec, latency 103645 usec
Concurrency: 12, throughput: 275.839 infer/sec, latency 108260 usec
Concurrency: 14, throughput: 274.518 infer/sec, latency 125339 usec
Concurrency: 16, throughput: 274.966 infer/sec, latency 130799 usec
```