Internship - Bao L. Q. Nguyen
# Triton Documentation

Triton is an inference server that enables teams to deploy any AI model from multiple deep learning and machine learning frameworks, including TensorRT, TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more. Triton supports inference across cloud, data center, edge, and embedded devices on NVIDIA GPUs, x86 and ARM CPUs, or AWS Inferentia. Triton Inference Server delivers optimized performance for many query types, including real-time, batched, ensembles, and audio/video streaming. Triton Inference Server is part of NVIDIA AI Enterprise, a software platform that accelerates the data science pipeline and streamlines the development and deployment of production AI.

This note is a tree summary of the Triton documents; the [main documentation](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/getting_started/quickstart.html) is hosted separately.

## Triton Deployment Documentation

### Triton Tutorials

<details>
<summary>Details</summary>

These tutorials give beginners a quick start with Triton through many examples and do not go into much technical detail.

- [Conceptual Guides](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide):
  - [Part 1: Model deployment](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_1-model_deployment#model-configuration)
  - [Part 2: Improving resource utilization](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_2-improving_resource_utilization)
  - [Part 3: Optimizing Triton configuration](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_3-optimizing_triton_configuration)
  - [Part 4: Inference acceleration](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_4-inference_acceleration)
  - [Part 5: Model ensembles](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_5-Model_Ensembles)
  - [Part 6: Building complex pipelines](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_6-building_complex_pipelines)
  - [Part 7: Iterative scheduling](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_7-iterative_scheduling)

</details>

<details>
<summary>Discussion</summary>

[Model max_batch_size](https://github.com/triton-inference-server/server/issues/3527)

</details>

### Triton Inference Server

<details>
<summary>User guide</summary>

[Architecture](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md)
[Decoupled models](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/decoupled_models.md)
[Model configuration](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md)
[Model management](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_management.md)
[Model repository](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_repository.md)
[Ragged batching](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/ragged_batching.md)
[Optimization](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/optimization.md)
[Perf analyzer](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/perf_analyzer.md)
[Performance tuning](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/performance_tuning.md)

</details>

<details>
<summary>Customization guide</summary>

[Inference protocols](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/inference_protocols.md)
[Repository agent](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/repository_agents.md)

</details>

### Triton Client

<details>
<summary>Perf Analyzer</summary>

[Triton Performance Analyzer](https://github.com/triton-inference-server/client/blob/main/src/c++/perf_analyzer/README.md)

</details>

<details>
<summary>GRPC Protocol</summary>

[Ensemble image client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/ensemble_image_client.py)
[GRPC client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/grpc_client.py)
[GRPC explicit byte content client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/grpc_explicit_byte_content_client.py)
[GRPC explicit int8 content client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/grpc_explicit_int8_content_client.py)
[GRPC explicit int content client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/grpc_explicit_int_content_client.py)
[GRPC image client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/grpc_image_client.py)
[Image client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/image_client.py)
[Memory growth test](https://github.com/triton-inference-server/client/blob/main/src/python/examples/memory_growth_test.py)
[Reuse infer objects client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/reuse_infer_objects_client.py)
[Simple GRPC AIO infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_aio_infer_client.py)
[Simple GRPC AIO sequence stream](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_aio_sequence_stream_infer_client.py)
[Simple GRPC async infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_async_infer_client.py)
[Simple GRPC cudashm client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_cudashm_client.py)
[Simple GRPC custom args client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_custom_args_client.py)
[Simple GRPC custom repeat](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_custom_repeat.py)
[Simple GRPC health metadata](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_health_metadata.py)
[Simple GRPC infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_infer_client.py)
[Simple GRPC keepalive client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_keepalive_client.py)
[Simple GRPC model control](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_model_control.py)
[GRPC sequence stream infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_sequence_stream_infer_client.py)
[GRPC sequence sync infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_sequence_sync_infer_client.py)
[Simple GRPC shm client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_shm_client.py)
[Simple GRPC shm string client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_shm_string_client.py)
[Simple GRPC string infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_grpc_string_infer_client.py)

</details>

<details>
<summary>HTTP Protocol</summary>

[Simple HTTP aio infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_aio_infer_client.py)
[Simple HTTP async infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_async_infer_client.py)
[Simple HTTP cudashm client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_cudashm_client.py)
[Simple HTTP health metadata](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_health_metadata.py)
[Simple HTTP infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_infer_client.py)
[Simple HTTP model control](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_model_control.py)
[Simple HTTP sequence sync infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_sequence_sync_infer_client.py)
[Simple HTTP shm client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_shm_client.py)
[Simple HTTP shm string client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_shm_string_client.py)
[Simple HTTP string infer client](https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_string_infer_client.py)

</details>

### Model Analyzer

<details>
<summary>Details</summary>

[Install](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/install.md)
[CLI](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/cli.md)
[Config](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/config.md#objective)
[Config search](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/config_search.md#quick-search-mode)
[Ensemble quick start](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/ensemble_quick_start.md)
[Kubernetes deploy](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/kubernetes_deploy.md)
[Launch modes](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/launch_modes.md)
[Metrics](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/metrics.md)
[Reports](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/report.md)
[Multi-model quick start](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/mm_quick_start.md)
[BLS model quick start](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/bls_quick_start.md)
[Checkpointing in Model Analyzer](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/checkpoints.md)

</details>

### Model Navigator

An inference toolkit designed for optimizing and deploying deep learning models with a focus on NVIDIA GPUs. The Triton Model Navigator streamlines the process of moving models and pipelines implemented in PyTorch, TensorFlow, and/or ONNX to TensorRT.

<details>
<summary>Details</summary>

[Optimize Torch linear model](https://github.com/triton-inference-server/model_navigator/tree/main/examples/01_optimize_torch_linear_model)
[Optimize and verify model](https://github.com/triton-inference-server/model_navigator/tree/main/examples/02_optimize_and_verify_model)
[Python profile function](https://github.com/triton-inference-server/model_navigator/tree/main/examples/21_nav_profile_python_function)

</details>

### Backends

A Triton backend is the implementation that executes a model. A backend can be a wrapper around a deep-learning framework, like PyTorch, TensorFlow, TensorRT, or ONNX Runtime. A backend can also be custom C/C++ logic performing any operation (for example, image pre-processing).

<details>
<summary>Python Backend</summary>

[Business logic scripting](https://github.com/triton-inference-server/python_backend?tab=readme-ov-file#business-logic-scripting)
[Add sub](https://github.com/triton-inference-server/python_backend/tree/main/examples/add_sub)
[BLS decoupled](https://github.com/triton-inference-server/python_backend/tree/main/examples/bls_decoupled)
[Custom metrics](https://github.com/triton-inference-server/python_backend/tree/main/examples/custom_metrics)

</details>

<details>
<summary>ONNX Runtime Backend</summary>

[ONNX Runtime with TensorRT EP](https://github.com/triton-inference-server/onnxruntime_backend#onnx-runtime-with-tensorrt-optimization)
[ONNX Runtime with CUDA EP](https://github.com/triton-inference-server/onnxruntime_backend?tab=readme-ov-file#onnx-runtime-with-cuda-execution-provider-optimization)
[ONNX Runtime with OpenVINO](https://github.com/triton-inference-server/onnxruntime_backend?tab=readme-ov-file#onnx-runtime-with-openvino-optimization)
[Other optimizations with ONNX Runtime](https://github.com/triton-inference-server/onnxruntime_backend?tab=readme-ov-file#other-optimization-options-with-onnx-runtime)

</details>

<details>
<summary>TensorRT Backend</summary>

[Intro to Notebooks](https://github.com/NVIDIA/TensorRT/tree/main/quickstart/IntroNotebooks)
[Semantic segmentation](https://github.com/NVIDIA/TensorRT/tree/main/quickstart/SemanticSegmentation)
[Deploy to Triton](https://github.com/NVIDIA/TensorRT/tree/main/quickstart/deploy_to_triton)
[Quantization tutorial](https://github.com/NVIDIA/TensorRT/blob/main/quickstart/quantization_tutorial/qat-ptq-workflow.ipynb)
[Torch-TensorRT with Triton](https://pytorch.org/TensorRT/tutorials/serving_torch_tensorrt_with_triton.html)

</details>

<details>
<summary>DALI Backend</summary>

[Training to inference](https://github.com/triton-inference-server/dali_backend/blob/main/docs/tutorials/training_to_inference.md)

Examples

[DALI plugin](https://github.com/triton-inference-server/dali_backend/tree/main/docs/examples/dali_plugin)
[EfficientNet](https://github.com/triton-inference-server/dali_backend/tree/main/docs/examples/efficientnet)
[Inception ensemble](https://github.com/triton-inference-server/dali_backend/tree/main/docs/examples/inception_ensemble)
[Perf Analyzer](https://github.com/triton-inference-server/dali_backend/tree/main/docs/examples/perf_analyzer)
[ResNet50 TRT](https://github.com/triton-inference-server/dali_backend/tree/main/docs/examples/resnet50_trt)

</details>

<details>
<summary>PyTorch Backend</summary>

[Docs](https://github.com/triton-inference-server/pytorch_backend)

</details>

<details>
<summary>PaddlePaddle Backend</summary>

[Quick start](https://github.com/triton-inference-server/paddlepaddle_backend)
[Model configuration](https://github.com/triton-inference-server/paddlepaddle_backend/blob/main/docs/model_configuration.md)
[Examples](https://github.com/triton-inference-server/paddlepaddle_backend/tree/main/examples)

</details>

<details>
<summary>FasterTransformer Backend</summary>

[FasterTransformer](https://github.com/NVIDIA/FasterTransformer/)
[FasterTransformer backend](https://github.com/triton-inference-server/fastertransformer_backend)

</details>

### PyTriton

A Flask/FastAPI-like framework designed to streamline the use of NVIDIA's Triton Inference Server within Python environments. PyTriton enables serving machine learning models with ease, supporting direct deployment from Python.

<details>
<summary>Quick Start</summary>

[Add sub notebook](https://github.com/triton-inference-server/pytriton/tree/main/examples/add_sub_notebook)
[Hugging Face ResNet PyTorch](https://github.com/triton-inference-server/pytriton/tree/main/examples/huggingface_resnet_pytorch)

</details>

# Quick start

### Setup docker images

Pull the NVIDIA Triton server image:

```bash!
docker pull nvcr.io/nvidia/tritonserver:24.06-py3
```

Pull the NVIDIA Triton client (SDK) image for inference:

```bash!
docker pull nvcr.io/nvidia/tritonserver:24.06-py3-sdk
```

### Triton server

Create and run a container for the Triton server:

```bash!
docker run --gpus=1 --rm -p8000:8000 -p8001:8001 -p8002:8002 -v d/Documents/GitHub/MY-REPO/triton/model_repository:/models nvcr.io/nvidia/tritonserver:24.06-py3 tritonserver --model-repository=/models
```

Using a docker-compose file is recommended:

```yaml!
services:
  triton-server:
    image: nvcr.io/nvidia/tritonserver:24.06-py3
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: tritonserver --model-repository=/models --model-control-mode=explicit --load-model=densenet_onnx
    ports:
      - "8000:8000"
      - "8001:8001"
      - "8002:8002"
    volumes:
      - ../model_repository:/models
    environment:
      - NVIDIA_VISIBLE_DEVICES=1
```

Then open a bash shell in the container (or attach a shell in VS Code) to follow the logs:

```bash!
docker-compose ps
docker-compose exec triton-server bash
```

### Triton client

Create and run a container for the Triton client:

```bash!
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:24.06-py3-sdk
```

Then, inside the container shell, run the prebuilt `image_client` binary:

```bash!
/workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg
```

We can also write our own client that talks to the server over the `http` or `grpc` protocol:

```python!
import numpy as np
import requests

# Inference endpoint of the densenet_onnx model on the Triton HTTP port
url = "http://localhost:8000/v2/models/densenet_onnx/infer"

# Create input data (example: an array of zeros)
input_data = np.zeros((3, 224, 224), dtype=np.float32)

# Describe the input tensor in the JSON format expected by the server
inputs = [
    {
        "name": "data_0",
        "shape": list(input_data.shape),
        "datatype": "FP32",
        "data": input_data.tolist(),
    }
]

# Name of the output tensor to return
outputs = [{"name": "fc6_1"}]

request_payload = {"inputs": inputs, "outputs": outputs}

# Send the request to the Triton server
response = requests.post(url, json=request_payload)

# Check the response status
if response.status_code == 200:
    response_json = response.json()
    print(response_json.keys())
    first_output = response_json["outputs"][0]
    # The data may arrive flattened, so restore the reported shape
    output_data = np.array(first_output["data"]).reshape(first_output["shape"])
    print("Output Data: ", output_data)
else:
    print("Request failed with status code: ", response.status_code)
    print("Response: ", response.text)
```

To run the client as a service, use this `docker-compose.yml` in the client directory:

```yaml!
services:
  triton-client:
    image: nvcr.io/nvidia/tritonserver:24.06-py3-sdk
    network_mode: host
    tty: true
    stdin_open: true
    restart: unless-stopped
    volumes:
      - ../:/workspace/inference/
```

### Model analyzer

Create the output directory first to avoid an error (it should match the `--output-model-repository-path` used below):

```bash
mkdir -p model_output/output
```

Start the Triton server with the docker-compose file above, then run the SDK container so that Model Analyzer can connect to the running server:

```bash!
docker run -it --gpus all -v /var/run/docker.sock:/var/run/docker.sock -v d/Documents/GitHub/MY-REPO/triton/model_analyzer:/workspace/model_analyzer --net=host nvcr.io/nvidia/tritonserver:24.06-py3-sdk
```

Now, inside the container, run Triton Model Analyzer:

```bash!
model-analyzer profile \
    --model-repository /workspace/model_analyzer/ \
    --profile-models densenet_onnx --triton-launch-mode=remote \
    --output-model-repository-path /workspace/model_analyzer/model_output/output \
    --export-path /workspace/model_analyzer/profile_results \
    --override-output-model-repository
```

To limit the number of experiments for a quick test, append these flags:

```bash!
--run-config-search-max-concurrency 2
--run-config-search-max-model-batch-size 2
--run-config-search-max-instance-count 2
```

## Other savings

**Configuration of medical segmentation models**

```config!
name: "UNet_ImageCHD_128"
backend: "onnxruntime"
dynamic_batching { }
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ -1, 1, 128, 128 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ -1, 8, 128, 128 ]
  }
]
instance_group [
  {
    kind: KIND_GPU
  }
]
optimization {
  execution_accelerators {
    gpu_execution_accelerator: [
      {
        name: "tensorrt"
        parameters { key: "precision_mode" value: "FP16" }
        parameters { key: "max_workspace_size_bytes" value: "4294967296" }
        parameters { key: "trt_engine_cache_enable" value: "true" }
        parameters { key: "trt_engine_cache_path" value: "/models/UNet_ImageCHD_128/1" }
      }
    ]
  }
}
```

**Perf analyzer**

```bash!
perf_analyzer -m text_reg_batch -b 2 --shape input.1:1,32,100 --concurrency-range 2:16:2 --percentile=95
```

**Perf with Dynamic Batching, Instance group 2, TensorRT acceleration on GPU RTX 3060**

```bash!
*** Measurement Settings ***
  Batch size: 2
  Service Kind: TRITON
  Using "time_windows" mode for stabilization
  Stabilizing using p95 latency
  Measurement window: 5000 msec
  Latency limit: 0 msec
  Concurrency limit: 16 concurrent requests
  Using synchronous calls for inference

Request concurrency: 2
  Client:
    Request count: 1191
    Throughput: 132.291 infer/sec
    p50 latency: 30573 usec
    p90 latency: 35938 usec
    p95 latency: 37684 usec
    p99 latency: 40669 usec
    Avg HTTP time: 30191 usec (send/recv 143 usec + response wait 30048 usec)
  Server:
    Inference count: 2384
    Execution count: 1131
    Successful request count: 1192
    Avg request latency: 29519 usec (overhead 78 usec + queue 228 usec + compute input 101 usec + compute infer 29073 usec + compute output 38 usec)

Request concurrency: 4
  Client:
    Request count: 1308
    Throughput: 145.275 infer/sec
    p50 latency: 52808 usec
    p90 latency: 67876 usec
    p95 latency: 73463 usec
    p99 latency: 83525 usec
    Avg HTTP time: 64641 usec (send/recv 158 usec + response wait 64483 usec)
  Server:
    Inference count: 2616
    Execution count: 947
    Successful request count: 1308
    Avg request latency: 63791 usec (overhead 115 usec + queue 15192 usec + compute input 210 usec + compute infer 48223 usec + compute output 50 usec)

Request concurrency: 6
  Client:
    Request count: 1682
    Throughput: 186.767 infer/sec
    p50 latency: 63495 usec
    p90 latency: 78766 usec
    p95 latency: 82845 usec
    p99 latency: 91286 usec
    Avg HTTP time: 64136 usec (send/recv 165 usec + response wait 63971 usec)
  Server:
    Inference count: 3364
    Execution count: 846
    Successful request count: 1682
    Avg request latency: 63154 usec (overhead 147 usec + queue 19362 usec + compute input 315 usec + compute infer 43270 usec + compute output 59 usec)

Request concurrency: 8
  Client:
    Request count: 1995
    Throughput: 221.592 infer/sec
    p50 latency: 71015 usec
    p90 latency: 87392 usec
    p95 latency: 93610 usec
    p99 latency: 104270 usec
    Avg HTTP time: 72134 usec (send/recv 153 usec + response wait 71981 usec)
  Server:
    Inference count: 3990
    Execution count: 747
    Successful request count: 1995
    Avg request latency: 71172 usec (overhead 179 usec + queue 21863 usec + compute input 364 usec + compute infer 48702 usec + compute output 63 usec)

Request concurrency: 10
  Client:
    Request count: 2237
    Throughput: 248.444 infer/sec
    p50 latency: 79531 usec
    p90 latency: 98043 usec
    p95 latency: 103645 usec
    p99 latency: 113315 usec
    Avg HTTP time: 80393 usec (send/recv 154 usec + response wait 80239 usec)
  Server:
    Inference count: 4474
    Execution count: 680
    Successful request count: 2237
    Avg request latency: 79428 usec (overhead 195 usec + queue 25674 usec + compute input 443 usec + compute infer 53053 usec + compute output 61 usec)

Request concurrency: 12
  Client:
    Request count: 2484
    Throughput: 275.839 infer/sec
    p50 latency: 85227 usec
    p90 latency: 102343 usec
    p95 latency: 108260 usec
    p99 latency: 119996 usec
    Avg HTTP time: 86938 usec (send/recv 145 usec + response wait 86793 usec)
  Server:
    Inference count: 4968
    Execution count: 624
    Successful request count: 2484
    Avg request latency: 85997 usec (overhead 228 usec + queue 28372 usec + compute input 458 usec + compute infer 56875 usec + compute output 63 usec)

Request concurrency: 14
  Client:
    Request count: 2472
    Throughput: 274.518 infer/sec
    p50 latency: 105608 usec
    p90 latency: 122302 usec
    p95 latency: 125339 usec
    p99 latency: 132344 usec
    Avg HTTP time: 101783 usec (send/recv 148 usec + response wait 101635 usec)
  Server:
    Inference count: 4944
    Execution count: 623
    Successful request count: 2472
    Avg request latency: 100887 usec (overhead 250 usec + queue 43205 usec + compute input 422 usec + compute infer 56942 usec + compute output 67 usec)

Request concurrency: 16
  Client:
    Request count: 2476
    Throughput: 274.966 infer/sec
    p50 latency: 115776 usec
    p90 latency: 127253 usec
    p95 latency: 130799 usec
    p99 latency: 137202 usec
    Avg HTTP time: 116403 usec (send/recv 145 usec + response wait 116258 usec)
  Server:
    Inference count: 4952
    Execution count: 619
    Successful request count: 2476
    Avg request latency: 115387 usec (overhead 205 usec + queue 57256 usec + compute input 430 usec + compute infer 57429 usec + compute output 66 usec)

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 2, throughput: 132.291 infer/sec, latency 37684 usec
Concurrency: 4, throughput: 145.275 infer/sec, latency 73463 usec
Concurrency: 6, throughput: 186.767 infer/sec, latency 82845 usec
Concurrency: 8, throughput: 221.592 infer/sec, latency 93610 usec
Concurrency: 10, throughput: 248.444 infer/sec, latency 103645 usec
Concurrency: 12, throughput: 275.839 infer/sec, latency 108260 usec
Concurrency: 14, throughput: 274.518 infer/sec, latency 125339 usec
Concurrency: 16, throughput: 274.966 infer/sec, latency 130799 usec
```
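
**Reading the summary**

The summary lines at the end of the report can be checked with a few lines of Python. This is only a rough sketch for this note: the helper names (`parse_summary`, `saturation_point`, `expected_throughput`) and the 2% gain threshold are assumptions made here, not part of perf_analyzer. It parses the summary, looks for the concurrency after which throughput stops improving, and compares the measured throughput with a simple estimate (concurrency × batch size ÷ average request time), which is reasonable here because the run uses synchronous calls.

```python
import re

# Summary lines copied from the perf_analyzer report in this note.
SUMMARY = """\
Concurrency: 2, throughput: 132.291 infer/sec, latency 37684 usec
Concurrency: 4, throughput: 145.275 infer/sec, latency 73463 usec
Concurrency: 6, throughput: 186.767 infer/sec, latency 82845 usec
Concurrency: 8, throughput: 221.592 infer/sec, latency 93610 usec
Concurrency: 10, throughput: 248.444 infer/sec, latency 103645 usec
Concurrency: 12, throughput: 275.839 infer/sec, latency 108260 usec
Concurrency: 14, throughput: 274.518 infer/sec, latency 125339 usec
Concurrency: 16, throughput: 274.966 infer/sec, latency 130799 usec
"""

LINE_RE = re.compile(
    r"Concurrency: (\d+), throughput: ([\d.]+) infer/sec, latency (\d+) usec"
)


def parse_summary(text):
    """Return (concurrency, throughput in infer/sec, p95 latency in usec) tuples."""
    return [(int(c), float(t), int(lat)) for c, t, lat in LINE_RE.findall(text)]


def saturation_point(rows, min_gain=0.02):
    """Return the first row whose successor improves throughput by less than min_gain.

    min_gain is a relative threshold (hypothetical default: 2%).
    """
    for current, following in zip(rows, rows[1:]):
        if following[1] < current[1] * (1 + min_gain):
            return current
    return rows[-1]


def expected_throughput(concurrency, batch_size, avg_request_usec):
    """Estimate infer/sec when each of `concurrency` clients sends one batch at a time."""
    return concurrency * batch_size / (avg_request_usec / 1_000_000)


rows = parse_summary(SUMMARY)
best = saturation_point(rows)
print("Throughput stops improving at concurrency", best[0])

# Concurrency 2, batch size 2, "Avg HTTP time" of 30191 usec from the report.
print("Estimated infer/sec:", round(expected_throughput(2, 2, 30191), 1))
```

In this run the throughput levels off at concurrency 12 (about 275 infer/sec), while the p95 latency and the server queue time keep growing beyond that point, so higher concurrency only adds waiting time.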
