[Cloud / Tool] Triton Server
===
> [[Cloud]](/1s3Doa57Rk6cgEZdYBWrVQ)

###### tags: `Cloud / Tool`
###### tags: `Cloud`, `Tool`, `Triton`, `Triton Server`

<br>

[TOC]

<br>

> Test platform: ESC4000

:::info
:information_source: **Current focus items for Triton:**
- https://github.com/triton-inference-server/python_backend
  Explains how to make Triton Server load a Python model and serve inference, and ships a client.py showing how to run inference.
- https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html
  The support matrix lists the driver / CUDA versions required by each Triton Server release.
- https://github.com/triton-inference-server/fil_backend
  The FIL backend can serve XGBoost, LightGBM, Scikit-Learn, and cuML models.
- https://github.com/triton-inference-server/server/tree/main/docs/protocol
  Besides the standard inference protocol (the KFServing community standard inference protocols), Triton also provides several extensions; this could be a good place to start.
  Classification extension: TJ Tsai (蔡宗容) will try this one out and use it to call the classification model Victor set up.
- [Scheduling And Batching](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#scheduling-and-batching)
:::

<br>

## Introduction
- ### [[Official] Simplifying and Scaling Inference Serving with NVIDIA Triton 2.3](https://blogs.nvidia.com.tw/2021/01/26/simplifying-and-scaling-inference-serving-with-triton-2-3/)
- ### [Triton Inference Server – a condensed guide](https://zhuanlan.zhihu.com/p/366555962)

<br>

## GitHub
> [triton-inference-server](https://github.com/triton-inference-server)

### [server](https://github.com/triton-inference-server/server)
- [server/docs/protocol/](https://github.com/triton-inference-server/server/tree/main/docs/protocol)
    - [Classification extension](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_classification.md)

<br>
<hr>
<br>

## ==Dashboard (Grafana)==
http://10.78.26.241:3800/
<!-- admin / ocistn1234 -->

<br>

## ==server==

### docker command
> Basic command

```bash=
export models=/home/diatango_lin/tj_tsai/workspace/infra/triton_server/models

docker run \
  --gpus=1 \
  --rm \
  -p 9000:8000 -p 9001:8001 -p 9002:8002 \
  -v $models:/models \
  nvcr.io/nvidia/tritonserver:21.06-py3 \
  tritonserver \
  --model-store=/models \
  --strict-model-config=false
```

- ### Where the command comes from
    - [server/docs/quickstart.md - Quickstart](https://github.com/triton-inference-server/server/blob/main/docs/quickstart.md)
    - [Run on System with GPUs](https://github.com/triton-inference-server/server/blob/main/docs/quickstart.md#run-on-system-with-gpus)
- ### Notes on the parameters
    - ports
        - Port 8000 is the HTTP endpoint
        - Port 8001 is the GRPC endpoint
        - Port 8002 serves the metrics
        - [tritonserver startup parameters](https://zhuanlan.zhihu.com/p/366555962)
    - **GPU parameters**
      `--cuda-memory-pool-byte-size=0:6442450944` (GPU 0, 6 GB pool)
      ![](https://i.imgur.com/wOlS6Bj.png)
      The metrics nevertheless stay at 2986 MiB (3131047936 B).
      :::warning
      **Victor's note:**
      https://github.com/triton-inference-server/server/blob/main/src/servers/main.cc#L473
      None of the flags Triton accepts at startup limits GPU memory; the closest one, `cuda-memory-pool-byte-size`, only sets the size of the memory pool requested from CUDA each time.
      :::
    - **Model source**
      `--model-store=<model directory>`
      `--model-repository=<model directory>`
      The two flags behave identically.
      **Step 1**: mount the host directory into the container with `-v`
      **Step 2**: then tell Triton where the model root directory is
    <br>
    - **Model configuration file**
      > (describes `--strict-model-config`) If true, a model configuration file must be provided and all required configuration settings must be specified. If false, the model configuration may be absent or only partially specified, and the server will attempt to derive the missing required configuration.
      - [[Azure] Define the model configuration file](https://docs.microsoft.com/zh-tw/azure/machine-learning/how-to-deploy-with-triton?tabs=azcli)
        [![](https://i.imgur.com/c4A4uNu.png)](https://i.imgur.com/c4A4uNu.png)

### Version selection
- [21.xx Framework Containers Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html)
    - Container image version
    - Version currently tested on ESC4000: 21.06
- Checked on 2021/08/19:
  > The GeForce GTX 1080 Ti is already supported up to driver 470
  > ![](https://i.imgur.com/twYYFpX.png)
  >
  > - SUPPORTED PRODUCTS
  >   **GeForce 10 Series:**
  >   GeForce GTX 1080 Ti, GeForce GTX 1080, GeForce GTX 1070 Ti, GeForce GTX 1070, GeForce GTX 1060, GeForce GTX 1050 Ti, GeForce GTX 1050, GeForce GT 1030, GeForce GT 1010
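Once the container above is running, a quick way to confirm the endpoint is reachable is the Python client from the `tritonclient` package (the same package that ships in the `*-py3-sdk` image used later). A minimal sketch, assuming the HTTP port was published as 9000 as in the `docker run` command above:

```python
# Minimal liveness/readiness check against the HTTP endpoint published above.
# Assumes `pip install tritonclient[http]` and that container port 8000 was
# mapped to host port 9000 as in the docker run command above.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:9000")

print("live:  ", client.is_server_live())       # GET /v2/health/live
print("ready: ", client.is_server_ready())      # GET /v2/health/ready
print("server:", client.get_server_metadata())  # GET /v2 -> name / version / extensions
```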
### GPU configuration
- `--gpus=1` assigns GPU 0
  ```
  +-----------------------------------------------------------------------------+
  | Processes:                                                                   |
  |  GPU   GI   CI        PID   Type   Process name                  GPU Memory  |
  |        ID   ID                                                   Usage       |
  |=============================================================================|
  |    0   N/A  N/A    541915      C   tritonserver                      199MiB  |
  +-----------------------------------------------------------------------------+
  ```
- **Pointing at a GPU that does not exist**
  i.e. only GPU 0 is assigned, but GPU 1 is selected for execution
  ```json=
  instance_group [
    {
      count: 6
      kind: KIND_GPU
      gpus: [ 1 ]   <---
    }
  ]
  ```
  Loading the model through the API then fails with:
  > {"error":"failed to load 'resnet18_onnx',
  > no version is available"}
- `--gpus=3` assigns GPUs 0, 1, 2
  ```
  +-----------------------------------------------------------------------------+
  | Processes:                                                                   |
  |  GPU   GI   CI        PID   Type   Process name                  GPU Memory  |
  |        ID   ID                                                   Usage       |
  |=============================================================================|
  |    0   N/A  N/A    537186      C   tritonserver                      199MiB  |
  |    1   N/A  N/A    537186      C   tritonserver                      199MiB  |
  |    2   N/A  N/A    537186      C   tritonserver                      199MiB  |
  +-----------------------------------------------------------------------------+
  ```

<br>

### Triton architecture
- [Triton Architecture](https://github.com/triton-inference-server/server/blob/main/docs/architecture.md#models-and-schedulers)

<br>
<hr>
<br>

## ==Client==

### Quick Start
- ### Step 1: Start the Triton Server
  ```bash=
  export MODELS=/home/diatango_lin/tj_tsai/workspace/infra/triton_server/workspace/models

  docker run \
    --gpus=2 \
    --rm \
    -d \
    -p 9000:8000 -p 9001:8001 -p 9002:8002 \
    -v $MODELS:/models \
    nvcr.io/nvidia/tritonserver:21.06-py3 \
    tritonserver \
    --model-store=/models \
    --strict-model-config=false \
    --model-control-mode=explicit
  ```
    - ### Container management
        - `-d`: run as a daemon in the background, detached from the terminal
        - then keep tailing the log with `docker logs`
          `$ docker logs -f 4d81f08a5770`
            - `-f, --follow`: Follow log output
    - ### Model management
        - model source
          `--model-store=<model directory>`
          `--model-repository=<model directory>`
        - `--model-control-mode=explicit`
            - allows models to be loaded/unloaded dynamically
- ### Step 2: Get a shell inside the Triton Server container
  ```bash=
  $ export CONTAINER_ID=4d81f08a5770   # example
  $ docker exec -it $CONTAINER_ID bash

  # ----------------------------------
  # Inside the container; the working directory is /opt/tritonserver#

  # Download the python-based model example
  $ git clone https://github.com/triton-inference-server/python_backend -b r21.06
  $ cd python_backend

  # Copy the example files under /models/add_sub
  $ mkdir -p /models/add_sub/1/
  $ cp examples/add_sub/model.py /models/add_sub/1/model.py
  $ cp examples/add_sub/config.pbtxt /models/add_sub/config.pbtxt

  # Inspect the /models directory
  $ apt update && apt install tree
  $ tree /models
  /models
  `-- add_sub
      |-- 1
      |   |-- __pycache__
      |   |   `-- model.cpython-38.pyc
      |   `-- model.py
      `-- config.pbtxt

  3 directories, 3 files
  ```
  From another terminal you can see it immediately with
  ```bash=
  $ curl -v -X POST $URL/v2/repository/index | jq
  ```
- ### Step 3: Start another docker container as the client
  ```bash=
  # Use the SDK image that already has the Python packages installed
  $ docker run \
      --name triton_client \
      --rm -it \
      --net host \
      nvcr.io/nvidia/tritonserver:21.06-py3-sdk \
      /bin/bash

  # Download the python-based client
  $ git clone https://github.com/triton-inference-server/python_backend -b r21.06

  # Run the example client
  $ python3 python_backend/examples/add_sub/client.py
  ```
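Step 3 runs the repository's client.py; for reference, roughly the same call can be written directly with `tritonclient`. This is only a sketch (the real script lives in `python_backend/examples/add_sub/client.py`), and it assumes the add_sub model has been loaded and the server's HTTP port is reachable on localhost:9000 (the mapping used in Step 1):

```python
# Rough Python equivalent of examples/add_sub/client.py (sketch only).
# Assumes the add_sub model is loaded and the server is reachable on localhost:9000.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:9000")

# add_sub takes two FP32 vectors of length 4 (INPUT0, INPUT1)
x = np.array([1, 2, 3, 4], dtype=np.float32)
y = np.array([1, 1, 1, 1], dtype=np.float32)

inputs = [
    httpclient.InferInput("INPUT0", [4], "FP32"),
    httpclient.InferInput("INPUT1", [4], "FP32"),
]
inputs[0].set_data_from_numpy(x)
inputs[1].set_data_from_numpy(y)

outputs = [
    httpclient.InferRequestedOutput("OUTPUT0"),
    httpclient.InferRequestedOutput("OUTPUT1"),
]

result = client.infer("add_sub", inputs, outputs=outputs)
print("X + Y =", result.as_numpy("OUTPUT0"))
print("X - Y =", result.as_numpy("OUTPUT1"))
```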
- :::spoiler `pip list`
  ```
  root@stage-kube01:/workspace# pip list
  Package               Version
  --------------------- --------------------
  attrdict              2.0.1
  certifi               2019.11.28
  cffi                  1.14.5
  chardet               3.0.4
  cryptography          3.4.7
  cycler                0.10.0
  dbus-python           1.2.16
  distro                1.5.0
  docker                5.0.0
  gevent                21.1.2
  geventhttpclient      1.4.4
  greenlet              1.1.0
  grpcio                1.38.0
  grpcio-tools          1.38.0
  httplib2              0.19.1
  idna                  2.8
  kiwisolver            1.3.1
  llvmlite              0.36.0
  matplotlib            3.4.2
  numba                 0.53.1
  numpy                 1.20.3
  pdfkit                0.6.1
  Pillow                8.2.0
  pip                   21.1.2
  prometheus-client     0.11.0
  protobuf              3.17.3
  psutil                5.8.0
  pycparser             2.20
  PyGObject             3.36.0
  pyparsing             2.4.7
  python-apt            2.0.0+ubuntu0.20.4.5
  python-dateutil       2.8.1
  python-rapidjson      1.0
  PyYAML                5.4.1
  requests              2.25.1
  requests-unixsocket   0.2.0
  setuptools            57.0.0
  six                   1.14.0
  triton-model-analyzer 1.5.0
  tritonclient          2.11.0
  urllib3               1.25.8
  websocket-client      1.1.0
  wheel                 0.36.2
  zope.event            4.5.0
  zope.interface        5.4.0
  ```
  :::

<br>
<hr>
<br>

## ==curl==

### jupyter notebook
- http://10.78.26.241:48888/?token=a10d6a3470df3a77b328f9975d63e69ace8661bee127f474

### Check the Triton Server version
```bash=
$ curl $URL/v2 | jq

# Do not add a trailing slash after v2, or you will get 400 Bad Request
```
```json=
{
  "name": "triton",
  "version": "2.11.0",
  "extensions": [
    "classification",
    "sequence",
    "model_repository",
    "model_repository(unload_dependents)",
    "schedule_policy",
    "model_configuration",
    "system_shared_memory",
    "cuda_shared_memory",
    "binary_tensor_data",
    "statistics"
  ]
}
```

<br>

### Check whether the Triton Server is running
- [[Command source] Verify Triton Is Running Correctly](https://github.com/triton-inference-server/server/blob/main/docs/quickstart.md#verify-triton-is-running-correctly)
```bash=
$ curl -v $URL/v2/health/ready
```
```=
*   Trying 10.78.26.241...
* Connected to 10.78.26.241 (10.78.26.241) port 8000 (#0)
> GET /v2/health/ready HTTP/1.1
> Host: 10.78.26.241:8000
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK          <---
< Content-Length: 0
< Content-Type: text/plain
<
* Connection #0 to host 10.78.26.241 left intact
```

<br>

### Check the Triton Server metrics
> #grafana (push the node's metrics to Grafana and show them on a dashboard)
> #prometheus (monitoring of the compute nodes)
> #counter #gauge #metric #indicator #performance #utilization #monitoring

```bash=
# Note: the port is 8002
$ curl 10.78.26.241:8002/metrics
```
```
# HELP nv_inference_request_success Number of successful inference requests, all batch sizes
# TYPE nv_inference_request_success counter
nv_inference_request_success{gpu_uuid="GPU-a2453a5f-9719-8c03-20da-961cea94e94d",model="resnet",version="1"} 0.000000
nv_inference_request_success{gpu_uuid="GPU-f9dc5b48-214b-7431-8d67-f73f58eb3316",model="resnet",version="1"} 0.000000
...
...
# HELP nv_energy_consumption GPU energy consumption in joules since the Triton Server started
# TYPE nv_energy_consumption counter
nv_energy_consumption{gpu_uuid="GPU-a2453a5f-9719-8c03-20da-961cea94e94d"} 818303.081000
nv_energy_consumption{gpu_uuid="GPU-f9dc5b48-214b-7431-8d67-f73f58eb3316"} 1140971.888000
```
- ### How to read it
    - ### Inference requests: success count
      > #inference_request_success
      ```
      # HELP nv_inference_request_success Number of successful inference requests, all batch sizes
      # TYPE nv_inference_request_success counter
      nv_inference_request_success{gpu_uuid="GPU-a2453a5f-9719-8c03-20da-961cea94e94d",model="resnet",version="1"} 0.000000
      nv_inference_request_success{gpu_uuid="GPU-f9dc5b48-214b-7431-8d67-f73f58eb3316",model="resnet",version="1"} 0.000000
      ```
    - ### Inference requests: failure count
      > #inference_request_failure
      ```
      # HELP nv_inference_request_failure Number of failed inference requests, all batch sizes
      # TYPE nv_inference_request_failure counter
      nv_inference_request_failure{gpu_uuid="GPU-a2453a5f-9719-8c03-20da-961cea94e94d",model="resnet",version="1"} 0.000000
      nv_inference_request_failure{gpu_uuid="GPU-f9dc5b48-214b-7431-8d67-f73f58eb3316",model="resnet",version="1"} 0.000000
      ```
    - ### GPU utilization
      > #gpu_utilization
      ```
      # HELP nv_gpu_utilization GPU utilization rate [0.0 - 1.0)
      # TYPE nv_gpu_utilization gauge
      nv_gpu_utilization{gpu_uuid="GPU-a2453a5f-9719-8c03-20da-961cea94e94d"} 0.000000
      nv_gpu_utilization{gpu_uuid="GPU-f9dc5b48-214b-7431-8d67-f73f58eb3316"} 0.000000
      ```
      0.0 means the GPU is idle
    - ### GPU memory usage
      > #gpu_memory_used_bytes
      ```
      # HELP nv_gpu_memory_used_bytes GPU used memory, in bytes
      # TYPE nv_gpu_memory_used_bytes gauge
      nv_gpu_memory_used_bytes{gpu_uuid="GPU-a2453a5f-9719-8c03-20da-961cea94e94d"} 1271922688.000000
      nv_gpu_memory_used_bytes{gpu_uuid="GPU-f9dc5b48-214b-7431-8d67-f73f58eb3316"} 1150287872.000000
      ```
        - The two GPUs are using 1.27 GB and 1.15 GB respectively
    - ### GPU power usage (in watts)
      > #gpu_power_usage
      ```
      # HELP nv_gpu_power_usage GPU power usage in watts
      # TYPE nv_gpu_power_usage gauge
      nv_gpu_power_usage{gpu_uuid="GPU-a2453a5f-9719-8c03-20da-961cea94e94d"} 8.024000
      nv_gpu_power_usage{gpu_uuid="GPU-f9dc5b48-214b-7431-8d67-f73f58eb3316"} 8.311000
      ```
        - The two GPUs are drawing 8.0 W and 8.3 W respectively
- ### doc: [server/docs/metrics.md](https://github.com/triton-inference-server/server/blob/main/docs/metrics.md)
- ### Grouped by measurement type
    - Counter: 0, 1, 2, 3, ... (only ever increases)
    - Gauge: [0.0, 1.0], [0.0, X] (fluctuates within a range)
    - Latency: e.g. 10 ms
    - Histogram: min/max, mean, median, quantiles
    - event frequency, and so on
- ### Grouped as in the official docs
    - [Gauge] GPU Utilization
        - instantaneous power draw / maximum power
        - energy consumption, GPU utilization
    - [Gauge] GPU Memory
        - used / total
    - Count
        - request count, inference count, execution count
    - Latency
        - request time = queue time + compute time
        - Example 1: a single image inference

          | time | µs |
          |------|----|
          | request_duration_us | 2042477 |
          | queue_duration_us | 1534 |
          | compute_input_duration_us | 932 |
          | compute_infer_duration_us | 2039865 |
          | compute_output_duration_us | 33 |
            - queue time = 1534
            - compute time = 932 + 2039865 + 33 = 2040830
            - request time = 1534 + 2040830 = 2042364 (reported: 2042477)
        - Example 2: 1000 image inferences

          | time | µs |
          |------|----|
          | request_duration_us | 20403631 |
          | queue_duration_us | 3451709 |
          | compute_input_duration_us | 660467 |
          | compute_infer_duration_us | 16000553 |
          | compute_output_duration_us | 43172 |
- Regex for filtering the metrics output
  find:
  ```regexp
  ^[\s|\d]*nv_inference_(\w+){.+}\s*(\d+).*
  ```
  replace:
  ```
  \1: \2
  ```
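The same filtering can be scripted instead of doing find/replace by hand. A small sketch that pulls the metrics endpoint and applies essentially the regex above (slightly adapted to keep the decimal part of each value; `requests` is already present in the SDK image per the `pip list` above, and the host/port are the ones used in this note):

```python
# Fetch the Prometheus-format metrics and keep only the nv_inference_* values,
# mirroring the find/replace regex above (adapted to keep decimals).
import re
import requests

text = requests.get("http://10.78.26.241:8002/metrics").text

pattern = re.compile(r"^\s*nv_inference_(\w+)\{.+\}\s*([\d.]+)", re.MULTILINE)
for name, value in pattern.findall(text):
    print(f"{name}: {value}")
```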
### List the models on the Triton Server
```bash=
$ curl -v -X POST $URL/v2/repository/index | jq
```
- Result

:::spoiler `curl -v -X POST $URL/v2/repository/index | jq`
```=
*   Trying 10.78.26.241...
> POST /v2/repository/index HTTP/1.1
> Host: 10.78.26.241:8000
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< Content-Length: 98
<
* Connection #0 to host 10.78.26.241 left intact
[
  {
    "name": "add_sub",
    "version": "1",
    "state": "READY"
  },
  {
    "name": "resnet",
    "version": "1",
    "state": "READY"
  }
]
```
:::
- Note
    - A model shows up in the list even if it has not been loaded yet
      > even if it is not currently loaded into Triton

<br>

### [Model management] Loading / unloading models dynamically
- ### [API overview] [server/docs/protocol/extension_model_repository.md](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_model_repository.md)
  > allows a client to query and control the one or more model repositories being served by Triton
  ```
  POST v2/repository/index
  POST v2/repository/models/${MODEL_NAME}/load
  POST v2/repository/models/${MODEL_NAME}/unload
  ```
- Walkthrough

:::warning
:warning: **Summary: how the index changes**
The model root directory was pre-populated with the add_sub model files.
- **STEP 1: index** ```[{"name":"add_sub"}]```
- **STEP 2: load**
- **STEP 3: index** ```[{"name":"add_sub","version":"1","state":"READY"}]```
- **STEP 4: unload**
- **STEP 5: index** ```[{"name":"add_sub","version":"1","state":"UNAVAILABLE","reason":"unloaded"}]```
- **STEP 6: load**
- **STEP 7: index** ```[{"name":"add_sub","version":"1","state":"READY"}]```
:::

:::spoiler `curl -v -X POST $URL/v2/repository/index`
```
*   Trying 10.78.26.241...
* Connected to 10.78.26.241 (10.78.26.241) port 9000 (#0)
> POST /v2/repository/index HTTP/1.1
> Host: 10.78.26.241:9000
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< Content-Length: 76
<
* Connection #0 to host 10.78.26.241 left intact
[{"name":"add_sub"}]
```
:::

:::spoiler `curl -v -X POST $URL/v2/repository/models/add_sub/load`
```
*   Trying 10.78.26.241...
* Connected to 10.78.26.241 (10.78.26.241) port 9000 (#0)
> POST /v2/repository/models/add_sub/load HTTP/1.1
> Host: 10.78.26.241:9000
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< Content-Length: 0
<
* Connection #0 to host 10.78.26.241 left intact
```
:::

:::spoiler `curl -v -X POST $URL/v2/repository/index`
```
*   Trying 10.78.26.241...
* Connected to 10.78.26.241 (10.78.26.241) port 9000 (#0)
> POST /v2/repository/index HTTP/1.1
> Host: 10.78.26.241:9000
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< Content-Length: 50
<
* Connection #0 to host 10.78.26.241 left intact
[{"name":"add_sub","version":"1","state":"READY"}]
```
:::

:::spoiler `curl -v -X POST $URL/v2/repository/models/add_sub/unload`
```
*   Trying 10.78.26.241...
* Connected to 10.78.26.241 (10.78.26.241) port 9000 (#0)
> POST /v2/repository/models/add_sub/unload HTTP/1.1
> Host: 10.78.26.241:9000
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< Content-Length: 0
<
* Connection #0 to host 10.78.26.241 left intact
```
:::

:::spoiler `curl -v -X POST $URL/v2/repository/index`
```
*   Trying 10.78.26.241...
* Connected to 10.78.26.241 (10.78.26.241) port 9000 (#0)
> POST /v2/repository/index HTTP/1.1
> Host: 10.78.26.241:9000
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< Content-Length: 76
<
* Connection #0 to host 10.78.26.241 left intact
[{"name":"add_sub","version":"1","state":"UNAVAILABLE","reason":"unloaded"}]
```
:::
- Notes
    - After running unload, querying the current model config returns an error

      :::spoiler `curl -v $URL/v2/models/add_sub/config`
      ```
      {"error":"Request for unknown model: 'add_sub' has no available versions"}
      ```
      :::
    <br>
    - :warning: If the model keeps files open (even if nothing has changed), calling load again without unloading first hangs the model completely
    <br>
    - Other cases where the model fails to load for no obvious reason

      :::spoiler Server log
      ```
      python.cc:1489] TRITONBACKEND_ModelInstanceInitialize: add_sub_0 (CPU device 0)
      python.cc:600] Stub process successfully restarted.
      python.cc:373] Failed to process the batch of requests.
      python.cc:603] Stub process failed to restart. Your future requests to model add_sub_0 will fail.
      Error: Failed to initialize stub, stub process exited unexpectedly: add_sub_0
      ```
      :::

      :::spoiler `! curl -v -X POST $URL/v2/repository/models/add_sub/load`
      ```json=
      {"error":"load failed for model 'add_sub' : version 1 : Internal : Unable to initialize shared memory key '/add_sub_0_CPU_0' to requested size (67108864 bytes). If you are running Triton inside docker, use '--shm-size' flag to control the shared memory region size. Each Python backend model instance requires at least 64MBs of shared memory. Flag '--shm-size=5G' should be sufficient for common usecases. Error: No such file or directory;\n"}
      ```
      :::
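The same load/unload walkthrough can be driven from the Python client instead of curl. A minimal sketch, assuming the server was started with `--model-control-mode=explicit` (see the parameter notes below) and the 9000→8000 port mapping used earlier:

```python
# Drive the repository index / load / unload cycle from Python instead of curl.
# Requires --model-control-mode=explicit on the server (see the notes below).
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:9000")

print(client.get_model_repository_index())  # STEP 1: index

client.load_model("add_sub")                 # STEP 2: load
print(client.get_model_repository_index())  # STEP 3: state should be READY

client.unload_model("add_sub")               # STEP 4: unload
print(client.get_model_repository_index())  # STEP 5: state UNAVAILABLE ("unloaded")
```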
- ### [Parameter notes] [server/docs/model_management.md](https://github.com/triton-inference-server/server/blob/main/docs/model_management.md)
    - ### [Model Control Mode NONE](https://github.com/triton-inference-server/server/blob/main/docs/model_management.md#model-control-mode-none) (**fixed once started**)
      `--model-control-mode=none`
        - At startup the server loads every model by default
        - The model repository cannot be changed afterwards
        - The load/unload APIs return an error
          `{"error":"explicit model load / unload is not allowed if polling is enabled"}`
            - [Error when load model with http api #2633](https://github.com/triton-inference-server/server/issues/2633)
    - ### [Model Control Mode EXPLICIT](https://github.com/triton-inference-server/server/blob/main/docs/model_management.md#model-control-mode-explicit) (**entirely API-driven**)
      `--model-control-mode=explicit`
        - The server does not load any model automatically
        - The `--load-model` command-line option specifies models to pre-load
        - Loading / unloading is done explicitly through the API
    - ### [Model Control Mode POLL](https://github.com/triton-inference-server/server/blob/main/docs/model_management.md#model-control-mode-poll) (**periodic scanning**)
      ```
      --model-control-mode=poll
      --repository-poll-secs=5
      ```
        - At startup the server loads every model by default
        - Newly added models are picked up automatically on the next scan
        - The scan interval is controlled by `--repository-poll-secs`, in seconds
- ### TritonServer startup parameters
    - `--model-control-mode <string>`
      > Specify the mode for model management. Options are "none", "poll" and "explicit". The default is "none".
      >
      > For "none", the server will load all models in the model repository(s) at startup and will not make any changes to the load models after that.
      >
      > For "poll", the server will poll the model repository(s) to detect changes and will load/unload models based on those changes. The poll rate is controlled by 'repository-poll-secs'.
      >
      > For "explicit", model load and unload is initiated by using the model control APIs, and only models specified with --load-model will be loaded at startup.
    - `--repository-poll-secs <integer>`
      > Interval in seconds between each poll of the model repository to check for changes. Valid only when --model-control-mode=poll is specified.
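In poll mode the server picks up repository changes asynchronously, so a client may want to wait until a newly dropped model actually reports ready before sending traffic. A small client-side sketch (the model name and timeout here are illustrative only):

```python
# Wait until a model the server is expected to pick up (e.g. via poll mode) is ready.
# Model name and timeout are illustrative; adjust as needed.
import time
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:9000")

deadline = time.time() + 60  # give up after 60 s
while not client.is_model_ready("add_sub"):
    if time.time() > deadline:
        raise TimeoutError("model did not become ready in time")
    time.sleep(2)
print("add_sub is ready")
```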
<br>
<hr>
<br>

## ==Model Configuration==

### Multi-model, multi-version layout
```
models/
└── add_sub
    ├── 1
    │   ├── model.py
    │   └── __pycache__
    │       └── model.cpython-38.pyc
    ├── 2
    │   ├── model.py
    │   └── __pycache__
    │       └── model.cpython-38.pyc
    ├── 3
    │   ├── model.py
    │   └── __pycache__
    │       └── model.cpython-38.pyc
    └── config.pbtxt
```
- Model Repository
    - [server/docs/model_repository.md](https://github.com/triton-inference-server/server/blob/main/docs/model_repository.md)
- Model Configuration ([Chinese translation 1](https://blog.csdn.net/sophia_xw/article/details/107009697), [Chinese translation 2](https://www.twblogs.net/a/5eff6fb041b2036d50959650))
    - [server/docs/model_configuration.md](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#version-policy)
    - [Minimal Model Configuration](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#minimal-model-configuration)
    - [Datatypes](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#datatypes)
    - [Version Policy](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#version-policy)
    - [Instance Groups](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#instance-groups)
    - [Scheduling And Batching](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#scheduling-and-batching)
    - [Warmup](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#model-warmup)

<br>

### The config can be generated automatically
- [Auto-Generated Model Configuration](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#auto-generated-model-configuration)
- Example
  ```
  $ curl -v $URL/v2/models/resnet18_onnx/versions/1/config | jq
  ```

<br>

### A model with multiple versions
> #ModelVersionPolicy
- [Version Policy](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#version-policy)
- **all**
  `version_policy: { all { }}`
- **specific**
  `version_policy: { specific { versions: [2] }}`
    - With v1, v2 and v3 present, only v2 is loaded
- **latest**
  `version_policy: { latest { num_versions : 2 }}`
    - With v1, v2 and v3 present, only the latest two (v2 & v3) are loaded

<br>

### A model with multiple instances
- ### [Instance Groups](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#instance-groups)
  ```json=
  instance_group [
    {
      count: 1
      kind: KIND_GPU
      gpus: [ 0 ]
    },
    {
      count: 2
      kind: KIND_GPU
      gpus: [ 1, 2 ]
    }
  ]
  ```
    - places one execution instance on GPU 0
    - places two execution instances on GPUs 1 and 2
- ### [Concurrent Model Execution](https://github.com/triton-inference-server/server/blob/main/docs/architecture.md#models-and-schedulers)
    - ### Single instance
      ![](https://i.imgur.com/nMNWoTy.png)
    - ### Three instances
      ![](https://i.imgur.com/tfKsxtE.png)

<br>

### Batching
- [[Model Configuration] Maximum Batch Size](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#maximum-batch-size)
  ```
  max_batch_size: 8
  ```
    - For models that do not support batching, `max_batch_size` must be set to 0
    - `max_batch_size: 1` means each model execution handles at most 1 request
    - `max_batch_size: 2` means each model execution can handle up to 2 requests at once
- [[Model Configuration] Scheduling And Batching](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#scheduling-and-batching)
  ```
  dynamic_batching {
    preferred_batch_size: [ 4, 8 ]
    max_queue_delay_microseconds: 100
  }
  ```
- [[Nvidia][Performance] Batching techniques: Dynamic Batching](https://ngc.nvidia.com/catalog/resources/nvidia:jasper_for_trtis/performance)
  ![](https://i.imgur.com/wZlN4Rp.png)
  ![](https://i.imgur.com/68yCmZp.png)
    - [Drawback] At low numbers of concurrent requests, the increased throughput comes at the cost of increased latency, as requests are queued up to max_queue_delay_microseconds.
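Since a `max_batch_size` greater than 0 adds an implicit leading batch dimension (see the tensor-dimension notes in the next subsection), a client can also ship several samples in one request. A sketch against the hypothetical `input0`/`output0` config used below; the model name `mymodel` is a placeholder:

```python
# Client-side batching: with max_batch_size > 0 the model's first dimension is
# the batch dimension, so one request can carry several samples at once.
# "mymodel", "input0" and "output0" follow the hypothetical config example below.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:9000")

batch = np.random.rand(4, 16).astype(np.float32)  # 4 samples, each of shape [16]

inp = httpclient.InferInput("input0", list(batch.shape), "FP32")
inp.set_data_from_numpy(batch)

result = client.infer("mymodel", [inp],
                      outputs=[httpclient.InferRequestedOutput("output0")])
print(result.as_numpy("output0").shape)  # expected (4, 4) for the example config
```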
- ### resnet18_onnx experiments
    - max_batch_size: 4, 8, or 12
- ### Input & output tensor dimensions
  > [Maximum Batch Size](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#maximum-batch-size)
  >
  > If the model's batch dimension is the first dimension, and all inputs and outputs to the model have this batch dimension, then Triton can use its **dynamic batcher** or **sequence batcher** to automatically use batching with the model.
  >
  > In this case ++**`max_batch_size` should be set to a value greater-or-equal-to 1**++ that indicates the maximum batch size that Triton should use with the model.
  >
  > keywords: #max_batch_size
    - `max_batch_size` > 0
      ```json
      platform: "tensorrt_plan"
      max_batch_size: 8
      input [
        {
          name: "input0"
          data_type: TYPE_FP32
          dims: [ 16 ]
        }
      ]
      output [
        {
          name: "output0"
          data_type: TYPE_FP32
          dims: [ 4 ]
        }
      ]
      ```
        - input0.shape is [ -1, 16 ]
        - output0.shape is [ -1, 4 ]
    - `max_batch_size` = 0
      ```
      platform: "tensorrt_plan"
      max_batch_size: 0
      input [
        {
          name: "input0"
          data_type: TYPE_FP32
          dims: [ 16 ]
        }
      ]
      output [
        {
          name: "output0"
          data_type: TYPE_FP32
          dims: [ 4 ]
        }
      ]
      ```
        - input0.shape is [ 16 ]
        - output0.shape is [ 4 ]
    - POST `v2/repository/models/resnet18_onnx/load`
      > {
      > "error":"load failed for model 'resnet18_onnx'
      > : version 1
      > : Invalid argument
      > : model 'resnet18_onnx',
      >
      > tensor 'data': for the model to support batching the shape should have at least 1 dimension and the first dimension must be -1; but shape expected by the model is [1,3,224,224];\n"}
    - [debug] GET `v2/models/resnet18_onnx/config`
      > {
      > "error": "Request for unknown model: 'resnet18_onnx' has no available versions"
      > }
    - [debug] POST `v2/repository/index`
      > [{
      > "name": "resnet18_onnx",
      > "version": "1",
      > "state": "UNAVAILABLE",
      > "reason": "Invalid argument: model 'resnet18_onnx', tensor 'data': for the model to support batching the shape should have at least 1 dimension and the first dimension must be -1; but shape expected by the model is [1,3,224,224]"
      > }]
- ### inception_graphdef
  > [Batching techniques: Dynamic Batching](https://ngc.nvidia.com/catalog/resources/nvidia:jasper_for_trtis/performance)
  > You can set the Dynamic Batcher parameter max_queue_delay_microseconds to indicate the maximum amount of time you are willing to wait and preferred_batch_size to indicate your maximum server batch size in the Triton Inference Server model config.
    - ### Summary table
      **batch_size**: `preferred_batch_size`
      **delay**: `max_queue_delay_microseconds`
      **infer**: `compute_infer_duration_us`

      | batch_size | delay (ms) | exec_count | infer_duration (ms) |
      | --------------| ----:| ----:| ---------:|
      | 1 | 0 | 1000 | 11613.672 |
      | 1 | 300 | 1000 | 14089.080 |
      | 1,2,3,4,5,6,7 | 3 | 864 | 22666.319 |
      | 1,2,3,4,5,6,7 | 30 | 786 | 18522.102 |
      | 1,2,3,4,5,6,7 | 300 | 851 | 13347.668 |
      | 1,2,3,4,5,6,7 | 900 | 833 | 14471.256 |
      | 1,2,3,4,5,6,7 | 1500 | 823 | 14108.980 |
      | 16,32,64 | 3 | 828 | 17609.888 |
      | 16,32,64 | 30 | 796 | 16468.601 |
      | 16,32,64 | 300 | 773 | 17444.561 |
      | 16,32,64 | 3000 | 762 | 16688.968 |
      | 16,32,64 | 30000 | 485 | 19989.942 |
      | 128,192,256 | 3 | 833 | 17901.207 |
      | 128,192,256 | 30 | 774 | 15446.807 |
      | 128,192,256 | 300 | 843 | 16350.169 |
      | 128,192,256 | 900 | 822 | 13331.119 |
      | 128,192,256 | 1500 | 112 | 4801102.823 (abnormal) |
      | 128,192,256 | 1500 | 795 | 17051.370 |
      | 1000 | 30000 | 485 | 19989.942 |
      | 1024 | 30000 | 467 | 19315.535 |

        - The metrics occasionally show sporadic anomalies; one example is kept in the table above
            - Sometimes exec_count is even higher than request_success
        - batch_size is still limited by GPU memory
          > 2021-08-26 08:52:17.358634: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at conv_ops.cc:504 : **Resource exhausted: OOM** when allocating tensor with **shape[1000,64,147,147]** and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
        - For 1000 requests, just uploading the images takes 21.623 s
        - Changing `max_batch_size` from 128 to 0 produces an error
          ```
          inference failed: 2 root error(s) found.
            (0) Invalid argument: transpose expects a vector of size 3. But input(1) is a vector of size 4
                 [[{{node InceptionV3/InceptionV3/Conv2d_1a_3x3/BatchNorm/batchnorm/mul-0-TransposeNHWCToNCHW-LayoutOptimizer}}]]
            (1) Invalid argument: transpose expects a vector of size 3. But input(1) is a vector of size 4
                 [[{{node InceptionV3/InceptionV3/Conv2d_1a_3x3/BatchNorm/batchnorm/mul-0-TransposeNHWCToNCHW-LayoutOptimizer}}]]
                 [[InceptionV3/Predictions/Softmax/_3]]
          0 successful operations.
          0 derived errors ignored.
          ```
        - Changing `max_batch_size` from 0 back to 1 works again
          ```
          Request 1, batch size 1
              0.826453 (505) = COFFEE MUG
          PASS
          ```
        - preferred_batch_size must be a positive integer
          > **Server log:**
          > dynamic batching preferred size must be positive for inception_graphdef
        - Even with the delay set to 0, batching still happens
          ```
          max_batch_size: 1
          dynamic_batching {
            preferred_batch_size: [ 1 ]
            max_queue_delay_microseconds: 0
          }
          ```
    - ### max_batch_size: 128, delay = 3 ms (3000 µs)
      ```
      max_batch_size: 128
      dynamic_batching {
        preferred_batch_size: [ 1,2,4,8,16,32,64,128 ]
        max_queue_delay_microseconds: 3000
      }
      ```
      ```
      # Run 1 (warm-up to avoid a cold start)
      # (runs 1 and 2 look about the same, though)
      request_success: 1
      request_failure: 0
      count: 1
      exec_count: 1
      request_duration_us: 2090404
      queue_duration_us: 919
      compute_input_duration_us: 306
      compute_infer_duration_us: 2089016
      compute_output_duration_us: 35
      elapsed_time: 0.13400506973266602
      ====================================
      # Run 2
      request_success: 2
      request_failure: 0
      count: 2
      exec_count: 2
      request_duration_us: 2119275
      queue_duration_us: 1012
      compute_input_duration_us: 616
      compute_infer_duration_us: 2117238
      compute_output_duration_us: 72
      elapsed_time: 0.1343550682067871
      ====================================
      # Run 1002 (after 1000 more runs)
      request_success: 1002
      request_failure: 0
      count: 1002
      exec_count: 884
      request_duration_us: 17447609
      queue_duration_us: 2441652
      compute_input_duration_us: 411897
      compute_infer_duration_us: 14360615
      compute_output_duration_us: 40784
      elapsed_time: 27.938815116882324
      ```
        - A single inference took about 2 s (2,089,016 µs); 884 executions should then take at least ~1,768 s — why does the cumulative infer duration show only ~14 s?
        - `elapsed_time` is noticeably larger than `request_duration_us`; presumably uploading the image to the Triton Server accounts for the difference
        - 1000 runs on the client side take about 30.17 s, averaged over five trials
            - (27.939 + 28.193 + 29.293 + 32.706 + 32.719) / 5
        - From the client's point of view, one inference takes about 30.17 ms
    - ### max_batch_size: 128, delay = 30 ms (30000 µs)
      ```
      # Run 1002 (after 1000 more runs)
      request_success: 1002
      request_failure: 0
      count: 1002
      exec_count: 852
      request_duration_us: 22679764
      queue_duration_us: 4400009
      compute_input_duration_us: 620710
      compute_infer_duration_us: 17345940
      compute_output_duration_us: 46439
      elapsed time: 30.09218406677246
      ```
    - ### max_batch_size: 128, delay = 300 ms (300000 µs)
      ```
      # Run 1 (warm-up to avoid a cold start)
      # (runs 1 and 2 look about the same, though)
      request_success: 1
      request_failure: 0
      count: 1
      exec_count: 1
      request_duration_us: 2004857
      queue_duration_us: 1541
      compute_input_duration_us: 798
      compute_infer_duration_us: 2002351
      compute_output_duration_us: 28
      elapsed time: 0.1342918872833252
      ====================================
      # Run 2
      request_success: 2
      request_failure: 0
      count: 2
      exec_count: 2
      request_duration_us: 2034243
      queue_duration_us: 1621
      compute_input_duration_us: 1143
      compute_infer_duration_us: 2031084
      compute_output_duration_us: 64
      elapsed time: 0.1337299346923828
      ====================================
      # Run 1002 (after 1000 more runs)
      request_success: 1002
      request_failure: 0
      count: 1002
      exec_count: 880
      request_duration_us: 20369753
      queue_duration_us: 3527719
      compute_input_duration_us: 510323
      compute_infer_duration_us: 16011901
      compute_output_duration_us: 44640
      elapsed time: 29.128565073013306
      ```
        - 1000 runs on the client side take about 31.1 s, averaged over five trials
            - (29.129 + 28.715 + 31.847 + 32.209 + 33.770) / 5
        - From the client's point of view, one inference takes about 31.1 ms
    - ### max_batch_size: 1
      ```
      max_batch_size: 1
      dynamic_batching {
        preferred_batch_size: [ 1,2,4,8,16,32,64,128 ]
        max_queue_delay_microseconds: 30000
      }
      ```
        - The server log shows
          > Poll failed for model directory 'inception_graphdef': dynamic batching preferred size must be <= max batch size for inception_graphdef
        - The client sees
          > {"error":"failed to load 'inception_graphdef', failed to poll from model repository"}
        - Metrics for a normal run
          ```
          request_success: 1002
          request_failure: 0
          count: 1002
          exec_count: 1002                 <--- same as request_success
          request_duration_us: 21850290
          queue_duration_us: 7397434
          compute_input_duration_us: 310838
          compute_infer_duration_us: 13945969
          compute_output_duration_us: 36390
          ```
        - 1000 runs on the client side take about 30.3 s, averaged over five trials
            - (27.931 + 28.231 + 31.997 + 31.697 + 31.685) / 5
        - From the client's point of view, one inference takes about 30.3 ms

### Warmup settings
> doc: [Warmup](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#model-warmup)
- [Example?](https://github.com/triton-inference-server/server/issues/708)
  ```json
  model_warmup {
    name: "4x4 Warmup"
    batch_size: 1
    inputs: {
      key: "input"
      value: {
        data_type: TYPE_FP32
        dims: 1024
        dims: 1024
        dims: 3
        zero_data: true
      }
    }
  }
  ```

<br>
<hr>
<br>

## App / [add_sub](https://github.com/triton-inference-server/python_backend/tree/main/examples/add_sub)
> - X=[x1, x2, x3, x4] (4 integers)
> - Y=[y1, y2, y3, y4] (4 integers)
> - computes X + Y
> - computes X - Y

- [config.pbtxt](https://github.com/triton-inference-server/python_backend/blob/main/examples/add_sub/config.pbtxt)
- [model.py](https://github.com/triton-inference-server/python_backend/blob/main/examples/add_sub/model.py)
- [client.py](https://github.com/triton-inference-server/python_backend/blob/main/examples/add_sub/client.py)
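For orientation, this is roughly the structure of a python_backend model such as the linked model.py: a `TritonPythonModel` class whose `execute()` turns each request's input tensors into output tensors. A simplified sketch based on the add_sub example, not a verbatim copy of the repository file:

```python
# Simplified sketch of a python_backend model (the shape of the linked model.py).
# The real file lives in python_backend/examples/add_sub/; this is not a verbatim copy.
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args carries the model config, instance kind, etc.
        pass

    def execute(self, requests):
        responses = []
        for request in requests:
            in0 = pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()
            in1 = pb_utils.get_input_tensor_by_name(request, "INPUT1").as_numpy()

            out0 = pb_utils.Tensor("OUTPUT0", (in0 + in1).astype(np.float32))
            out1 = pb_utils.Tensor("OUTPUT1", (in0 - in1).astype(np.float32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out0, out1]))
        return responses

    def finalize(self):
        pass
```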
### [Model management][explicit] Loading multiple versions of a model
> `--model-control-mode=explicit`

<!--
1. `version_policy: { all { }}`: with 3 versions available
    - calling load right away hangs
    - conclusion:
        - run unload first, then load
-->

1. ### Every reload (unload + load) hits insufficient shared memory
   ![](https://i.imgur.com/UGDbwOt.png)
   > Server log:
   >
   > E0819 10:49:44.820995 1 model_repository_manager.cc:1215] failed to load 'add_sub' version 2: Internal: Unable to initialize shared memory key '/add_sub_0_CPU_0' to requested size (67108864 bytes).
   >
   > If you are running Triton inside docker, use '`--shm-size`' flag to control the shared memory region size.
   >
   > Each Python backend model instance requires ==**at least 64MBs of shared memory**==.
   >
   > Flag ==**'`--shm-size=5G`'**== should be sufficient for common usecases. Error: No such file or directory
   >
   > I0819 10:49:44.923932 1 python.cc:1489] TRITONBACKEND_ModelInstanceInitialize: add_sub_0 (CPU device 0)
   - Note: `--shm-size` is a docker flag, not a Triton flag
2. ### The reload ultimately still did not succeed
   ```json
   [
     {
       "name": "add_sub",
       "version": "1",
       "state": "UNLOADING"   <--- the unload did not finish
     },
     {
       "name": "add_sub",
       "version": "2",
       "state": "READY"
     },
     {
       "name": "add_sub",
       "version": "3",
       "state": "READY"
     }
   ]
   ```

<br>

### Inspecting the model configuration of a specific version
```bash=
# latest version?
$ curl -v $URL/v2/models/add_sub/config | jq

# summary
$ curl -v $URL/v2/models/add_sub/versions/1 | jq

# full details
$ curl -v $URL/v2/models/add_sub/versions/1/config | jq
```
- Results

:::spoiler `$ curl -v $URL/v2/models/add_sub/versions/1 | jq`
```json=
{
  "name": "add_sub",
  "versions": [ "1" ],
  "platform": "python",
  "inputs": [
    { "name": "INPUT0", "datatype": "FP32", "shape": [ 4 ] },
    { "name": "INPUT1", "datatype": "FP32", "shape": [ 4 ] }
  ],
  "outputs": [
    { "name": "OUTPUT0", "datatype": "FP32", "shape": [ 4 ] },
    { "name": "OUTPUT1", "datatype": "FP32", "shape": [ 4 ] }
  ]
}
```
:::

:::spoiler `$ curl -v $URL/v2/models/add_sub/versions/1/config | jq`
```json=
{
  "name": "add_sub",
  "platform": "",
  "backend": "python",
  "version_policy": { "latest": { "num_versions": 1 } },
  "max_batch_size": 0,
  "input": [
    { "name": "INPUT0", "data_type": "TYPE_FP32", "format": "FORMAT_NONE", "dims": [ 4 ], "is_shape_tensor": false, "allow_ragged_batch": false },
    { "name": "INPUT1", "data_type": "TYPE_FP32", "format": "FORMAT_NONE", "dims": [ 4 ], "is_shape_tensor": false, "allow_ragged_batch": false }
  ],
  "output": [
    { "name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [ 4 ], "label_filename": "", "is_shape_tensor": false },
    { "name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [ 4 ], "label_filename": "", "is_shape_tensor": false }
  ],
  "batch_input": [],
  "batch_output": [],
  "optimization": {
    "priority": "PRIORITY_DEFAULT",
    "input_pinned_memory": { "enable": true },
    "output_pinned_memory": { "enable": true },
    "gather_kernel_buffer_threshold": 0,
    "eager_batching": false
  },
  "instance_group": [
    { "name": "add_sub_0", "kind": "KIND_CPU", "count": 1, "gpus": [], "secondary_devices": [], "profile": [], "passive": false, "host_policy": "" }
  ],
  "default_model_filename": "",
  "cc_model_filenames": {},
  "metric_tags": {},
  "parameters": {},
  "model_warmup": []
}
```
:::
- References
    - [server/docs/protocol/extension_model_configuration.md](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_model_configuration.md)

<br>

### Sending an add_sub request
```bash=
$ curl -H 'Content-Type: application/json' \
  --request POST \
  --data '{"inputs":[{"name":"INPUT0","shape":[4], "datatype":"FP32","data":[1,2,3,4]}, {"name":"INPUT1","shape":[4],"datatype":"FP32","data":[1,1,1,1]}]}' \
  http://localhost:9000/v2/models/add_sub/infer | jq
```
- ### The `--data` payload
  ```json=
  'inputs': [
    {
      "name": "INPUT0",
      "shape": [4],
      "datatype": "FP32",
      "data": [1,2,3,4]
    },
    {
      "name": "INPUT1",
      "shape": [4],
      "datatype": "FP32",
      "data": [1,1,1,1]
    }
  ]
  ```
- ### References
    - [model.py](https://github.com/triton-inference-server/python_backend/blob/main/examples/add_sub/model.py)
    - [client.py](https://github.com/triton-inference-server/python_backend/blob/main/examples/add_sub/client.py)
    - [[Classification Extension] HTTP/REST](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_classification.md)
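The Classification extension linked above is one of the focus items in the info box at the top: instead of the raw output tensor, Triton returns the top-k classes as strings combining the score, class index and (if a labels file is configured) the label. A hedged sketch from the Python client, assuming a classification model such as the inception_graphdef model used later; the input/output tensor names, shape and preprocessing here are placeholders that must match the actual model configuration:

```python
# Request top-3 classes via the classification extension instead of the raw tensor.
# Model, tensor names, shape and preprocessing are placeholders for illustration.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:9000")

image = np.zeros((1, 299, 299, 3), dtype=np.float32)  # placeholder preprocessed input

inp = httpclient.InferInput("input", list(image.shape), "FP32")
inp.set_data_from_numpy(image)

# class_count=3 asks Triton to return the top-3 classes as
# "<score>:<index>[:<label>]" strings rather than the full output tensor.
out = httpclient.InferRequestedOutput("output", class_count=3)

result = client.infer("inception_graphdef", [inp], outputs=[out])
print(result.as_numpy("output"))
```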
<br>

### [Statistics] Checking the app's metrics
```bash=
# metrics for all versions
$ curl -v $URL/v2/models/add_sub/stats | jq

# version 1 only
$ curl -v $URL/v2/models/add_sub/versions/1/stats | jq

# version 2 only
$ curl -v $URL/v2/models/add_sub/versions/2/stats | jq
```

<br>
<hr>
<br>

## App / Images test

### Requirements
```bash=
pip3 install numpy
pip3 install Pillow
pip3 install attrdict
pip install nvidia-pyindex
pip3 install tritonclient[all]
```

<br>

### onnx / models
- [resnet / model](https://github.com/onnx/models/tree/master/vision/classification/resnet/model)

<br>

### Python-based requests: [image_client.py](https://github.com/triton-inference-server/client/blob/main/src/python/examples/image_client.py)
```bash=
$ python3.6 image_client.py \
    -m resnet18_onnx \
    -s INCEPTION \
    images/mug.jpg \
    -x 1

Request 1, batch size 1
    16.953758 (504) = n03063599 coffee mug
PASS

$ python3.6 image_client.py \
    -m inception_graphdef \
    -s INCEPTION \
    images/mug.jpg \
    -x 2 \
    -u localhost:9000

Request 1, batch size 1
    0.826453 (505) = COFFEE MUG
PASS
```
- `-m` selects the model
- `-s` selects the image scaling / preprocessing mode
- `-x` selects the model version
    - `-x 1`: /models/resnet18_onnx/1
    - `-x 2`: /models/resnet18_onnx/2
- `-u localhost:9000` points the client at a different endpoint

### Batching?
- [Scheduling And Batching](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#scheduling-and-batching)

<br>
<hr>
<br>

## Sending requests non-stop
> Tested against app / add_sub

### Request handling rate
|        | Time     | Requests handled |
|--------| -------- | -------- |
|        | 14:45:30 | 36.5K |
|        | 16:32:45 | 1302546 |
| **Delta** | **6435 s** | **1266046** |
- 1266046 / 6435 = 196.7 requests/s
- roughly 0.0051 s (5.1 ms) per request

---

### 3 processes sending requests continuously (waiting for each response)
![](https://i.imgur.com/3tQmldi.png)
- The 3 processes were started one after another
- Latency: 1–3 ms (relatively stable)
- :::spoiler code
  ```python=
  # IPython/Jupyter cell: the leading ! runs curl in the shell
  for i in range(100000):
      if i % 100 == 0:
          print(i)
      !curl -H 'Content-Type: application/json' \
        --request POST \
        --data '{"inputs":[{"name":"INPUT0","shape":[4], "datatype":"FP32","data":[1,2,3,4]}, {"name":"INPUT1","shape":[4],"datatype":"FP32","data":[1,1,1,1]}]}' \
        http://localhost:8000/v2/models/add_sub/infer \
        2&> /dev/null
  ```
  :::

---

### 3 processes sending requests continuously (without waiting for responses)
![](https://i.imgur.com/xZxJfB1.png)
- The 3 processes were started one after another
- Latency: 100–200 ms (much more volatile)
  ![](https://i.imgur.com/GzxcAtC.png)
- :::spoiler code
  ```python=
  cmd='''
  curl -H 'Content-Type: application/json' \
    --request POST \
    --data '{"inputs":[{"name":"INPUT0","shape":[4], "datatype":"FP32","data":[1,2,3,4]}, {"name":"INPUT1","shape":[4],"datatype":"FP32","data":[1,1,1,1]}]}' \
    http://localhost:8000/v2/models/add_sub/infer \
    2&> /dev/null &
  '''
  import os
  for i in range(1000000):
      if i % 1000 == 0:
          print(i)
      os.system(cmd)
  ```
  :::
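Shelling out to curl as in the cells above pays the cost of spawning one process per request. The same stress test can be written with `tritonclient` and a thread pool, which also makes per-request latency easy to measure. A sketch only; thread count and request count are arbitrary, and a fresh client is created per call to sidestep any sharing concerns with the gevent-based HTTP client:

```python
# Python alternative to the curl loops above: worker threads hammering add_sub.
# Thread count and request count are arbitrary; adjust to taste.
from concurrent.futures import ThreadPoolExecutor
import time

import numpy as np
import tritonclient.http as httpclient


def one_request(_):
    # A fresh client per call keeps things simple; reuse one per thread if preferred.
    client = httpclient.InferenceServerClient(url="localhost:8000")
    inputs = [httpclient.InferInput("INPUT0", [4], "FP32"),
              httpclient.InferInput("INPUT1", [4], "FP32")]
    inputs[0].set_data_from_numpy(np.array([1, 2, 3, 4], dtype=np.float32))
    inputs[1].set_data_from_numpy(np.array([1, 1, 1, 1], dtype=np.float32))
    t0 = time.time()
    client.infer("add_sub", inputs)
    return time.time() - t0


with ThreadPoolExecutor(max_workers=3) as pool:
    latencies = list(pool.map(one_request, range(1000)))

print(f"{len(latencies)} requests, "
      f"mean latency {sum(latencies) / len(latencies) * 1000:.1f} ms")
```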
<br>
<hr>
<br>

## Triton Server stability notes
1. If the config or the curl command is malformed, problems show up right at the start
2. Some APIs stop responding, e.g. the unload API
3. The crashes seen so far happened during normal operation (fine for a while, then it falls over)

<br>
<hr>
<br>

## Deploying to K8s
> #TritonServer #Kubernetes #K8s #Deploy #Endpoint #CloudInfra
- ### [Deploy NVIDIA Triton Inference Server (Automated Deployment)](https://docs.netapp.com/us-en/hci-solutions/hciaiedge_deploy_nvidia_triton_inference_server_automated_deployment.html)
- ### `pvc-triton-model-repo.yaml`
  > `kubectl create -f pvc-triton-model-repo.yaml`
  ```yaml=
  kind: PersistentVolumeClaim
  apiVersion: v1
  metadata:
    name: triton-pvc
    namespace: triton
  spec:
    accessModes:
      - ReadWriteMany
    resources:
      requests:
        storage: 10Gi
    storageClassName: ontap-flexvol
  ```
- ### `triton_deployment.yaml`
  > - **Service**: two services, 3gpu and 1gpu
  > - **Deployment**
  ```yaml=
  ---
  apiVersion: v1
  kind: Service
  metadata:
    labels:
      app: triton-3gpu
    name: triton-3gpu
    namespace: triton
  spec:
    ports:
    - name: grpc-trtis-serving
      port: 8001
      targetPort: 8001
    - name: http-trtis-serving
      port: 8000
      targetPort: 8000
    - name: prometheus-metrics
      port: 8002
      targetPort: 8002
    selector:
      app: triton-3gpu
    type: LoadBalancer
  ---
  apiVersion: v1
  kind: Service
  metadata:
    labels:
      app: triton-1gpu
    name: triton-1gpu
    namespace: triton
  spec:
    ports:
    - name: grpc-trtis-serving
      port: 8001
      targetPort: 8001
    - name: http-trtis-serving
      port: 8000
      targetPort: 8000
    - name: prometheus-metrics
      port: 8002
      targetPort: 8002
    selector:
      app: triton-1gpu
    type: LoadBalancer
  ---
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    labels:
      app: triton-3gpu
    name: triton-3gpu
    namespace: triton
  spec:
    replicas: 1
    selector:
      matchLabels:
        app: triton-3gpu
        version: v1
    template:
      metadata:
        labels:
          app: triton-3gpu
          version: v1
      spec:
        containers:
        - image: nvcr.io/nvidia/tritonserver:20.07-v1-py3
          command: ["/bin/sh", "-c"]
          args: ["trtserver --model-store=/mnt/model-repo"]
          imagePullPolicy: IfNotPresent
          name: triton-3gpu
          ports:
          - containerPort: 8000
          - containerPort: 8001
          - containerPort: 8002
          resources:
            limits:
              cpu: "2"
              memory: 4Gi
              nvidia.com/gpu: 3
            requests:
              cpu: "2"
              memory: 4Gi
              nvidia.com/gpu: 3
          volumeMounts:
          - name: triton-model-repo
            mountPath: /mnt/model-repo
        nodeSelector:
          gpu-count: "3"
        volumes:
        - name: triton-model-repo
          persistentVolumeClaim:
            claimName: triton-pvc
  ---
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    labels:
      app: triton-1gpu
    name: triton-1gpu
    namespace: triton
  spec:
    replicas: 3
    selector:
      matchLabels:
        app: triton-1gpu
        version: v1
    template:
      metadata:
        labels:
          app: triton-1gpu
          version: v1
      spec:
        containers:
        - image: nvcr.io/nvidia/tritonserver:20.07-v1-py3
          command: ["/bin/sh", "-c", "sleep 1000"]
          args: ["trtserver --model-store=/mnt/model-repo"]
          imagePullPolicy: IfNotPresent
          name: triton-1gpu
          ports:
          - containerPort: 8000
          - containerPort: 8001
          - containerPort: 8002
          resources:
            limits:
              cpu: "2"
              memory: 4Gi
              nvidia.com/gpu: 1
            requests:
              cpu: "2"
              memory: 4Gi
              nvidia.com/gpu: 1
          volumeMounts:
          - name: triton-model-repo
            mountPath: /mnt/model-repo
        nodeSelector:
          gpu-count: "1"
        volumes:
        - name: triton-model-repo
          persistentVolumeClaim:
            claimName: triton-pvc
  ```

<br>
<hr>
<br>

## References
- ### [[Official] Quickstart](https://github.com/triton-inference-server/server/blob/main/docs/quickstart.md)
- ### [[Official] Q&A](https://forums.developer.nvidia.com/c/ai-data-science/deep-learning/triton-inference-server/97)
    - [Rest-api/curl command for posting images to Triton Inference server](https://forums.developer.nvidia.com/c/ai-data-science/deep-learning/triton-inference-server/97)
- ### [Triton Inference Server introduction and examples](https://roychou121.github.io/2020/07/20/nvidia-triton-inference-server/)
- ### [Triton model ensembles](https://developer.nvidia.com/blog/accelerating-inference-with-triton-inference-server-and-dali/)
  ![](https://i.imgur.com/351KZt2.png)
- ### [Scheduling And Batching](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#scheduling-and-batching)
- ### [TensorFlow BERT model config](https://blog.einstein.ai/benchmarking-tensorrt-inference-server/)
  ```json=
  name: "bert"
  platform: "tensorflow_savedmodel"
  max_batch_size: 64
  input {
    name: "input_0"
    data_type: TYPE_INT32
    dims: [ -1 ]
  }
  output {
    name: "output_0"
    data_type: TYPE_FP32
    dims: [ 2 ]
  }
  dynamic_batching {
    preferred_batch_size: [ 1,2,4,8,16,32,64 ]
    max_queue_delay_microseconds: 30000
  }
  version_policy: { latest { num_versions : 1 }}
  optimization { graph { level: 1 } }
  ```
- ### [[Azure] High-performance serving with Triton Inference Server (preview)](https://docs.microsoft.com/zh-tw/azure/machine-learning/how-to-deploy-with-triton?tabs=azcli)