[Cloud / Tool] Triton Server
===
> [[Cloud]](/1s3Doa57Rk6cgEZdYBWrVQ)
###### tags: `Cloud`, `Tool`, `Triton`, `Triton Server`
<br>
[TOC]
<br>
> Test platform: ESC4000
:::info
:information_source: **Current focus for Triton:**
- https://github.com/triton-inference-server/python_backend
  Shows how to have Triton Server load a Python model and serve inference,
  and includes a client.py demonstrating how to run inference
- https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html
  The support matrix,
  which lists the driver / CUDA version each Triton Server release requires
- https://github.com/triton-inference-server/fil_backend
  The FIL backend, which accepts XGBoost, LightGBM, Scikit-Learn, and cuML models
- https://github.com/triton-inference-server/server/tree/main/docs/protocol
  Besides the standard inference protocol (the KFServing community standard inference protocols), Triton also provides several extensions; this could be a good starting point.
  Classification extension: TJ Tsai (蔡宗容) will try this out by using it to call the classification model Victor set up
- [Scheduling And Batching](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#scheduling-and-batching)
:::
<br>
## Introduction
- ### [[Official] Simplifying and scaling inference serving with NVIDIA Triton 2.3](https://blogs.nvidia.com.tw/2021/01/26/simplifying-and-scaling-inference-serving-with-triton-2-3/)
- ### [Triton Inference Server – a simplified guide](https://zhuanlan.zhihu.com/p/366555962)
<br>
## GitHub
> [triton-inference-server](https://github.com/triton-inference-server)
### [server](https://github.com/triton-inference-server/server)
- [server/docs/protocol/](https://github.com/triton-inference-server/server/tree/main/docs/protocol)
- [Classification extension](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_classification.md)
<br>
<hr>
<br>
## ==Dashboard (Grafana)==
http://10.78.26.241:3800/
<!-- admin / ocistn1234 -->
<br>
## ==Server==
### docker command
> Basic command
```bash=
export models=/home/diatango_lin/tj_tsai/workspace/infra/triton_server/models
docker run \
--gpus=1 \
--rm \
-p 9000:8000 -p 9001:8001 -p 9002:8002 \
-v $models:/models \
nvcr.io/nvidia/tritonserver:21.06-py3 \
tritonserver \
--model-store=/models \
--strict-model-config=false
```
- ### Command source
- [server/docs/quickstart.md - Quickstart](https://github.com/triton-inference-server/server/blob/main/docs/quickstart.md)
- [Run on System with GPUs](https://github.com/triton-inference-server/server/blob/main/docs/quickstart.md#run-on-system-with-gpus)
- ### Parameter notes
    - Ports (see the port-check sketch at the end of this list)
        - Port 8000 is the HTTP endpoint
        - Port 8001 is the gRPC endpoint
        - Port 8002 serves the metrics
    - [tritonserver launch parameters](https://zhuanlan.zhihu.com/p/366555962)
    - **GPU parameters**
      `--cuda-memory-pool-byte-size=0:6442450944` (GPU 0, a 6 GB pool)

      However, the metrics stay at 2986 MiB (3131047936 B)
:::warning
**Victor's note:**
https://github.com/triton-inference-server/server/blob/main/src/servers/main.cc#L473
None of Triton's launch options limits GPU memory; the closest one, `--cuda-memory-pool-byte-size`, only sets the size of the memory pool requested from CUDA each time.
:::
    - **Model source**
      `--model-store=<model directory>`
      `--model-repository=<model directory>`
      Both options behave the same.
      **Step 1**: mount the host directory into the container with `-v`
      **Step 2**: then tell Triton where the model root directory is
<br>
    - **Model configuration file**
      > If true, a model configuration file must be provided and every required setting must be specified. If false, the model configuration may be missing or only partially specified, and the server will try to derive the missing required settings.
        - [[Azure] Defining the model configuration file](https://docs.microsoft.com/zh-tw/azure/machine-learning/how-to-deploy-with-triton?tabs=azcli)
          ![](https://i.imgur.com/c4A4uNu.png)
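A quick way to sanity-check the three ports from the host, assuming the `-p 9000:8000 -p 9001:8001 -p 9002:8002` mapping used in the command above (the `$URL` variable here is just a local convenience, matching how `$URL` is used in the curl section later):
```bash=
# Host ports follow the docker -p mapping above: 9000→8000 (HTTP), 9001→8001 (gRPC), 9002→8002 (metrics)
export URL=http://localhost:9000
curl -s -o /dev/null -w "%{http_code}\n" $URL/v2/health/ready   # expect 200 once the server is up
curl -s http://localhost:9002/metrics | head                    # Prometheus-format metrics
```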
### Version selection
- [21.xx Framework Containers Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html)
- Container image version
    - Version currently tested on ESC4000: 21.06
- Checked on 2021/08/19:
  > The GeForce GTX 1080 Ti is already supported up to driver 470
> 
> - SUPPORTED PRODUCTS
> **GeForce 10 Series:**
> GeForce GTX 1080 Ti, GeForce GTX 1080, GeForce GTX 1070 Ti, GeForce GTX 1070, GeForce GTX 1060, GeForce GTX 1050 Ti, GeForce GTX 1050, GeForce GT 1030, GeForce GT 1010
### GPU settings
- `--gpus=1`
  Allocates one GPU (GPU 0)
```
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 541915 C tritonserver 199MiB |
+-----------------------------------------------------------------------------+
```
- **Specifying a GPU that is not allocated**
  i.e. only GPU 0 is allocated, but GPU 1 is selected for execution
```json=
instance_group [
{
count: 6
kind: KIND_GPU
gpus: [ 1 ] <---
}
]
```
  Loading the model through the API then returns an error:
> {"error":"failed to load 'resnet18_onnx',
> no version is available"}
- `--gpus=3`
  Allocates three GPUs (GPU 0, 1, 2)
```
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 537186 C tritonserver 199MiB |
| 1 N/A N/A 537186 C tritonserver 199MiB |
| 2 N/A N/A 537186 C tritonserver 199MiB |
+-----------------------------------------------------------------------------+
```
<br>
### Triton architecture
- [Triton Architecture](https://github.com/triton-inference-server/server/blob/main/docs/architecture.md#models-and-schedulers)
<br>
<hr>
<br>
## ==Client==
### Quick Start
- ### Step 1: Start the Triton Server first
```bash=
export MODELS=/home/diatango_lin/tj_tsai/workspace/infra/triton_server/workspace/models
docker run \
--gpus=2 \
--rm \
-d \
-p 9000:8000 -p 9001:8001 -p 9002:8002 \
-v $MODELS:/models \
nvcr.io/nvidia/tritonserver:21.06-py3 \
tritonserver \
--model-store=/models \
--strict-model-config=false \
--model-control-mode=explicit
```
- ### Container management
    - `-d`: detached mode
      Runs in the background, decoupled from the terminal
        - Then keep streaming the log with `docker logs`
          `$ docker logs -f 4d81f08a5770`
            - `-f, --follow`: Follow log output
- ### Model management
    - Model source
      `--model-store=<model directory>`
      `--model-repository=<model directory>`
    - ```--model-control-mode=explicit```
        - Allows models to be loaded/unloaded dynamically (see the note right after this list)
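Because the server above runs with `--model-control-mode=explicit`, a model copied into `/models` still has to be loaded through the repository API before it can serve requests. A minimal sketch, using the `add_sub` model and the same `$URL` convention as in the curl section:
```bash=
# Explicitly load add_sub once its files are in place under /models
curl -X POST $URL/v2/repository/models/add_sub/load
# Confirm it shows up as READY
curl -s -X POST $URL/v2/repository/index | jq
```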
- ### Step 2: Connect into the Triton Server container
```bash=
$ export CONTAINER_ID=4d81f08a5770 # example
$ docker exec -it $CONTAINER_ID bash
# ----------------------------------
# Inside the container; the working directory is /opt/tritonserver
# Download the python-based model example
$ git clone https://github.com/triton-inference-server/python_backend -b r21.06
$ cd python_backend
# Copy the example files into /models/add_sub
$ mkdir -p /models/add_sub/1/
$ cp examples/add_sub/model.py /models/add_sub/1/model.py
$ cp examples/add_sub/config.pbtxt /models/add_sub/config.pbtxt
# Inspect the /models directory
$ apt update && apt install tree
$ tree /models
/models
`-- add_sub
|-- 1
| |-- __pycache__
| | `-- model.cpython-38.pyc
| `-- model.py
`-- config.pbtxt
3 directories, 3 files
```
From another terminal, you can see it immediately:
```bash=
$ curl -v -X POST $URL/v2/repository/index | jq
```
- ### Step 3: Start another Docker container to act as the client
```bash=
# Use the SDK image with the Python packages pre-installed
$ docker run \
--name triton_client \
--rm -it \
--net host \
nvcr.io/nvidia/tritonserver:21.06-py3-sdk \
/bin/bash
# Download the python-based client
$ git clone https://github.com/triton-inference-server/python_backend -b r21.06
# Run the example client
$ python3 python_backend/examples/add_sub/client.py
```
- :::spoiler `pip list`
```
root@stage-kube01:/workspace# pip list
Package Version
--------------------- --------------------
attrdict 2.0.1
certifi 2019.11.28
cffi 1.14.5
chardet 3.0.4
cryptography 3.4.7
cycler 0.10.0
dbus-python 1.2.16
distro 1.5.0
docker 5.0.0
gevent 21.1.2
geventhttpclient 1.4.4
greenlet 1.1.0
grpcio 1.38.0
grpcio-tools 1.38.0
httplib2 0.19.1
idna 2.8
kiwisolver 1.3.1
llvmlite 0.36.0
matplotlib 3.4.2
numba 0.53.1
numpy 1.20.3
pdfkit 0.6.1
Pillow 8.2.0
pip 21.1.2
prometheus-client 0.11.0
protobuf 3.17.3
psutil 5.8.0
pycparser 2.20
PyGObject 3.36.0
pyparsing 2.4.7
python-apt 2.0.0+ubuntu0.20.4.5
python-dateutil 2.8.1
python-rapidjson 1.0
PyYAML 5.4.1
requests 2.25.1
requests-unixsocket 0.2.0
setuptools 57.0.0
six 1.14.0
triton-model-analyzer 1.5.0
tritonclient 2.11.0
urllib3 1.25.8
websocket-client 1.1.0
wheel 0.36.2
zope.event 4.5.0
zope.interface 5.4.0
```
:::
<br>
<hr>
<br>
## ==curl==
### jupyter notebook
- http://10.78.26.241:48888/?token=a10d6a3470df3a77b328f9975d63e69ace8661bee127f474
### Check the Triton Server version
```bash=
$ curl $URL/v2 | jq
# Do not append a trailing slash after v2, or you get 400 Bad Request
```
```json=
{
"name": "triton",
"version": "2.11.0",
"extensions": [
"classification",
"sequence",
"model_repository",
"model_repository(unload_dependents)",
"schedule_policy",
"model_configuration",
"system_shared_memory",
"cuda_shared_memory",
"binary_tensor_data",
"statistics"
]
}
```
<br>
### Check whether Triton Server is running
- [[Command source] Verify Triton Is Running Correctly](https://github.com/triton-inference-server/server/blob/main/docs/quickstart.md#verify-triton-is-running-correctly)
```bash=
$ curl -v $URL/v2/health/ready
```
```=
* Trying 10.78.26.241...
* Connected to 10.78.26.241 (10.78.26.241) port 8000 (#0)
> GET /v2/health/ready HTTP/1.1
> Host: 10.78.26.241:8000
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK <---
< Content-Length: 0
< Content-Type: text/plain
<
* Connection #0 to host 10.78.26.241 left intact
```
<br>
### Check the Triton Server metrics
> #grafana (metrics from the node are sent to Grafana and shown on the dashboard)
> #prometheus (monitors the compute nodes)
> #counter #gauge #measure #metric #performance #utilization #monitoring
```bash=
# Note: the port is 8002
$ curl 10.78.26.241:8002/metrics
```
```json=
# HELP nv_inference_request_success Number of successful inference requests, all batch sizes
# TYPE nv_inference_request_success counter
nv_inference_request_success{gpu_uuid="GPU-a2453a5f-9719-8c03-20da-961cea94e94d",model="resnet",version="1"} 0.000000
nv_inference_request_success{gpu_uuid="GPU-f9dc5b48-214b-7431-8d67-f73f58eb3316",model="resnet",version="1"} 0.000000
...
...
# HELP nv_energy_consumption GPU energy consumption in joules since the Triton Server started
# TYPE nv_energy_consumption counter
nv_energy_consumption{gpu_uuid="GPU-a2453a5f-9719-8c03-20da-961cea94e94d"} 818303.081000
nv_energy_consumption{gpu_uuid="GPU-f9dc5b48-214b-7431-8d67-f73f58eb3316"} 1140971.888000
```
- ### How to read them
    - ### Inference requests: success count
> #inference_request_success
```
# HELP nv_inference_request_success Number of successful inference requests, all batch sizes
# TYPE nv_inference_request_success counter
nv_inference_request_success{gpu_uuid="GPU-a2453a5f-9719-8c03-20da-961cea94e94d",model="resnet",version="1"} 0.000000
nv_inference_request_success{gpu_uuid="GPU-f9dc5b48-214b-7431-8d67-f73f58eb3316",model="resnet",version="1"} 0.000000
```
    - ### Inference requests: failure count
> #inference_request_failure
```
# HELP nv_inference_request_failure Number of failed inference requests, all batch sizes
# TYPE nv_inference_request_failure counter
nv_inference_request_failure{gpu_uuid="GPU-a2453a5f-9719-8c03-20da-961cea94e94d",model="resnet",version="1"} 0.000000
nv_inference_request_failure{gpu_uuid="GPU-f9dc5b48-214b-7431-8d67-f73f58eb3316",model="resnet",version="1"} 0.000000
```
    - ### GPU utilization
> #gpu_utilization
```
# HELP nv_gpu_utilization GPU utilization rate [0.0 - 1.0)
# TYPE nv_gpu_utilization gauge
nv_gpu_utilization{gpu_uuid="GPU-a2453a5f-9719-8c03-20da-961cea94e94d"} 0.000000
nv_gpu_utilization{gpu_uuid="GPU-f9dc5b48-214b-7431-8d67-f73f58eb3316"} 0.000000
```
      0.0 means the GPU is idle
    - ### GPU memory usage
> #gpu_memory_used_bytes
```
# HELP nv_gpu_memory_used_bytes GPU used memory, in bytes
# TYPE nv_gpu_memory_used_bytes gauge
nv_gpu_memory_used_bytes{gpu_uuid="GPU-a2453a5f-9719-8c03-20da-961cea94e94d"} 1271922688.000000
nv_gpu_memory_used_bytes{gpu_uuid="GPU-f9dc5b48-214b-7431-8d67-f73f58eb3316"} 1150287872.000000
```
        - 1.27 GB and 1.15 GB in use, respectively
    - ### GPU power usage (watts)
> #gpu_power_usage
```
# HELP nv_gpu_power_usage GPU power usage in watts
# TYPE nv_gpu_power_usage gauge
nv_gpu_power_usage{gpu_uuid="GPU-a2453a5f-9719-8c03-20da-961cea94e94d"} 8.024000
nv_gpu_power_usage{gpu_uuid="GPU-f9dc5b48-214b-7431-8d67-f73f58eb3316"} 8.311000
```
        - 8.0 W and 8.3 W, respectively
- ### doc: [server/docs/metrics.md](https://github.com/triton-inference-server/server/blob/main/docs/metrics.md)
- ### By measurement type
    - Counter: 0, 1, 2, 3, ... (monotonically increasing)
    - Gauge: [0.0, 1.0], [0.0, X] (fluctuates within a range)
    - Latency: e.g. 10 ms
    - Histogram: min/max, mean, median, quantiles
        - event frequency, etc.
- ### As categorized in the official docs
    - [Gauge] GPU Utilization
        - instantaneous power usage / power limit
        - energy consumption, GPU utilization
    - [Gauge] GPU Memory
        - used / total
    - Count
        - request count, inference count, execution count
    - Latency
        - request time = queue time + compute time
        - Example 1: a single image inference
| time | us |
|------|----|
| request_duration_us | 2042477 |
| queue_duration_us | 1534 |
| compute_input_duration_us | 932 |
| compute_infer_duration_us | 2039865 |
| compute_output_duration_us | 33 |
            - queue time = 1534
            - compute time = 932 + 2039865 + 33 = 2040830
            - request time = 1534 + 2040830 = 2042364 (reported: 2042477)
        - Example 2: 1000 image inferences
| time | us |
|------|----|
| request_duration_us | 20403631 |
| queue_duration_us | 3451709 |
| compute_input_duration_us | 660467 |
| compute_infer_duration_us | 16000553 |
| compute_output_duration_us | 43172 |
    - Regex for filtering the data
      find:
```regexp
^[\s|\d]*nv_inference_(\w+){.+}\s*(\d+).*
```
replace:
```
\1: \2
```
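One way to apply the same filter straight from the shell (a sketch; the sed pattern mirrors the regex above and, like it, only keeps the integer part of each value):
```bash=
curl -s 10.78.26.241:8002/metrics \
  | sed -nE 's/^[[:space:]0-9]*nv_inference_([A-Za-z_]+)[{].+[}][[:space:]]*([0-9]+).*/\1: \2/p'
```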
<br>
### List the models on the Triton Server
```bash=
$ curl -v -X POST $URL/v2/repository/index | jq
```
- Result
:::spoiler `curl -v -X POST $URL/v2/repository/index | jq`
```=
* Trying 10.78.26.241...
> POST /v2/repository/index HTTP/1.1
> Host: 10.78.26.241:8000
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< Content-Length: 98
<
* Connection #0 to host 10.78.26.241 left intact
[
{
"name": "add_sub",
"version": "1",
"state": "READY"
},
{
"name": "resnet",
"version": "1",
"state": "READY"
}
]
```
:::
- Notes
    - A model shows up in the index even if it has not been loaded into Triton yet
> even if it is not currently loaded into Triton
<br>
### [Model management] Dynamically load / unload models
- ### [API reference] [server/docs/protocol/extension_model_repository.md](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_model_repository.md)
  > allows a client to query and control the one or more model repositories being served by Triton
```
POST v2/repository/index
POST v2/repository/models/${MODEL_NAME}/load
POST v2/repository/models/${MODEL_NAME}/unload
```
- Results
:::warning
:warning: **Summary: state changes**
The add_sub model files were pre-placed under the model root directory
- **STEP 1: index**
```[{"name":"add_sub"}]```
- **STEP 2: load**
- **STEP 3: index**
```[{"name":"add_sub","version":"1","state":"READY"}]```
- **STEP 4: unload**
- **STEP 5: index**
```[{"name":"add_sub","version":"1","state":"UNAVAILABLE","reason":"unloaded"}]```
- **STEP 6: load**
- **STEP 7: index**
```[{"name":"add_sub","version":"1","state":"READY"}]```
:::
:::spoiler `curl -v -X POST $URL/v2/repository/index`
```
* Trying 10.78.26.241...
* Connected to 10.78.26.241 (10.78.26.241) port 9000 (#0)
> POST /v2/repository/index HTTP/1.1
> Host: 10.78.26.241:9000
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< Content-Length: 76
<
* Connection #0 to host 10.78.26.241 left intact
[{"name":"add_sub"}]
```
:::
:::spoiler `curl -v -X POST $URL/v2/repository/models/add_sub/load`
```
* Trying 10.78.26.241...
* Connected to 10.78.26.241 (10.78.26.241) port 9000 (#0)
> POST /v2/repository/models/add_sub/load HTTP/1.1
> Host: 10.78.26.241:9000
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< Content-Length: 0
<
* Connection #0 to host 10.78.26.241 left intact
```
:::
:::spoiler `curl -v -X POST $URL/v2/repository/index`
```
* Trying 10.78.26.241...
* Connected to 10.78.26.241 (10.78.26.241) port 9000 (#0)
> POST /v2/repository/index HTTP/1.1
> Host: 10.78.26.241:9000
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< Content-Length: 50
<
* Connection #0 to host 10.78.26.241 left intact
[{"name":"add_sub","version":"1","state":"READY"}]
```
:::
:::spoiler `curl -v -X POST $URL/v2/repository/models/add_sub/unload`
```
* Trying 10.78.26.241...
* Connected to 10.78.26.241 (10.78.26.241) port 9000 (#0)
> POST /v2/repository/models/add_sub/unload HTTP/1.1
> Host: 10.78.26.241:9000
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< Content-Length: 0
<
* Connection #0 to host 10.78.26.241 left intact
```
:::
:::spoiler `curl -v -X POST $URL/v2/repository/index`
```
* Trying 10.78.26.241...
* Connected to 10.78.26.241 (10.78.26.241) port 9000 (#0)
> POST /v2/repository/index HTTP/1.1
> Host: 10.78.26.241:9000
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< Content-Length: 76
<
* Connection #0 to host 10.78.26.241 left intact
[{"name":"add_sub","version":"1","state":"UNAVAILABLE","reason":"unloaded"}]
```
:::
- Notes
    - After running unload, querying the current model config returns an error
:::spoiler `curl -v $URL/v2/models/add_sub/config`
```
{"error":"Request for unknown model: 'add_sub' has no available versions"}
```
:::
<br>
    - :warning: If the model opens files (even when nothing has changed)
      and load is run without running unload first,
      that model hangs for good
<br>
    - Other cases where the model fails to load, cause unknown
:::spoiler Server log
```
python.cc:1489] TRITONBACKEND_ModelInstanceInitialize: add_sub_0 (CPU device 0)
python.cc:600] Stub process successfully restarted.
python.cc:373] Failed to process the batch of requests.
python.cc:603] Stub process failed to restart. Your future requests to model add_sub_0 will fail. Error: Failed to initialize stub, stub process exited unexpectedly: add_sub_0
```
:::
:::spoiler `! curl -v -X POST $URL/v2/repository/models/add_sub/load`
```json=
{"error":"load failed for model 'add_sub'
: version 1
: Internal
: Unable to initialize shared memory key '/add_sub_0_CPU_0' to requested size (67108864 bytes).
If you are running Triton inside docker,
use '--shm-size' flag to control the shared memory region size.
Each Python backend model instance requires at least 64MBs of shared memory.
Flag '--shm-size=5G' should be sufficient for common usecases.
Error: No such file or directory;\n"}
```
:::
- ### [Parameter reference] [server/docs/model_management.md](https://github.com/triton-inference-server/server/blob/main/docs/model_management.md)
    - ### [Model Control Mode NONE](https://github.com/triton-inference-server/server/blob/main/docs/model_management.md#model-control-mode-none) (**fixed once started**)
      `--model-control-mode=none`
        - At startup, the server loads all models by default
        - The model repository cannot be changed afterwards
        - The load/unload APIs return an error
          `{"error":"explicit model load / unload is not allowed if polling is enabled"}`
- [Error when load model with http api #2633](https://github.com/triton-inference-server/server/issues/2633)
    - ### [Model Control Mode EXPLICIT](https://github.com/triton-inference-server/server/blob/main/docs/model_management.md#model-control-mode-explicit) (**everything via the API**)
      `--model-control-mode=explicit`
        - The server does not load any model automatically
        - `--load-model` on the command line specifies models to preload
        - Loading / unloading is done explicitly through the API
    - ### [Model Control Mode POLL](https://github.com/triton-inference-server/server/blob/main/docs/model_management.md#model-control-mode-poll) (**periodic polling**; see the sketch after this list)
      ```
      --model-control-mode=poll
      --repository-poll-secs=5
      ```
        - At startup, the server loads all models by default
        - Newly added models are picked up automatically on the next scan
        - The scan interval is set by `--repository-poll-secs`, in seconds
    - ### Triton Server launch options
- `--model-control-mode <string>`
> Specify the mode for model management. Options are "none", "poll" and "explicit". The default is "none".
>
> For "none", the server will load all models in the model repository(s) at startup and will not make any changes to the load models after that.
>
> For "poll", the server will poll the model repository(s) to detect changes and will load/unload models based on those changes. The poll rate is controlled by 'repository-poll-secs'.
>
> For "explicit", model load and unload is initiated by using the model control APIs, and only models specified with --load-model will be loaded at startup.
- `--repository-poll-secs <integer>`
> Interval in seconds between each poll of the model repository to check for changes. Valid only when --model-control-mode=poll is specified.
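For reference, a minimal poll-mode launch might look like the following (a sketch reusing the image, port mapping, and model mount from the basic command above; adjust the paths to your environment):
```bash=
export models=/home/diatango_lin/tj_tsai/workspace/infra/triton_server/models
docker run --gpus=1 --rm \
    -p 9000:8000 -p 9001:8001 -p 9002:8002 \
    -v $models:/models \
    nvcr.io/nvidia/tritonserver:21.06-py3 \
    tritonserver \
        --model-store=/models \
        --model-control-mode=poll \
        --repository-poll-secs=5
```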
<br>
<hr>
<br>
## ==Model configuration==
### Multi-model, multi-version layout
```
models/
└── add_sub
├── 1
│ ├── model.py
│ └── __pycache__
│ └── model.cpython-38.pyc
├── 2
│ ├── model.py
│ └── __pycache__
│ └── model.cpython-38.pyc
├── 3
│ ├── model.py
│ └── __pycache__
│ └── model.cpython-38.pyc
└── config.pbtxt
```
- Model Repository
- [server/docs/model_repository.md](https://github.com/triton-inference-server/server/blob/main/docs/model_repository.md)
- Model Configuration ([Chinese translation 1](https://blog.csdn.net/sophia_xw/article/details/107009697), [Chinese translation 2](https://www.twblogs.net/a/5eff6fb041b2036d50959650))
- [server/docs/model_configuration.md](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#version-policy)
- [Minimal Model Configuration](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#minimal-model-configuration)
- [Datatypes](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#datatypes)
- [Version Policy](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#version-policy)
    - [Instance Groups](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#instance-groups)
- [Scheduling And Batching](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#scheduling-and-batching)
- [Warmup](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#model-warmup)
<br>
### The config can be generated automatically
- [Auto-Generated Model Configuration](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#auto-generated-model-configuration)
- Example
```
$ curl -v $URL/v2/models/resnet18_onnx/versions/1/config | jq
```
<br>
### A model with multiple versions
> #ModelVersionPolicy
- [Version Policy](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#version-policy)
- **all**
`version_policy: { all { }}`
- **specific**
`version_policy: { specific { versions: [2] }}`
        - With v1, v2, v3 present, only v2 will be loaded
- **latest**
`version_policy: { latest { num_versions : 2 }}`
        - With v1, v2, v3 present, only the newest two (v2 & v3) will be loaded (see the check below)
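To verify which versions actually got loaded after changing `version_policy`, each version's metadata endpoint can be probed (a sketch using the `add_sub` model and the `$URL` convention from the curl section; with `latest { num_versions: 2 }` and versions 1–3 present, only v2 and v3 should answer):
```bash=
# A loaded version returns its metadata; a version excluded by the policy returns an error
curl -s $URL/v2/models/add_sub/versions/3 | jq   # expected to succeed
curl -s $URL/v2/models/add_sub/versions/1 | jq   # expected to fail, since v1 is not loaded
```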
<br>
### A model with multiple instances
- ### [Instance Groups](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#instance-groups)
```json=
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [ 0 ]
},
{
count: 2
kind: KIND_GPU
gpus: [ 1, 2 ]
}
]
```
    - One execution instance on GPU 0
    - Two execution instances on GPUs 1 and 2
- ### [Concurrent Model Execution](https://github.com/triton-inference-server/server/blob/main/docs/architecture.md#models-and-schedulers)
- ### Single instance

- ### Three instances

<br>
### Batching
- [[Model Configuration] Maximum Batch Size](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#maximum-batch-size)
```
max_batch_size: 8
```
    - For a model that does not support batching, `max_batch_size` must be set to 0
    - `max_batch_size: 1` means each inference execution handles at most 1 request
    - `max_batch_size: 2` means each inference execution can handle up to 2 requests at once
- [[Model Configuration] Scheduling And Batching](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#scheduling-and-batching)
```
dynamic_batching {
preferred_batch_size: [ 4, 8 ]
max_queue_delay_microseconds: 100
}
```
- [[Nvidia][Performance] Batching techniques: Dynamic Batching](https://ngc.nvidia.com/catalog/resources/nvidia:jasper_for_trtis/performance)


        - [Drawback] At low numbers of concurrent requests, the increased throughput comes at the cost of increased latency, because requests are queued for up to max_queue_delay_microseconds.
- ### resnet18_onnx experiment
- max_batch_size : 4 or 8 or 12
- ### Input & output tensor dimensions
> [Maximum Batch Size](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#maximum-batch-size)
> > If the model's batch dimension is the first dimension, and all inputs and outputs to the model have this batch dimension, then Triton can use its **dynamic batcher** or **sequence batcher** to automatically use batching with the model.
> >
> > In this case ++**`max_batch_size` should be set to a value greater-or-equal-to 1**++ that indicates the maximum batch size that Triton should use with the model.
>
> keywords: #max_batch_size
- `max_batch_size` > 0
```json
platform: "tensorrt_plan"
max_batch_size: 8
input [
{
name: "input0"
data_type: TYPE_FP32
dims: [ 16 ]
}
]
output [
{
name: "output0"
data_type: TYPE_FP32
dims: [ 4 ]
}
]
```
        - input0.shape is [ -1, 16 ]
        - output0.shape is [ -1, 4 ]
- `max_batch_size` = 0
```
platform: "tensorrt_plan"
max_batch_size: 0
input [
{
name: "input0"
data_type: TYPE_FP32
dims: [ 16 ]
}
]
output [
{
name: "output0"
data_type: TYPE_FP32
dims: [ 4 ]
}
]
```
        - input0.shape is [ 16 ]
        - output0.shape is [ 4 ]
- POST `v2/repository/models/resnet18_onnx/load`
> {
> "error":"load failed for model 'resnet18_onnx'
> : version 1
> : Invalid argument
> : model 'resnet18_onnx',
>
> tensor 'data': for the model to support batching the shape should have at least 1 dimension and the first dimension must be -1; but shape expected by the model is [1,3,224,224];\n"}
- [debug] GET `v2/models/resnet18_onnx/config`
> {
> "error": "Request for unknown model: 'resnet18_onnx' has no available versions"
> }
- [debug] POST `v2/repository/index`
> [{
> "name": "resnet18_onnx",
> "version": "1",
> "state": "UNAVAILABLE",
> "reason": "Invalid argument: model 'resnet18_onnx', tensor 'data': for the model to support batching the shape should have at least 1 dimension and the first dimension must be -1; but shape expected by the model is [1,3,224,224]"
> }]
- ### inception_graphdef
> [Batching techniques: Dynamic Batching](https://ngc.nvidia.com/catalog/resources/nvidia:jasper_for_trtis/performance)
> You can set the Dynamic Batcher parameter max_queue_delay_microseconds to indicate the maximum amount of time you are willing to wait and preferred_batch_size to indicate your maximum server batch size in the Triton Inference Server model config.
    - ### Summary table
**batch_size**: `preferred_batch_size`
**delay**: `max_queue_delay_microseconds`
**infer**: `compute_infer_duration_us`
| batch_size | delay(ms) | exec_count | infer_duration(ms) |
| --------------| ----:| ----:| ---------:|
| 1 | 0 | 1000 | 11613.672 |
| 1 | 300 | 1000 | 14089.080 |
| 1,2,3,4,5,6,7 | 3 | 864 | 22666.319 |
| 1,2,3,4,5,6,7 | 30 | 786 | 18522.102 |
| 1,2,3,4,5,6,7 | 300 | 851 | 13347.668 |
| 1,2,3,4,5,6,7 | 900 | 833 | 14471.256 |
| 1,2,3,4,5,6,7 | 1500 | 823 | 14108.980 |
| 16,32,64 | 3 | 828 | 17609.888 |
| 16,32,64 | 30 | 796 | 16468.601 |
| 16,32,64 | 300 | 773 | 17444.561 |
| 16,32,64 | 3000 | 762 | 16688.968 |
| 16,32,64 |30000 | 485 | 19989.942 |
| 128, 192, 256 | 3 | 833 | 17901.207 |
| 128, 192, 256 | 30 | 774 | 15446.807 |
| 128, 192, 256 | 300 | 843 | 16350.169 |
| 128, 192, 256 | 900 | 822 | 13331.119 |
| 128, 192, 256 | 1500 | 112 | 4801102.823 (abnormal) |
| 128, 192, 256 | 1500 | 795 | 17051.370 |
| 1000 | 30000 | 485 | 19989.942 |
| 1024 | 30000 | 467 | 19315.535 |
    - Occasionally a metric is anomalous; one such row is included above
        - exec_count can even exceed request_success
    - batch_size is still limited by GPU memory
> 2021-08-26 08:52:17.358634: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at conv_ops.cc:504 : **Resource exhausted: OOM** when allocating tensor with **shape[1000,64,147,147]** and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
    - For 1000 requests, just uploading the images already takes 21.623 s
    - Changing `max_batch_size` from 128 to 0 produces an error
```
inference failed: 2 root error(s) found.
(0) Invalid argument: transpose expects a vector of size 3. But input(1) is a vector of size 4
[[{{node InceptionV3/InceptionV3/Conv2d_1a_3x3/BatchNorm/batchnorm/mul-0-TransposeNHWCToNCHW-LayoutOptimizer}}]]
(1) Invalid argument: transpose expects a vector of size 3. But input(1) is a vector of size 4
[[{{node InceptionV3/InceptionV3/Conv2d_1a_3x3/BatchNorm/batchnorm/mul-0-TransposeNHWCToNCHW-LayoutOptimizer}}]]
[[InceptionV3/Predictions/Softmax/_3]]
0 successful operations.
0 derived errors ignored.
```
    - Changing `max_batch_size` from 0 back to 1 works normally again
```
Request 1, batch size 1
0.826453 (505) = COFFEE MUG
PASS
```
    - `preferred_batch_size` must be a positive integer
> **Server log:**
> dynamic batching preferred size must be positive for inception_graphdef
    - Even with the delay set to 0, batching still takes effect
```
max_batch_size: 1
dynamic_batching {
preferred_batch_size: [ 1]
max_queue_delay_microseconds: 0
}
```
- ### max_batch_size: 128, delay=3ms (3000µs)
```
max_batch_size: 128
dynamic_batching {
preferred_batch_size: [ 1,2,4,8,16,32,64,128 ]
max_queue_delay_microseconds: 3000
}
```
```
# Run 1 (warm-up to avoid a cold start) (runs 1 and 2 look the same, though)
request_success: 1
request_failure: 0
count: 1
exec_count: 1
request_duration_us: 2090404
queue_duration_us: 919
compute_input_duration_us: 306
compute_infer_duration_us: 2089016
compute_output_duration_us: 35
elapsed_time: 0.13400506973266602
====================================
# Run 2
request_success: 2
request_failure: 0
count: 2
exec_count: 2
request_duration_us: 2119275
queue_duration_us: 1012
compute_input_duration_us: 616
compute_infer_duration_us: 2117238
compute_output_duration_us: 72
elapsed_time: 0.1343550682067871
====================================
# Run 1002 (1000 runs in this batch)
request_success: 1002
request_failure: 0
count: 1002
exec_count: 884
request_duration_us: 17447609
queue_duration_us: 2441652
compute_input_duration_us: 411897
compute_infer_duration_us: 14360615
compute_output_duration_us: 40784
elapsed_time: 27.938815116882324
```
    - A single inference took about 2 s (2090404 µs);
      shouldn't 884 executions then need at least ~1768 s? Why is the cumulative compute_infer only ~14.4 s?
    - `elapsed_time` is noticeably larger than `request_duration_us`;
      presumably uploading the image to the Triton Server is the costly part
    - 1000 client-side runs take about 30.17 s on average over five attempts
        - (27.939 + 28.193 + 29.293 + 32.706 + 32.719) / 5
    - Perceived latency on the client side: about 30.17 ms per inference
- ### max_batch_size: 128, delay=30ms (30000µs)
```
# Run 1002 (1000 runs in this batch)
request_success: 1002
request_failure: 0
count: 1002
exec_count: 852
request_duration_us: 22679764
queue_duration_us: 4400009
compute_input_duration_us: 620710
compute_infer_duration_us: 17345940
compute_output_duration_us: 46439
elapsed time: 30.09218406677246
```
- ### max_batch_size: 128, delay=300ms (300000µs)
```
# Run 1 (warm-up to avoid a cold start) (runs 1 and 2 look the same, though)
request_success: 1
request_failure: 0
count: 1
exec_count: 1
request_duration_us: 2004857
queue_duration_us: 1541
compute_input_duration_us: 798
compute_infer_duration_us: 2002351
compute_output_duration_us: 28
elapsed time: 0.1342918872833252
====================================
# Run 2
request_success: 2
request_failure: 0
count: 2
exec_count: 2
request_duration_us: 2034243
queue_duration_us: 1621
compute_input_duration_us: 1143
compute_infer_duration_us: 2031084
compute_output_duration_us: 64
elapsed time: 0.1337299346923828
====================================
# Run 1002 (1000 runs in this batch)
request_success: 1002
request_failure: 0
count: 1002
exec_count: 880
request_duration_us: 20369753
queue_duration_us: 3527719
compute_input_duration_us: 510323
compute_infer_duration_us: 16011901
compute_output_duration_us: 44640
elapsed time: 29.128565073013306
```
    - 1000 client-side runs take about 31.1 s on average over five attempts
        - (29.129 + 28.715 + 31.847 + 32.209 + 33.770) / 5
    - Perceived latency on the client side: about 31.1 ms per inference
- ### max_batch_size: 1
```
max_batch_size: 1
dynamic_batching {
preferred_batch_size: [ 1,2,4,8,16,32,64,128 ]
max_queue_delay_microseconds: 30000
}
```
    - The server log looks like this
      > Poll failed for model directory 'inception_graphdef': dynamic batching preferred size must be <= max batch size for inception_graphdef
    - The message on the client side
      > {"error":"failed to load 'inception_graphdef', failed to poll from model repository"}
    - Metrics for a normal inference run
```
request_success: 1002
request_failure: 0
count: 1002
exec_count: 1002 <--- same as request_success
request_duration_us: 21850290
queue_duration_us: 7397434
compute_input_duration_us: 310838
compute_infer_duration_us: 13945969
compute_output_duration_us: 36390
```
    - 1000 client-side runs take about 30.3 s on average over five attempts
        - (27.931 + 28.231 + 31.997 + 31.697 + 31.685) / 5
    - Perceived latency on the client side: about 30.3 ms per inference
### Warmup configuration
> doc: [Warmup](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#model-warmup)
- [Example?](https://github.com/triton-inference-server/server/issues/708)
```json
model_warmup {
name: "4x4 Warmup"
batch_size: 1
inputs: {
key: "input"
value: {
data_type: TYPE_FP32
dims: 1024
dims: 1024
dims: 3
zero_data: true
}
}
}
```
<br>
<hr>
<br>
## App / [add_sub](https://github.com/triton-inference-server/python_backend/tree/main/examples/add_sub)
> - X = [x1, x2, x3, x4] (4 integers)
> - Y = [y1, y2, y3, y4] (4 integers)
> - Compute X + Y
> - Compute X - Y
- [config.pbtxt](https://github.com/triton-inference-server/python_backend/blob/main/examples/add_sub/config.pbtxt)
- [model.py](https://github.com/triton-inference-server/python_backend/blob/main/examples/add_sub/model.py)
- [client.py](https://github.com/triton-inference-server/python_backend/blob/main/examples/add_sub/client.py)
<br>
### [Model management][explicit] Loading multiple versions of a model
> `--model-control-mode=explicit`
<!--
1. `version_policy: { all { }}`: with 3 versions present
    - Running load right away hangs
    - Conclusion:
        - Run unload first, then load
-->
1. ### Every reload (unload + load) runs into insufficient shared memory

> Server log:
>
> E0819 10:49:44.820995 1 model_repository_manager.cc:1215] failed to load 'add_sub' version 2: Internal: Unable to initialize shared memory key '/add_sub_0_CPU_0' to requested size (67108864 bytes).
>
> If you are running Triton inside docker, use '`--shm-size`' flag to control the shared memory region size.
>
> Each Python backend model instance requires ==**at least 64MBs of shared memory**==.
>
> Flag ==**'`--shm-size=5G`'**== should be sufficient for common usecases. Error: No such file or directory
>
> I0819 10:49:44.923932 1 python.cc:1489] TRITONBACKEND_ModelInstanceInitialize: add_sub_0 (CPU device 0)
    - Note: `--shm-size` is a Docker option, not a Triton option
2. ### The reload ultimately never succeeded
```json
[
{
"name": "add_sub",
"version": "1",
"state": "UNLOADING" <--- unload 沒成功
},
{
"name": "add_sub",
"version": "2",
"state": "READY"
},
{
"name": "add_sub",
"version": "3",
"state": "READY"
}
]
```
<br>
### Inspect the model configuration of a specific version
```bash=
# latest version?
$ curl -v $URL/v2/models/add_sub/config | jq
# summary
$ curl -v $URL/v2/models/add_sub/versions/1 | jq
# full details
$ curl -v $URL/v2/models/add_sub/versions/1/config | jq
```
- Results
:::spoiler `$ curl -v $URL/v2/models/add_sub/versions/1 | jq`
```json=
{
"name": "add_sub",
"versions": [
"1"
],
"platform": "python",
"inputs": [
{
"name": "INPUT0",
"datatype": "FP32",
"shape": [
4
]
},
{
"name": "INPUT1",
"datatype": "FP32",
"shape": [
4
]
}
],
"outputs": [
{
"name": "OUTPUT0",
"datatype": "FP32",
"shape": [
4
]
},
{
"name": "OUTPUT1",
"datatype": "FP32",
"shape": [
4
]
}
]
}
```
:::
:::spoiler `$ curl -v $URL/v2/models/add_sub/versions/1/config | jq`
```json=
{
"name": "add_sub",
"platform": "",
"backend": "python",
"version_policy": {
"latest": {
"num_versions": 1
}
},
"max_batch_size": 0,
"input": [
{
"name": "INPUT0",
"data_type": "TYPE_FP32",
"format": "FORMAT_NONE",
"dims": [
4
],
"is_shape_tensor": false,
"allow_ragged_batch": false
},
{
"name": "INPUT1",
"data_type": "TYPE_FP32",
"format": "FORMAT_NONE",
"dims": [
4
],
"is_shape_tensor": false,
"allow_ragged_batch": false
}
],
"output": [
{
"name": "OUTPUT0",
"data_type": "TYPE_FP32",
"dims": [
4
],
"label_filename": "",
"is_shape_tensor": false
},
{
"name": "OUTPUT1",
"data_type": "TYPE_FP32",
"dims": [
4
],
"label_filename": "",
"is_shape_tensor": false
}
],
"batch_input": [],
"batch_output": [],
"optimization": {
"priority": "PRIORITY_DEFAULT",
"input_pinned_memory": {
"enable": true
},
"output_pinned_memory": {
"enable": true
},
"gather_kernel_buffer_threshold": 0,
"eager_batching": false
},
"instance_group": [
{
"name": "add_sub_0",
"kind": "KIND_CPU",
"count": 1,
"gpus": [],
"secondary_devices": [],
"profile": [],
"passive": false,
"host_policy": ""
}
],
"default_model_filename": "",
"cc_model_filenames": {},
"metric_tags": {},
"parameters": {},
"model_warmup": []
}
```
:::
- References
- [server/docs/protocol/extension_model_configuration.md](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_model_configuration.md)
<br>
### Send an add_sub request
```bash=
$ curl -H 'Content-Type: application/json' \
--request POST \
--data '{"inputs":[{"name":"INPUT0","shape":[4], "datatype":"FP32","data":[1,2,3,4]}, {"name":"INPUT1","shape":[4],"datatype":"FP32","data":[1,1,1,1]}]}' \
http://localhost:9000/v2/models/add_sub/infer | jq
```
- ### The `--data` payload
```json=
"inputs": [
{
"name": "INPUT0",
"shape": [4],
"datatype": "FP32",
"data":[1,2,3,4]
}, {
"name":"INPUT1",
"shape":[4],
"datatype":"FP32",
"data":[1,1,1,1]
}
]
```
- ### References
- [model.py](https://github.com/triton-inference-server/python_backend/blob/main/examples/add_sub/model.py)
- [client.py](https://github.com/triton-inference-server/python_backend/blob/main/examples/add_sub/client.py)
- [[Classification Extension] HTTP/REST
](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_classification.md)
<br>
### [Statistics] View this app's metrics
```bash=
# metrics for all versions
$ curl -v $URL/v2/models/add_sub/stats | jq
# version 1 only
$ curl -v $URL/v2/models/add_sub/versions/1/stats | jq
# version 2 only
$ curl -v $URL/v2/models/add_sub/versions/2/stats | jq
```
<br>
<hr>
<br>
## App / Images test
### Requirements
```bash=
pip3 install numpy
pip3 install Pillow
pip3 install attrdict
pip install nvidia-pyindex
pip3 install tritonclient[all]
```
<br>
### onnx / models
- [resnet / model](https://github.com/onnx/models/tree/master/vision/classification/resnet/model)
<br>
### Python-based requests
[image_client.py](https://github.com/triton-inference-server/client/blob/main/src/python/examples/image_client.py)
```bash=
$ python3.6 image_client.py \
-m resnet18_onnx \
-s INCEPTION \
images/mug.jpg \
-x 1
Request 1, batch size 1
16.953758 (504) = n03063599 coffee mug
PASS
$ python3.6 image_client.py \
-m inception_graphdef \
-s INCEPTION \
images/mug.jpg \
-x 2 \
-u localhost:9000
Request 1, batch size 1
0.826453 (505) = COFFEE MUG
PASS
```
- `-m` selects the model
- `-s` selects the image scaling / preprocessing mode
- `-x` selects the model version
    - `-x 1`: /models/resnet18_onnx/1
    - `-x 2`: /models/resnet18_onnx/2
- `-u localhost:9000` points the client at a different endpoint
### Batching?
- [Scheduling And Batching](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#scheduling-and-batching)
<br>
<hr>
<br>
## Sending requests endlessly
> Tested against the add_sub app
### Request-handling rate
|          | Time     | Requests handled |
|----------|----------|------------------|
|          | 14:45:30 | 36.5K            |
|          | 16:32:45 | 1302546          |
| **Delta** | **6435 s** | **1266046**    |
- 1266046 / 6435 = 196.7 (requests/second)
- Each request takes roughly 0.0051 s (5.1 ms)
---
### 3 processes continuously sending requests (waiting for each response)

- The 3 processes were started one after another
- Wait time: 1–3 ms (fairly stable fluctuations)
  (a plain-shell equivalent is sketched right after the spoiler below)
- :::spoiler code
```python=
for i in range(100000):
if i % 100 == 0: print(i)
!curl -H 'Content-Type: application/json' \
--request POST \
--data '{"inputs":[{"name":"INPUT0","shape":[4], "datatype":"FP32","data":[1,2,3,4]}, {"name":"INPUT1","shape":[4],"datatype":"FP32","data":[1,1,1,1]}]}' \
http://localhost:8000/v2/models/add_sub/infer \
    > /dev/null 2>&1
```
:::
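The same load pattern can be reproduced without Jupyter; a plain-shell sketch with 3 parallel workers, each sending requests synchronously (endpoint and payload copied from the cell above):
```bash=
# 3 workers; each loops, sending one request at a time and waiting for the response
for p in 1 2 3; do
  (
    for i in $(seq 1 100000); do
      curl -s -H 'Content-Type: application/json' \
        --request POST \
        --data '{"inputs":[{"name":"INPUT0","shape":[4],"datatype":"FP32","data":[1,2,3,4]},{"name":"INPUT1","shape":[4],"datatype":"FP32","data":[1,1,1,1]}]}' \
        http://localhost:8000/v2/models/add_sub/infer > /dev/null
    done
  ) &
done
wait
```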
---
### 3 processes continuously sending requests (without waiting for the response)

- The 3 processes were started one after another
- Wait time: 100–200 ms (fluctuates heavily)

- :::spoiler code
```python=
cmd='''
curl -H 'Content-Type: application/json' \
--request POST \
--data '{"inputs":[{"name":"INPUT0","shape":[4], "datatype":"FP32","data":[1,2,3,4]}, {"name":"INPUT1","shape":[4],"datatype":"FP32","data":[1,1,1,1]}]}' \
http://localhost:8000/v2/models/add_sub/infer \
> /dev/null 2>&1 &
'''
import os
for i in range(1000000):
if i % 1000 == 0: print(i)
os.system(cmd)
```
:::
<br>
<hr>
<br>
## Triton Server stability notes
1. If the config or the curl command is wrong, problems show up right away
2. Some APIs stop responding, e.g. the unload API
3. The crashes seen so far happened during normal operation (fine at first, then it falls over)
<br>
<hr>
<br>
## Deploying to K8s
> #TritonServer #Kubernetes #K8s #Deploy #Endpoint #CloudInfra
- ### [Deploy NVIDIA Triton Inference Server (Automated Deployment)](https://docs.netapp.com/us-en/hci-solutions/hciaiedge_deploy_nvidia_triton_inference_server_automated_deployment.html)
- ### `pvc-triton-model-repo.yaml`
  > `kubectl create -f pvc-triton-model-repo.yaml`
```yaml=
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
name: triton-pvc
namespace: triton
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 10Gi
storageClassName: ontap-flexvol
```
- ### `triton_deployment.yaml`
> - **Service**: two of them, triton-3gpu and triton-1gpu
> - **Deployment**
```yaml=
---
apiVersion: v1
kind: Service
metadata:
labels:
app: triton-3gpu
name: triton-3gpu
namespace: triton
spec:
ports:
- name: grpc-trtis-serving
port: 8001
targetPort: 8001
- name: http-trtis-serving
port: 8000
targetPort: 8000
- name: prometheus-metrics
port: 8002
targetPort: 8002
selector:
app: triton-3gpu
type: LoadBalancer
---
apiVersion: v1
kind: Service
metadata:
labels:
app: triton-1gpu
name: triton-1gpu
namespace: triton
spec:
ports:
- name: grpc-trtis-serving
port: 8001
targetPort: 8001
- name: http-trtis-serving
port: 8000
targetPort: 8000
- name: prometheus-metrics
port: 8002
targetPort: 8002
selector:
app: triton-1gpu
type: LoadBalancer
---
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: triton-3gpu
name: triton-3gpu
namespace: triton
spec:
replicas: 1
selector:
matchLabels:
app: triton-3gpu
version: v1
template:
metadata:
labels:
app: triton-3gpu
version: v1
spec:
containers:
- image: nvcr.io/nvidia/tritonserver:20.07-v1-py3
command: ["/bin/sh", "-c"]
args: ["trtserver --model-store=/mnt/model-repo"]
imagePullPolicy: IfNotPresent
name: triton-3gpu
ports:
- containerPort: 8000
- containerPort: 8001
- containerPort: 8002
resources:
limits:
cpu: "2"
memory: 4Gi
nvidia.com/gpu: 3
requests:
cpu: "2"
memory: 4Gi
nvidia.com/gpu: 3
volumeMounts:
- name: triton-model-repo
mountPath: /mnt/model-repo
      nodeSelector:
        gpu-count: "3"
volumes:
- name: triton-model-repo
persistentVolumeClaim:
claimName: triton-pvc
---
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: triton-1gpu
name: triton-1gpu
namespace: triton
spec:
replicas: 3
selector:
matchLabels:
app: triton-1gpu
version: v1
template:
metadata:
labels:
app: triton-1gpu
version: v1
spec:
containers:
- image: nvcr.io/nvidia/tritonserver:20.07-v1-py3
command: ["/bin/sh", "-c", “sleep 1000”]
args: ["trtserver --model-store=/mnt/model-repo"]
imagePullPolicy: IfNotPresent
name: triton-1gpu
ports:
- containerPort: 8000
- containerPort: 8001
- containerPort: 8002
resources:
limits:
cpu: "2"
memory: 4Gi
nvidia.com/gpu: 1
requests:
cpu: "2"
memory: 4Gi
nvidia.com/gpu: 1
volumeMounts:
- name: triton-model-repo
          mountPath: /mnt/model-repo
      nodeSelector:
        gpu-count: "1"
volumes:
- name: triton-model-repo
persistentVolumeClaim:
claimName: triton-pvc
```
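After applying both manifests, a quick way to check the pods and find the LoadBalancer endpoints (a sketch; the namespace comes from the YAML above):
```bash=
kubectl create -f pvc-triton-model-repo.yaml
kubectl create -f triton_deployment.yaml
kubectl -n triton get pods,svc   # the Services' EXTERNAL-IP is the endpoint for ports 8000/8001/8002
```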
<br>
<hr>
<br>
## References
- ### [[Official] Quickstart](https://github.com/triton-inference-server/server/blob/main/docs/quickstart.md)
- ### [[Official] Q&A forum](https://forums.developer.nvidia.com/c/ai-data-science/deep-learning/triton-inference-server/97)
- [Rest-api/curl command for posting images to Triton Inference server](https://forums.developer.nvidia.com/c/ai-data-science/deep-learning/triton-inference-server/97)
- ### [Triton Inference Server introduction and examples](https://roychou121.github.io/2020/07/20/nvidia-triton-inference-server/)
- ### [Triton model ensembles](https://developer.nvidia.com/blog/accelerating-inference-with-triton-inference-server-and-dali/)

- ### [Scheduling And Batching](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#scheduling-and-batching)
- ### [TensorFlow BERT model config](https://blog.einstein.ai/benchmarking-tensorrt-inference-server/)
```json=
name: "bert"
platform: "tensorflow_savedmodel"
max_batch_size: 64
input {
name: "input_0"
data_type: TYPE_INT32
dims: [ -1 ]
}
output {
name: "output_0"
data_type: TYPE_FP32
dims: [ 2 ]
}
dynamic_batching {
preferred_batch_size: [ 1,2,4,8,16,32,64 ]
max_queue_delay_microseconds: 30000
}
version_policy: { latest { num_versions : 1 }}
optimization {
graph { level: 1 }
}
```
- ### [[Azure] High-performance serving with Triton Inference Server (preview)](https://docs.microsoft.com/zh-tw/azure/machine-learning/how-to-deploy-with-triton?tabs=azcli)