# Triton Inference Server on Gemini AIConsole
Triton is a client-server system for quickly standing up an inference server, offering multi-model management, concurrent model execution, low latency, and high throughput.

[TOC]
## Prepare model repository
:::info
You can first build Triton with Docker on your own machine to get familiar with how Triton Server works;
see the [**official docs**](https://github.com/triton-inference-server/server/blob/main/docs/quickstart.md#quickstart)
:::
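For local testing, a typical run follows the quickstart pattern below (a sketch: substitute an actual release tag for `<xx.yy>` and your own repository path):
```shell=
# HTTP on 8000, gRPC on 8001, metrics on 8002 (Triton defaults)
docker run --gpus=1 --rm \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /full/path/to/models:/models \
  nvcr.io/nvidia/tritonserver:<xx.yy>-py3 \
  tritonserver --model-repository=/models
```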
1. Open a tensorflow Solution and train our model ([example](#Mnist-training))

2. After training, export the model as **model.savedmodel**

3. Next, place the trained model into the model repository (in this example, /home/models); the directory structure will then look like the sketch below
:::info
Triton uses the folder name as the version number; version **1** is deployed here
:::
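A sketch of the expected layout under the example paths (the `saved_model.pb` file and `variables/` directory are produced by `model.save`):
```shell=
/home/models
└── mnist
    └── 1
        └── model.savedmodel
            ├── saved_model.pb
            └── variables/
```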

4. We are not done yet: we still have to tell Triton the deployment parameters for this model ([example](#config.pbtxt))

5. Once the config above is set, save it as config.pbtxt and place it under the mnist folder, giving the layout sketched below
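With the config added, the repository sketch becomes:
```shell=
/home/models
└── mnist
    ├── config.pbtxt
    └── 1
        └── model.savedmodel
            ├── saved_model.pb
            └── variables/
```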

:::success
We have now trained the model and prepared the model repository.
Next, start Triton Server to deploy this model.
:::
## Deploy model with Triton Server on AIConsole
1. On the Container Site, create a Triton service

2. After entering the password, run the tritonserver command.
For this example, the command is:
```shell=
tritonserver --model-store=/home/models \
             --backend-config=tensorflow,version=2 \
             --backend-config=tensorflow,allow-soft-placement=0 \
             --model-control-mode=poll --repository-poll-secs=5
```
- `--model-store`: location of the model repository
- `--backend-config`: fine-grained backend settings; here we select the TensorFlow v2 backend
- `--model-control-mode`: specifies how models are (un)loaded; here we use the polling mechanism
- `--repository-poll-secs`: poll the repository every 5 seconds
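In poll mode Triton rescans the repository on that interval, so models added or changed on disk are loaded without restarting the server. One way to check what the server currently sees (a sketch, using the example HTTP endpoint shown below) is the repository index API:
```shell=
# POST /v2/repository/index lists each model/version Triton sees and its state
$ curl -X POST 202.168.202.41:31226/v2/repository/index
```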

3. Open the Triton server you created and select Service Info to see the exposed ports
> Let's hit the HTTP endpoint to check whether the server started correctly
```shell=
$ curl 202.168.202.41:31226/v2/health/ready -v
* Trying 202.168.202.41:31226...
* Connected to 202.168.202.41 (202.168.202.41) port 31226 (#0)
> GET /v2/health/ready HTTP/1.1
> Host: 202.168.202.41:31226
> User-Agent: curl/7.77.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
```
> Through the API we can see that our model has been deployed
```shell=
$ curl 202.168.202.41:31226/v2/models/stats
{"model_stats":[{"name":"mnist","version":"1","last_inference":0,"inference_count":0,"execution_count":0,"inference_stats":{"success":{"count":0,"ns":0},"fail":{"count":0,"ns":0},"queue":{"count":0,"ns":0},"compute_input":{"count":0,"ns":0},"compute_infer":{"count":0,"ns":0},"compute_output":{"count":0,"ns":0}},"batch_stats":[]}]}
```

4. Next, we can implement the client-side code and test it with an image ([example](#Mnist-triton-client))

:::success
The final output, 2, matches what we expected
:::
## Deploy new version
:::info
To update a model, simply create the corresponding version folder in the model repository and put the new model inside.
The sketch below shows adding version 2 of the mnist model
:::
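A sketch of the repository after adding version 2:
```shell=
/home/models
└── mnist
    ├── config.pbtxt
    ├── 1
    │   └── model.savedmodel
    └── 2
        └── model.savedmodel
```
Because the example config sets `version_policy { latest { num_versions: 2 } }`, both versions remain available for inference.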

- Through the API we can see that version 2 has been deployed
```shell=
$ curl 202.168.202.41:31226/v2/models/mnist/versions/2/stats
{"model_stats":[{"name":"mnist","version":"2","last_inference":0,"inference_count":0,"execution_count":0,"inference_stats":{"success":{"count":0,"ns":0},"fail":{"count":0,"ns":0},"queue":{"count":0,"ns":0},"compute_input":{"count":0,"ns":0},"compute_infer":{"count":0,"ns":0},"compute_output":{"count":0,"ns":0}},"batch_stats":[]}]}
```
## Examples
### Mnist training
:::info
The following example uses the tensorflow_2.4.1_gpu_gemini_2.0 Solution
:::
```python=
# Mnist training program
!pip install tensorflow_datasets
import tensorflow.compat.v2 as tf
import tensorflow_datasets as tfds
tf.enable_v2_behavior()

## Load dataset
(ds_train, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
    try_gcs=True
)

## Define image normalization function
def normalize_img(image, label):
    """Normalizes images: `uint8` -> `float32`."""
    return tf.cast(image, tf.float32) / 255., label

## Data preprocessing
ds_train = ds_train.map(
    normalize_img, num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds_train = ds_train.cache()
ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples)
ds_train = ds_train.batch(128)
ds_train = ds_train.prefetch(tf.data.experimental.AUTOTUNE)

ds_test = ds_test.map(
    normalize_img, num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds_test = ds_test.batch(128)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.experimental.AUTOTUNE)

## Define model
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

## Training
model.fit(
    ds_train,
    epochs=6,
    validation_data=ds_test,
)

## Export the model in SavedModel format
model.save('model.savedmodel')
```
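The tensor names that config.pbtxt refers to (`flatten_input` and `dense_1` in this example) come from the exported signature; if unsure, they can be inspected with TensorFlow's `saved_model_cli`:
```shell=
$ saved_model_cli show --dir model.savedmodel \
    --tag_set serve --signature_def serving_default
```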
### Mnist triton client

```python=
## Install tritonclient package
!pip install tritonclient[grpc]
## Import packages we need
from PIL import Image
import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import triton_to_np_dtype
## Load and preprocess images
img = Image.open('./2.png').convert('L')
img = img.resize((28, 28))
imgArr = np.asarray(img)/255
imgArr = np.expand_dims(imgArr[:, :, np.newaxis], 0)
imgArr = imgArr.astype(triton_to_np_dtype('FP32'))
display(f'imgArr datatype is {imgArr.dtype}, shape is {imgArr.shape}')
## Init triton_client and define request and response object
triton_client = grpcclient.InferenceServerClient(url='172.16.1.19:31982', verbose=0)
_input = grpcclient.InferInput('flatten_input', imgArr.shape, 'FP32')
_input.set_data_from_numpy(imgArr)
_output = grpcclient.InferRequestedOutput('dense_1')
res = triton_client.infer('mnist', [_input], request_id='1', model_version='2', outputs=[_output])
np.argmax(res.as_numpy('dense_1'))
```
### config.pbtxt
```shell=
name: "mnist"
platform: "tensorflow_savedmodel"
max_batch_size: 32
input [
{
name: "flatten_input"
data_type: TYPE_FP32
format: FORMAT_NHWC
dims: [28, 28, 1]
}
]
output [
{
name: "dense_1"
data_type: TYPE_FP32
dims: [10]
}
]
instance_group [
{
kind: KIND_GPU
count: 2
}
]
optimization { execution_accelerators {
gpu_execution_accelerator : [ {
name : "tensorrt"
parameters { key: "precision_mode" value: "FP16" }}]
}}
version_policy { latest { num_versions: 2 } }
dynamic_batching {
preferred_batch_size: [ 4, 8 ]
max_queue_delay_microseconds: 100
}
```
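As a sanity check, the configuration Triton actually loaded can be fetched over the HTTP API (using the example endpoint from earlier):
```shell=
$ curl 202.168.202.41:31226/v2/models/mnist/config
```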