# Triton Inference Server on Gemini AIConsole

As shown in the figure below, Triton is a client-server architecture for quickly standing up an inference server, featuring multi-model management, concurrent execution, low latency, and high throughput.

![Triton Arch](https://i.imgur.com/8SnIgKS.png)

[TOC]

## Prepare model repository

:::info
You can first build this with Docker on your own machine to get familiar with how Triton Server works; see the [**official documentation**](https://github.com/triton-inference-server/server/blob/main/docs/quickstart.md#quickstart).
:::

1. Open a TensorFlow solution and train our model ([example](#Mnist-training))
![training](https://i.imgur.com/H2PcDhe.png)
2. Once the model is trained, export it to get **model.savedmodel**
![savedmodel](https://i.imgur.com/wtLcntC.png)
3. Next, put the trained model into the model repository (in this example, /home/models). Your directory structure will then look like this
:::info
Triton uses the folder name as the version number; here we deploy version **1**
:::
![model struct](https://i.imgur.com/S5ioVIc.png)
4. We are not done yet: we still have to tell Triton the deployment parameters for this model ([example](#config.pbtxt))
![config](https://i.imgur.com/sLTi1Pa.png)
5. After filling in the configuration above, save it as config.pbtxt and place it in the mnist folder; the structure now looks like this
![model struct](https://i.imgur.com/K2krGGk.png)

:::success
We have now gone from a trained model to a ready model repository.
The next step is to start Triton Server and deploy this model.
:::

## Deploy model with Triton Server on AIConsole

1. On the Container Site, choose to create a Triton service
![Create Site](https://i.imgur.com/rNNtWVP.png)
2. After entering the password, enter the tritonserver command. For this example, the command is
```shell=
tritonserver --model-store=/home/models --backend-config=tensorflow,version=2 --backend-config=tensorflow,allow-soft-placement=0 --model-control-mode=poll --repository-poll-secs=5
```
    - --model-store: location of the model repository
    - --backend-config: detailed per-backend settings; here we select TensorFlow v2
    - --model-control-mode: specifies how models are deployed; here we use the polling mechanism
    - --repository-poll-secs: poll the repository every 5 seconds
![Site Parameters](https://i.imgur.com/LryM7cU.png)
3. Open the newly created Triton server and select Service Info to see the exposed ports
> Let's hit the HTTP endpoint to check whether the server started correctly (a scripted equivalent follows this section)
```shell=
$ curl 202.168.202.41:31226/v2/health/ready -v
*   Trying 202.168.202.41:31226...
* Connected to 202.168.202.41 (202.168.202.41) port 31226 (#0)
> GET /v2/health/ready HTTP/1.1
> Host: 202.168.202.41:31226
> User-Agent: curl/7.77.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
```
> Through the API we can see that our model has been deployed
```shell=
$ curl 202.168.202.41:31226/v2/models/stats
{"model_stats":[{"name":"mnist","version":"1","last_inference":0,"inference_count":0,"execution_count":0,"inference_stats":{"success":{"count":0,"ns":0},"fail":{"count":0,"ns":0},"queue":{"count":0,"ns":0},"compute_input":{"count":0,"ns":0},"compute_infer":{"count":0,"ns":0},"compute_output":{"count":0,"ns":0}},"batch_stats":[]}]}
```
![service info](https://i.imgur.com/0CGN5c3.png)
4. Next, we can implement the client-side code and test it with an image ([example](#Mnist-triton-client))
![client](https://i.imgur.com/E0ky5AP.png)

:::success
The final output, 2, matches what we expected.
:::
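The curl checks in step 3 can also be scripted from Python with the tritonclient package. Below is a minimal sketch, assuming tritonclient is installed with its HTTP extra (`pip install tritonclient[http]`) and reusing the endpoint 202.168.202.41:31226 from the example above:

```python=
## A minimal sketch of the same health checks done with curl above,
## assuming the endpoint 202.168.202.41:31226 from step 3
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url='202.168.202.41:31226')

print(client.is_server_live())         # server process is up
print(client.is_server_ready())        # equivalent to GET /v2/health/ready
print(client.is_model_ready('mnist'))  # True once the mnist model is deployed
print(client.get_inference_statistics('mnist'))  # equivalent to /v2/models/stats
```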
## Deploy new version

:::info
To update a model, just create the corresponding version folder in the model repository and put the new model into it.
The figure below shows adding version 2 of the mnist model.
:::

![version 2](https://i.imgur.com/IlbAEVF.png)

- Through the API we can see that version 2 has been deployed
```shell=
$ curl 202.168.202.41:31226/v2/models/mnist/versions/2/stats
{"model_stats":[{"name":"mnist","version":"2","last_inference":0,"inference_count":0,"execution_count":0,"inference_stats":{"success":{"count":0,"ns":0},"fail":{"count":0,"ns":0},"queue":{"count":0,"ns":0},"compute_input":{"count":0,"ns":0},"compute_infer":{"count":0,"ns":0},"compute_output":{"count":0,"ns":0}},"batch_stats":[]}]}
```

## Examples

### Mnist training

:::info
The examples below use the tensorflow_2.4.1_gpu_gemini_2.0 Solution
:::

```python=
# Mnist training program
!pip install tensorflow_datasets

import tensorflow.compat.v2 as tf
import tensorflow_datasets as tfds
tf.enable_v2_behavior()

## Load dataset
(ds_train, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
    try_gcs=True
)

## Define image normalization function
def normalize_img(image, label):
    """Normalizes images: `uint8` -> `float32`."""
    return tf.cast(image, tf.float32) / 255., label

## Data preprocess
ds_train = ds_train.map(
    normalize_img, num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds_train = ds_train.cache()
ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples)
ds_train = ds_train.batch(128)
ds_train = ds_train.prefetch(tf.data.experimental.AUTOTUNE)

ds_test = ds_test.map(
    normalize_img, num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds_test = ds_test.batch(128)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.experimental.AUTOTUNE)

## Define model
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

## Training
model.fit(
    ds_train,
    epochs=6,
    validation_data=ds_test,
)

## Export the trained model as a SavedModel
model.save('model.savedmodel')
```
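The tensor names referenced later in config.pbtxt (flatten_input and dense_1) come from the serving signature of the exported SavedModel. A minimal sketch of how to inspect them, assuming the model.savedmodel directory produced by the training script above:

```python=
## A minimal sketch: inspect the serving signature of the exported
## SavedModel to find the tensor names used in config.pbtxt
import tensorflow.compat.v2 as tf

loaded = tf.saved_model.load('model.savedmodel')
sig = loaded.signatures['serving_default']

# Input spec: should list 'flatten_input' with shape (None, 28, 28, 1)
print(sig.structured_input_signature)
# Output spec: should list 'dense_1' with shape (None, 10)
print(sig.structured_outputs)
```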
### Mnist triton client

![mnist-2](https://i.imgur.com/rWIMSD8.png)

```python=
## Install tritonclient package (the grpc extra pulls in grpcio)
!pip install tritonclient[grpc]

## Import packages we need
from PIL import Image
import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import triton_to_np_dtype

## Load and preprocess the image: grayscale, 28x28, scaled to [0, 1],
## shaped (1, 28, 28, 1) to match the model input, and cast to FP32
img = Image.open('./2.png').convert('L')
img = img.resize((28, 28))
imgArr = np.asarray(img) / 255
imgArr = np.expand_dims(imgArr[:, :, np.newaxis], 0)
imgArr = imgArr.astype(triton_to_np_dtype('FP32'))
display(f'imgArr datatype is {imgArr.dtype}, shape is {imgArr.shape}')

## Init triton_client and define request and response objects
triton_client = grpcclient.InferenceServerClient(url='172.16.1.19:31982', verbose=0)
_input = grpcclient.InferInput('flatten_input', imgArr.shape, 'FP32')
_input.set_data_from_numpy(imgArr)
_output = grpcclient.InferRequestedOutput('dense_1')

## Send the request to version 2 of the mnist model and take the argmax
res = triton_client.infer('mnist', [_input], request_id='1', model_version='2', outputs=[_output])
np.argmax(res.as_numpy('dense_1'))
```

### config.pbtxt

```shell=
name: "mnist"
platform: "tensorflow_savedmodel"
max_batch_size: 32
input [
  {
    name: "flatten_input"
    data_type: TYPE_FP32
    format: FORMAT_NHWC
    dims: [28, 28, 1]
  }
]
output [
  {
    name: "dense_1"
    data_type: TYPE_FP32
    dims: [10]
  }
]
instance_group [
  {
    kind: KIND_GPU
    count: 2
  }
]
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [
      {
        name : "tensorrt"
        parameters { key: "precision_mode" value: "FP16" }
      }
    ]
  }
}
version_policy { latest { num_versions: 2 } }
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```
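To verify that the running server actually picked up these settings, the parsed configuration can be read back over gRPC. A minimal sketch, reusing the endpoint 172.16.1.19:31982 from the client example above:

```python=
## A minimal sketch: read the deployed configuration back from the
## running server, assuming the endpoint from the client example above
import tritonclient.grpc as grpcclient

triton_client = grpcclient.InferenceServerClient(url='172.16.1.19:31982')

# Returns the parsed config.pbtxt for the mnist model; fields such as
# max_batch_size, instance_group, and dynamic_batching should match the file
config = triton_client.get_model_config('mnist', as_json=True)
print(config)
```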