# NVIDIA Triton 21.08 TensorRT Memory Leak
###### tags: `NVIDIA Triton Inference Server`

**Description**: I updated my Triton server from version 21.02 to 21.08 (Docker image) to develop a complete service with BLS. Because there is not enough GPU memory to load every model at startup, I use the `load`/`unload` client API you provide to load models on demand. The same problems occur on both 21.02 and 21.08:

1. The `unload` API does not fully release model memory on any backend (PyTorch, ONNX, TensorFlow, TensorRT). The PyTorch and ONNX backends do not completely free GPU memory, but they do not leak: the maximum GPU memory usage for a given model is fixed, and unloading the model reduces GPU memory usage while still leaving a large amount occupied. The TensorFlow backend does not release memory at all (another issue reports that this is a TensorFlow bug).
2. Using the `unload`/`load` API to dynamically load a TensorRT model not only frees GPU memory incompletely but also leaks memory when the same model is reloaded. What I expect is a fixed maximum GPU memory usage when reloading the same model, just like PyTorch and ONNX. Although GPU memory usage drops when the TensorRT model is unloaded, it increases further every time the model is reloaded.

**Triton Information**
Triton server version: 21.02 / 21.08
Triton server image: I used the image you [offered](https://ngc.nvidia.com/catalog/containers/nvidia:tritonserver), and I also tried building it myself; the problem occurs either way.

## To Reproduce

Using version 21.08 as an example:

1. `docker run --rm -it --gpus all --name Triton_Server --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /home/infor/model-repository:/models nvcr.io/nvidia/tritonserver:21.08-py3 tritonserver --model-control-mode=explicit --model-repository=/models/ --strict-model-config=false --grpc-infer-allocation-pool-size=16 --log-verbose 1`
2. Use the script below (see Client Testing Script) to repeatedly load and unload each model, and use `nvidia-smi` to observe GPU memory usage (a small pynvml-based logger sketch after the Client Testing Script records the same numbers). The tested TensorRT model was converted from densenet_onnx, which is provided in the model-repository examples.

## Describe Models and Testing Result

* Initial Triton Inference Server: ![image](https://user-images.githubusercontent.com/8478501/131944365-e8662805-c586-4449-ab44-6e0b74dbb98d.png)

1. PyTorch:
    * model: [ArcFace from insightface](https://github.com/deepinsight/insightface)
    * config: see ArcFace Config below
    * result: the first run did not use `unload`, the second one did. ![mem_test py - triton_client tirton_pytorch 2021-09-03 11-33-31](https://user-images.githubusercontent.com/8478501/131947700-0766ede2-a9fd-4a9e-a728-c21c777bc653.gif)
    * log: [tritonserver_log_pytorch.log](https://github.com/triton-inference-server/server/files/7103200/tritonserver_log_pytorch.log)
2. ONNX:
    * model: densenet_onnx
    * without `unload`: ![image](https://user-images.githubusercontent.com/8478501/131945479-61eb9959-383d-4092-97e7-379c4fe91882.png)
    * with `unload`: ![mem_test py - triton_clienttirton_onnx_with_unloading_2021-09-03 11-29-40](https://user-images.githubusercontent.com/8478501/131947129-6eb90d2b-b4e1-48e8-9a48-6413ade1484e.gif)
    * log: [tritonserver_log_onnx.log](https://github.com/triton-inference-server/server/files/7103197/tritonserver_log_onnx.log)
3. TensorFlow:
    * model: Inception_graphdef
    * without `unload`: ![image](https://user-images.githubusercontent.com/8478501/131944844-f36d0db9-cd4a-47a3-9486-1d4347d71fdc.png)
    * with `unload`: ![image](https://user-images.githubusercontent.com/8478501/131944914-4d9fb366-204d-4379-95cd-e9f29fc6f2b7.png)
    * log: [tritonserver_log_tensorflow.log](https://github.com/triton-inference-server/server/files/7103196/tritonserver_log_tensorflow.log)
4. TensorRT:
    * model: densenet_trt (converted from densenet_onnx; a conversion sketch follows this list)
    * result: the first run did not use `unload`, the second one did. ![mem_test py - triton_client tirton_trt_2021-09-03 11-36-45](https://user-images.githubusercontent.com/8478501/131947753-c46f292c-15a9-496a-b199-5cc08ad28dd7.gif)
    * log: [tritonserver_log_tensorrt.log](https://github.com/triton-inference-server/server/files/7103202/tritonserver_log_tensorrt.log)
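
The report does not include the exact conversion step, so the following is only a minimal sketch of how densenet_onnx could be converted to a densenet_trt plan, assuming the TensorRT 8.0 Python API that ships in the 21.08 release; the file paths and workspace size are placeholders:

```python
# Minimal ONNX -> TensorRT plan conversion (sketch, TensorRT 8.0-era API).
# Paths below are placeholders for the Triton model repository layout.
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)


def onnx_to_plan(onnx_path, plan_path, workspace_bytes=1 << 30):
    builder = trt.Builder(TRT_LOGGER)
    # The ONNX parser requires an explicit-batch network.
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("failed to parse ONNX model")

    config = builder.create_builder_config()
    config.max_workspace_size = workspace_bytes  # TensorRT 8.0-era setting

    serialized_engine = builder.build_serialized_network(network, config)
    with open(plan_path, "wb") as f:
        f.write(serialized_engine)


if __name__ == "__main__":
    # Placeholder paths; adjust to the actual model repository.
    onnx_to_plan("densenet_onnx/1/model.onnx", "densenet_trt/1/model.plan")
```

The `trtexec` tool that ships with TensorRT can produce an equivalent plan file; either way, the resulting `model.plan` is what the densenet_trt model directory serves.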

**Expected behavior**
My final goal is that when a client requests `unload` for a model, the model is fully released from GPU memory. At a minimum, there should be no memory leak for TensorRT models.

## Client Testing Script

``` python
#!/usr/bin/env python
import argparse
import time
import sys

import tritonclient.grpc as grpcclient
from tritonclient.utils import InferenceServerException

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-v', '--verbose',
                        action="store_true",
                        required=False,
                        default=False,
                        help='Enable verbose output')
    parser.add_argument('-u', '--url',
                        type=str,
                        required=False,
                        default='localhost:8001',
                        help='Inference server URL. Default is localhost:8001.')
    parser.add_argument('-m', '--model_name',
                        type=str,
                        required=False,
                        default='preprocess_inception_ensemble',
                        help='Name of model. Default is preprocess_inception_ensemble.')
    parser.add_argument('-d', '--close_unload_model',
                        action="store_true",
                        required=False,
                        default=False,
                        help='Do not unload the model after each iteration')
    parser.add_argument('-l', '--loop',
                        type=int,
                        required=False,
                        default=10,
                        help='Iteration number')
    FLAGS = parser.parse_args()

    for i in range(FLAGS.loop):
        print("iteration: {}".format(i + 1))
        print("*" * 50)
        try:
            triton_client = grpcclient.InferenceServerClient(url=FLAGS.url,
                                                             verbose=FLAGS.verbose)
        except Exception as e:
            print("\tcontext creation failed: " + str(e))
            sys.exit(1)

        model_name = FLAGS.model_name

        # Load the model and measure how long the load takes.
        load_start = time.time()
        triton_client.load_model(model_name)
        load_end = time.time()
        print("\tLoading time: {:.2f}ms".format((load_end - load_start) * 1000))

        if not triton_client.is_model_ready(model_name):
            print('\tFAILED : Load Model')
            sys.exit(1)
        else:
            print("\tModel loading pass")

        # Make sure the model matches our requirements, and get some
        # properties of the model that we need for preprocessing
        try:
            model_metadata = triton_client.get_model_metadata(
                model_name=FLAGS.model_name, model_version="1")
            model_config = triton_client.get_model_config(
                model_name=FLAGS.model_name, model_version="1")
            # print("model config: {}".format(model_config))
            # print("model metadata: {}".format(model_metadata))
            print("\tGet config and metadata pass")
        except InferenceServerException as e:
            print("\tfailed to retrieve the metadata or config: " + str(e))
            sys.exit(1)

        if not FLAGS.close_unload_model:
            # Unload the model and measure how long the unload takes.
            unload_start = time.time()
            triton_client.unload_model(model_name)
            unload_end = time.time()
            print("\tUnloading time: {:.2f}ms".format(
                (unload_end - unload_start) * 1000))

            if triton_client.is_model_ready(model_name):
                print('\tFAILED : Unload Model')
                sys.exit(1)
            else:
                print("\tModel unloading pass")
```
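
To record the `nvidia-smi` readings automatically while the script above runs, a small host-side logger can poll GPU memory once per second. This is a minimal sketch, assuming the `pynvml` bindings (e.g. from the `nvidia-ml-py3` package) and GPU index 0:

```python
# Sketch: log GPU memory usage once per second while the client script runs.
# Assumes the pynvml bindings are installed and the models run on GPU 0.
import time

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

try:
    while True:
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print("used: {:.1f} MiB / total: {:.1f} MiB".format(
            info.used / 1024 ** 2, info.total / 1024 ** 2))
        time.sleep(1)
except KeyboardInterrupt:
    pass
finally:
    pynvml.nvmlShutdown()
```

With the behavior reported above, the logged `used` value returns to a fixed ceiling across iterations for the PyTorch and ONNX backends, but keeps stepping up on every load/unload iteration of the TensorRT model.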
platform: "pytorch_libtorch" max_batch_size : 2 input [ { name: "input__0" data_type: TYPE_FP32 format: FORMAT_NCHW dims: [ 3 , 112, 112 ] } ] output [ { name: "output__0" data_type: TYPE_FP32 dims: [ 512 ] reshape { shape: [ 1, 512 ] } } ] ``` ## Github Issue The [issue](https://github.com/triton-inference-server/server/issues/3320) on github is closed and the bug is fixed on version 21.10 or later. Here is another [issue](https://github.com/triton-inference-server/server/issues/3758)(Memory not being released after triton inference - Python) which is using Python Backend at Triton Inference Server to infer the ONNX model.