# Investigate how to improve EKYC API performance
## Problem:
- How to maximize GPU utilization?
+ Running a single model per GPU may be inefficient.
+ Running multiple models on a single GPU will not automatically run them concurrently to maximize GPU utilization.
- Enabling Real-Time and Batch Inference:
+ There are two types of inference. If the application must respond to the user in real time, inference must also complete in real time; because latency matters, such a request cannot sit in a queue and be batched with other requests. If there is no real-time requirement, requests can be batched together to increase GPU utilization and throughput (a minimal micro-batching sketch follows below).
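
Between these two extremes sits dynamic (micro-)batching: hold each incoming request for at most a few milliseconds so it can share one GPU call with its neighbours. The sketch below is purely illustrative and not part of the EKYC codebase; `run_model`, the batch size of 16, and the 5 ms delay are assumed values.

```python
import queue
import threading
import time

MAX_BATCH_SIZE = 16      # largest batch handed to the GPU (assumed value)
MAX_QUEUE_DELAY = 0.005  # wait at most 5 ms for more requests (assumed value)

requests = queue.Queue() # (payload, reply_queue) tuples pushed by the API layer

def run_model(batch):
    # Placeholder for the real forward pass over a list of images.
    return [f"result-for-{item}" for item in batch]

def batching_worker():
    while True:
        payload, reply = requests.get()          # block until one request arrives
        batch, replies = [payload], [reply]
        deadline = time.monotonic() + MAX_QUEUE_DELAY
        # Collect more requests until the batch is full or the delay expires.
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                payload, reply = requests.get(timeout=remaining)
            except queue.Empty:
                break
            batch.append(payload)
            replies.append(reply)
        # One GPU call serves the whole batch; each caller gets its own result back.
        for reply, result in zip(replies, run_model(batch)):
            reply.put(result)

threading.Thread(target=batching_worker, daemon=True).start()
```

Each API handler would put its payload together with a private `queue.Queue()` onto `requests` and block on that private queue for the answer; the extra latency any caller pays is bounded by `MAX_QUEUE_DELAY`. TensorRT Inference Server (Solution 1) provides this behaviour server-side via dynamic batching.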
## Solution:
1. NVIDIA TensorRT Inference Server
+ [Video demo](https://www.youtube.com/watch?v=1DUqD3zMwB4&feature=youtu.be)
+ Summary:
* Without TensorRT Inference Server (benchmark charts are shown in the video demo): performance with 1,200 image-classification requests, then with an increasing number of requests.
=> Bottleneck: 5,000 images/s
* Deployed on TensorRT Inference Server
==> Bottleneck increases to 15,000 images/s
+ How to apply TensorRT Inference Server (a model-repository sketch follows below):
* Deploying and Scaling AI Applications with the NVIDIA TensorRT Inference Server on Kubernetes: [video](https://www.youtube.com/watch?v=SekmR9YH4xQ)
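
The server addresses both problems from the section above through its per-model configuration: `instance_group` runs several copies of a model on one GPU, and `dynamic_batching` batches compatible requests on the server side. Below is a sketch of one model-repository entry; the model name `ekyc_face`, the batch sizes, and the queue delay are assumptions for illustration, not values taken from the videos.

```
models/
└── ekyc_face/
    ├── 1/
    │   └── model.plan                  # TensorRT engine for version 1
    └── config.pbtxt

# config.pbtxt (illustrative values)
name: "ekyc_face"
platform: "tensorrt_plan"
max_batch_size: 32
instance_group [
  { count: 2, kind: KIND_GPU }          # two model instances share one GPU
]
dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 5000    # wait up to 5 ms to form a batch
}
```

The server loads every model in the repository and serves them over HTTP/gRPC; the Kubernetes video then covers deploying and scaling those server instances.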
2. Leverage Redis to batch requests efficiently (a sketch of the pattern follows the links below)
+ [Video demo of the solution](https://youtu.be/1uoHYcMZ7nc)
+ GitHub example project: https://github.com/shanesoh/deploy-ml-fastapi-redis-docker
+ GitHub example 2: https://github.com/stix121/keras-rest-api
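
Both example repositories follow the same pattern, sketched below with assumed names (`image_queue`, `BATCH_SIZE`, a Keras-style `model.predict`): the web endpoint pushes each request onto a shared Redis list and polls for its result, while a separate model worker pops requests in batches and runs a single prediction per batch.

```python
import json
import time
import uuid

import numpy as np
import redis

db = redis.Redis(host="localhost", port=6379)

IMAGE_QUEUE = "image_queue"   # assumed key name, batch size and poll interval
BATCH_SIZE = 32
POLL_INTERVAL = 0.05

def enqueue_request(image: np.ndarray) -> str:
    """API side: push one request onto the shared queue and return its id."""
    request_id = str(uuid.uuid4())
    db.rpush(IMAGE_QUEUE, json.dumps({"id": request_id, "image": image.tolist()}))
    return request_id

def fetch_result(request_id: str):
    """API side: poll until the worker has stored a prediction for this id."""
    while True:
        raw = db.get(request_id)
        if raw is not None:
            db.delete(request_id)
            return json.loads(raw)
        time.sleep(POLL_INTERVAL)

def model_worker(model):
    """Worker side: pop up to BATCH_SIZE requests, predict once, store results."""
    while True:
        raw_items = db.lrange(IMAGE_QUEUE, 0, BATCH_SIZE - 1)
        if not raw_items:
            time.sleep(POLL_INTERVAL)
            continue
        db.ltrim(IMAGE_QUEUE, len(raw_items), -1)    # drop the items just taken
        payloads = [json.loads(item) for item in raw_items]
        batch = np.array([p["image"] for p in payloads], dtype="float32")
        predictions = model.predict(batch)            # one GPU call for the batch
        for payload, pred in zip(payloads, predictions):
            db.set(payload["id"], json.dumps(np.asarray(pred).tolist()))
```

In the linked examples the images are typically base64-encoded rather than serialized as JSON lists, and the worker runs in its own container so the web tier (FastAPI/Flask) and the GPU worker can be scaled independently.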
# References
1. [Easily Deploy Deep Learning Models in Production](https://medium.com/dataseries/easily-deploy-deep-learning-models-in-production-13db48071578)
2. [Deploy Machine Learning Models with Keras, FastAPI, Redis and Docker](https://medium.com/analytics-vidhya/deploy-machine-learning-models-with-keras-fastapi-redis-and-docker-4940df614ece)