# Model-Serving

###### tags: `GPU`
###### members: @李睦樂

## Repo

[multigpu-scheduler](https://github.com/nycu-caslab/multigpu-scheduler)
[tf-worker](https://github.com/nycu-caslab/tf-worker)
[main-controller](https://github.com/nycu-caslab/main-controller)

## Meeting Notes

## Slides

## Notes

* System Architecture
![](https://i.imgur.com/nloRFZj.png)

## Resources

## Related Papers

## Issues

* [ ] Main Controller
    * [ ] Scheduler
        * [ ] Scheduling policy
        * [ ] Scheduling action
            * [x] assign(device, task)
                * Assign the task to one GPU.
            * [ ] move(device1, device2)
                * Move a task from device1 to device2.
            * [ ] moveAndAssign(device1, device2, task)
                * Move a task from device1 to device2 and assign the new task to device2.
            * [ ] batchAndAssign(device, task)
                * Perform batching on the task already queued on the device and the current task.
* [ ] GPU worker
    * [x] Run VGG16
    * [x] Task Queue
    * [ ] Worker Container
    * [ ] Task Types
        * [ ] Batchdata(data1, data2)
            * Move data1 and data2 to GPU memory.
            * Batch the two inputs into one tensor.
        * [ ] Movedata(data, srcGPU)
            * Move data from srcGPU to the current GPU's memory.
            * [ ] PCIe communication
        * [ ] Loadweight(layer)
            * Check whether the weights are already on the current GPU; otherwise load them from the model metadata.
        * [x] Forward(layer, data)
            * Forward the layer.
            * [x] Ops
                * [x] Conv
                * [x] Maxpool
                * [x] FC
                * [x] Activation
* [ ] Model metadata
    * [ ] Model schema design
        * [x] Model Topology
        * [ ] Layer Latencies
        * [ ] Movement Overhead
        * [ ] Weights
* [ ] Done Queue
    * [ ] Result format
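To make the action list concrete, here is a minimal Python sketch of the four scheduling actions. The `Device` class, the queue layout, and the method bodies are hypothetical illustrations; only the signatures mirror the checklist, this is not the actual main-controller code.

```python
from collections import deque

class Device:
    """Hypothetical stand-in for a GPU worker holding a task queue."""
    def __init__(self, name):
        self.name = name
        self.queue = deque()

class Scheduler:
    """Sketch of the scheduling actions in the checklist above."""

    def assign(self, device, task):
        # Assign the task to one GPU.
        device.queue.append(task)

    def move(self, device1, device2):
        # Move the head task from device1 to device2.
        device2.queue.append(device1.queue.popleft())

    def move_and_assign(self, device1, device2, task):
        # Move device1's task to device2, then assign the new task to device2.
        self.move(device1, device2)
        self.assign(device2, task)

    def batch_and_assign(self, device, task):
        # Batch the task already queued on the device with the current task.
        queued = device.queue.popleft()
        device.queue.append(("batched", queued, task))
```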
## Weekly Progress

[12/14/2022 - 12/21/2022](https://hackmd.io/a6ziQxm5TFy_yQWkIE0etg)
[12/07/2022 - 12/14/2022](https://hackmd.io/Gd-aAPDCTb-QtzNjP0UCpw)
[11/29/2022 - 12/07/2022](https://hackmd.io/A7hToauwT3CEpWk91d_5Iw)
[11/23/2022 - 11/29/2022](https://hackmd.io/gksg2YgBQniU8xlG0SC0Rw)
[11/16/2022 - 11/23/2022](https://hackmd.io/JLsbOKTUR_W0R46Te6XoXQ)
[11/09/2022 - 11/16/2022](https://hackmd.io/WDhiAU7oTraoiyi-s9VxxQ)
[11/02/2022 - 11/09/2022](https://hackmd.io/@3JJUByncSm2SuNd2f1w5aw/Hk9Ymhurs)
[10/26/2022 - 11/02/2022](https://hackmd.io/tm201UJsQuaNYw0CITTSwg)
[10/20/2022 - 10/26/2022](https://hackmd.io/QrCegMMGS7G5TdpUchgSMQ)
[10/12/2022 - 10/19/2022](https://hackmd.io/LFy-uccJQN6ahVeS0w0FHA)

## History

* Resource utilization of different layers
    * Data: [Google Sheet](https://docs.google.com/spreadsheets/d/1UbBRH6ctvjM0pi2Iidl7ZwupaKZ8FprAH7K9weayzrc/edit#gid=1056005320)
    * Code: [Github](https://github.com/s094392/nsys_pytorch)
    * Use nsys and PyProf to monitor the resources used by every layer.
    * Hypothesis: Resource contention is the main cause of the additional latency when multiple layers run concurrently. We can reduce the latency by ensuring there is no resource contention when layers run concurrently.
    * Hypothesis: Kernel fusion can merge two kernels into one to raise GPU utilization.
        * Are there limitations, such as resource contention?
        * https://cs.nyu.edu/~lingfan/resources/batchmaker-eurosys18.pdf
* Layer-wise scheduler
    * Code: [Github](https://github.com/s094392/Pytorch-Serving)
    * Uses threads and mutexes to distribute layers concurrently across different GPUs.
    * Hypothesis: If we split an inference task into layers, the layer scheduler strongly influences the tail latency of tasks from different models running on different GPUs.
    * Benefits of layer-wise scheduling:
        * Supports preemption.
        * Easier to calculate utilization.
        * Batching.
* Latency of different models among different GPUs
    * Data: [Google Sheet](https://docs.google.com/spreadsheets/d/1BdJoBiTH_m6WmCoEA6zR2VXSDucVn_ZU7nZ79eOk-wU/edit#gid=0)
    * Insight:
        * Some models are sensitive to the GPU; some are not.
        * In RNN, the 1080 is faster than the 3080 when the batch size is 1.

## Paper structure

* Introduction
* Background
    * Data center
    * Training vs. Inference
* Motivation
    * Utilization
        * The utilization of some operations is poor.
    * Latency between GPUs
        * Some operations are sensitive to GPU computing power.
    * Task size
        * The task size could be smaller for more flexible scheduling.
    * Data movement latency
        * The data movement latency is sometimes low enough that moving a running task to another idle GPU is worthwhile.
* Architecture
    * Profiling
    * Scheduler
        * FIFS
        * Ours
    * Model metadata
* Methodology
    * TensorFlow C++
    * Container-based GPU worker and queue
* Evaluation
* Conclusion

## Template

```
## Time: 8/17/2021 - 8/14/2021
### 1. Successes Last Week
### 2. Progress and Problems Last Week
### 3. Not Resolved Problems
### 4. Goals for next week
```

## Time: 7/12/2021 - 7/19/2021

### 1. Successes Last Week

* Latency of different models among different GPUs (李睦樂)
    * ResNet, DLRM, RNN, BERT (MLPerf)
    * Extract the kernels of the models with nsys
* Watched an [online course](http://introtodeeplearning.com/2020/index.html) to learn how CNNs and RNNs work (楊秉宇)
* Learned some CUDA syntax (楊秉宇)
    * Installed PyTorch and TensorFlow and ran some basic PyTorch and TensorFlow code
* Learned the basics of machine learning: cross validation, loss functions (hinge loss, cross-entropy loss), optimization (what a gradient is and how to compute it), overfitting, and some image classification topics (KNN, etc.) (邱頎霖)
* Installed PyTorch, read some of the [official tutorials](https://pytorch.org/tutorials/) and the [official Chinese tutorials](https://pytorch.apachecn.org/docs/1.7/), and used the CIFAR10 sample code to understand how the above is implemented in PyTorch (邱頎霖)

### 2. Progress and Problems Last Week

* In RNN, the 1080 is faster than the 3080 when the batch size is 1. [name=李睦樂]
    * TensorRT seems promising.
* Learned how to implement CNNs and RNNs
* 1. My understanding of neural networks and CNNs is still insufficient; maybe I am reading too slowly. 2. I can run the sample code, but I cannot quite follow the official Chinese tutorials; I may need to read more of the PyTorch documentation next week. [name=邱頎霖]

### 3. Not Resolved Problems

* The checklist of problems you have not resolved yet.
* I have not actually implemented a CNN or RNN yet, nor have I used PyTorch or TensorFlow. (楊秉宇)

### 4. Goals for next week

* Try to use nsys to extract all kernels issued by a cuDNN example.
* Build a micro benchmark (n% of resources) out of a single CNN layer
    * Run the micro benchmarks concurrently (with my own layer-wise PyTorch scheduler) to check whether the hypothesis holds (we can reduce latency by ensuring there is no resource contention when layers run concurrently). (李睦樂)
    * Compare against the cuDNN benchmark results
        * x-axis: resource utilization (100%)
        * y-axis: additional overhead
* Organize my basic machine learning notes and post them to HackMD (batch size) (邱頎霖)
* Finish reading about neural networks and CNNs, and understand the PyTorch sample code (邱頎霖)
* Try running the CNN and RNN samples in the cuDNN sample repo and get familiar with the cuDNN API: https://github.com/Hardware-Alchemy/cuDNN-sample https://docs.nvidia.com/deeplearning/cudnn/api/index.html (楊秉宇)
* Learn nvprof and how to use it to measure every cuDNN API call and kernel (楊秉宇)
* Collect PyTorch and TensorFlow learning resources and post them to HackMD (楊秉宇)

---

## Time: 7/19/2021 - 7/26/2021

### 1. Successes Last Week

* [name=楊秉宇]
    1. [Syntax notes](https://hackmd.io/IpHknJZeQL2ZpXCVAlYEpw)
    2. Read through the PyTorch syntax tutorial (it feels like actually writing some deep learning code would be the faster way in)
* [name=邱頎霖]
    1. Continued from last week: covered neural networks, backpropagation, and CNNs, and added [notes](https://github.com/chilin0525/ML_note)
    2. Read the [PyTorch: TRAINING A CLASSIFIER](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#sphx-glr-beginner-blitz-cifar10-tutorial-py) sample code; I can roughly see how the theory is implemented in PyTorch
* [name=李睦樂]
    * Finished a conv that can run on different streams ([code](https://gist.github.com/s094392/3b51e3cf857e9ba54fc6439dc5908ddf)); a sketch follows below
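A minimal sketch of the multi-stream conv above, assuming PyTorch on a single GPU; the layer shapes are arbitrary illustrations, not the gist's actual configuration:

```python
import torch
import torch.nn as nn

# Two independent conv layers fed by separate inputs.
conv1 = nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda()
conv2 = nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda()
x1 = torch.randn(8, 64, 56, 56, device="cuda")
x2 = torch.randn(8, 64, 56, 56, device="cuda")

# Issue each conv on its own CUDA stream, then wait for both.
s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
with torch.no_grad():
    with torch.cuda.stream(s1):
        y1 = conv1(x1)
    with torch.cuda.stream(s2):
        y2 = conv2(x2)
torch.cuda.synchronize()
```

As the next section notes, profiling shows the two kernels still do not fully overlap even when issued on separate streams.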
### 2. Progress and Problems Last Week

* [name=李睦樂]
    * The PyTorch streams API does not execute kernels concurrently, while the same code in CUDA does.
        * https://github.com/pytorch/pytorch/issues/48279
        * https://developer.nvidia.com/blog/gpu-pro-tip-cuda-7-streams-simplify-concurrency/
    * Kernel executions do not fully overlap with 2 streams. [my code](https://gist.github.com/s094392/3b51e3cf857e9ba54fc6439dc5908ddf)
    ![](https://i.imgur.com/jp9O5Sk.png)
    ![](https://i.imgur.com/LpmVSpv.png)

### 3. Not Resolved Problems

* The cuDNN sample code cannot run on our machine.

### 4. Goals for next week

* [name=楊秉宇]
    * Practice writing some models in PyTorch (e.g., VGG, AlexNet, ...)
    * Add cudnnSetStream to the cuDNN code and use nvprof to check whether the two streams run concurrently on one GPU
    * Read and run DeepBench (Baidu's CNN, RNN, and GEMM benchmark written with the cuDNN and cuBLAS libraries)
    * Look into TensorRT (NVIDIA's inference API) and use nvprof to measure the hardware utilization of its API
* [name=邱頎霖]
    1. Fill in the missing parts of my notes (CNN, RNN, Transformer)
    2. Finish the PyTorch details, run some PyTorch sample code, and practice writing some models
    3. The cuDNN API the professor mentioned last week
* [name=李睦樂]
    * Keep tuning cuDNN parameters
        * RNN
        * Transformer
        * multiHeadAttention (/home/ttyeh/workspace/stream_example/multiHeadAttention)
    * Try TensorRT

## Time: 7/26/2021 - 8/3/2021

### 1. Successes Last Week

* [name=李睦樂]
    * Tuned cuDNN parameters
        * Switched to running multiple layers on one stream to try to hide the launch overhead
        * Got the layers to overlap
        ![](https://i.imgur.com/IcuabDZ.png)
        * It looks like the longer the duration, the lower the benefit from overlapping ([data](https://docs.google.com/spreadsheets/d/1JfsELv8yBYohv-NtRnrLPd53i0vAgaWho8989fBu2jo/edit#gid=0))
        * percentage: actual latency / how long the run would take without overlap
        ![](https://i.imgur.com/2L6ZhIy.png)
* [name=邱頎霖]
    * Finished the [CNN](https://github.com/chilin0525/ML_note/blob/main/ML_note/04_CNN.md) and [RNN](https://github.com/chilin0525/ML_note/blob/main/ML_note/06_RNN.md) notes
    * Implemented LeNet-5 and AlexNet in PyTorch

### 2. Progress and Problems Last Week

### 3. Not Resolved Problems

* [name=李睦樂]
    * Unclear why shrinking the kernel size increases the thread count
    ![](https://i.imgur.com/A9xnzyC.png)
    * When the thread count is tuned very small, one kernel starts much earlier than the others
    ![](https://i.imgur.com/4fCEuvB.png)

### 4. Goals for next week

* [name=邱頎霖]
    1. Read about transformers
    2. Practice with PyTorch
    3. Read about the cuDNN API and nvprof
* [name=李睦樂]
    * Study fusion
        * What limitations are there, e.g., resource limits?
        * Are the limitations predictable?
        * Which layers can be fused?

## Time: 8/3/2021 - 8/10/2021

### 1. Successes Last Week

* [name=李睦樂]
    * Read some materials about kernel fusion
        * https://pytorch.org/tutorials/intermediate/fx_conv_bn_fuser.html
        * https://github.com/pytorch/pytorch/blob/orig/release/1.8/torch/fx/experimental/fuser.py
        * https://zhuanlan.zhihu.com/p/49329030
* [name=邱頎霖]
    * Finished reading about transformers [(link)](https://github.com/chilin0525/ML_note/blob/main/ML_note/07_transformer.md)
* [name=楊秉宇] Learned [PyTorch](https://classroom.udacity.com/courses/ud188/lessons/c5706f76-0e30-4b48-b74e-c19fafc33a75/concepts/037b1900-5331-4ab0-805b-7b55b802bff7) from this course and tried to implement an NN and an RNN with it. [Some syntax notes](https://hackmd.io/IpHknJZeQL2ZpXCVAlYEpw)

### 2. Progress and Problems Last Week

* [name=李睦樂]
    * Tried module fusion in the PyTorch framework ([reference](https://pytorch.org/tutorials/recipes/fuse.html)); a usage sketch follows at the end of this section
        * PyTorch fusion does not look like it can be done dynamically at runtime
        * ResNet
        ![](https://i.imgur.com/BZKKQ4j.png) 0.00712 s
        ![](https://i.imgur.com/GQ0YcA4.png) 0.00623 s
        * [Data](https://docs.google.com/spreadsheets/d/1coFw4MgEOxrbFConOl5muuP8lceneM-_Xhf78NBPNnY/edit#gid=0)
        * Supported fusion types
            * conv, bn
            * conv, bn, relu
            * conv, relu
            * linear, relu
            * bn, relu
        * The improvement is modest
    * [TensorRT fusion](https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#fusion-types)
        * Supported fusion types
            * ReLU ReLU Activation
            * Convolution and ReLU Activation
            * Convolution and GELU Activation
            * Convolution and Clip Activation
    * Use Nsight Systems to profile the fused module
    * TensorRT references
        * https://medium.com/@abhaychaturvedi_72055/understanding-nvidias-tensorrt-for-deep-learning-model-optimization-dad3eb6b26d9
        * https://github.com/NVIDIA/TensorRT
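A usage sketch of the fusion recipe linked above, assuming a torchvision ResNet-18; `fuse_modules` only handles the supported patterns listed above and requires eval mode (BN folding changes the math, hence the tolerance in the check):

```python
import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True).eval()

# Fuse the stem's conv1/bn1/relu into a single module; the returned
# copy computes the same function with one fewer kernel per fused group.
fused = torch.quantization.fuse_modules(model, [["conv1", "bn1", "relu"]])

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    assert torch.allclose(model(x), fused(x), atol=1e-5)
```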
### 3. Not Resolved Problems

* Increasing utilization
    * Needs batching
        * The shapes have to match up
        * conv has a filter problem
        * linear works
    * Streaming
        * Kernel launch overhead

### 4. Goals for next week

## Time: 8/10/2021 - 8/17/2021

### 1. Successes Last Week

* [name=邱頎霖]
    * Read the TensorRT API and compared the inference performance of TensorRT against PyTorch and ONNX Runtime; results below ([code link](https://github.com/chilin0525/ML_note/blob/main/Practice/TensorRT/resnet.ipynb))
    * <img src="https://i.imgur.com/p6YhSem.png" width=500>
* [name=楊秉宇]
    * Read a bit of the TensorRT API and studied this [course](https://classroom.udacity.com/courses/ud188)

### 2. Progress and Problems Last Week

* [name=李睦樂]
    * TensorRT horizontal layer fusion
        * https://developer.nvidia.com/blog/deploying-deep-learning-nvidia-tensorrt/
        * https://leimao.github.io/blog/Neural-Network-1x1-Convolution-Fusion/
        * ![](https://i.imgur.com/rCSqXqu.png)
        * My own take: layers that share the same input should be fusable; it is essentially like increasing the number of filters (a numerical sketch below checks this)
        * Trying to build the example module and replace the 1x1 conv with 2x2, 3x3, 4x4, ... ![](https://i.imgur.com/VeWJ9Id.png)
        * Use TF-TRT (TensorFlow with TensorRT) to implement it
* [name=邱頎霖]
    * The ONNX Runtime results are strange; they should be faster than PyTorch and slower than TensorRT. I found a similar [issue](https://github.com/microsoft/onnxruntime/issues/2404) mentioning that ONNX Runtime still needs to copy memory to the GPU, so I re-measured everything with the memory copy to GPU included, but the results barely changed. I suspect my timing methodology is the problem.

### 3. Not Resolved Problems

* TensorFlow requires TensorRT 7.0, but the version currently available is 8.

### 4. Goals for next week

* Try TensorFlow with TensorRT
* TensorRT
    * How it optimizes a model: https://developer.nvidia.com/blog/deploying-deep-learning-nvidia-tensorrt/
        * Convolution horizontal fusion
        * https://leimao.github.io/blog/Neural-Network-1x1-Convolution-Fusion/
    * How to convert the graph
        * https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html
* Profiling
    * Profile a process: `nsys profile -f true -o net --export sqlite python fused_res.py`
    * Convert to JSON: `nsys stats -r gputrace net.qdrep -f json -o net`
    * UI: `nsys-ui` (via MobaXterm)
    * PyProf
        * https://docs.nvidia.com/deeplearning/frameworks/pyprof-user-guide/profile.html
        * https://github.com/NVIDIA/PyProf
* [name=楊秉宇] Study how to speed up computation with mathematical methods
* [name=邱頎霖]
    * Study how TensorRT optimizes models
    * [ONNC](https://github.com/ONNC/onnc)

## Time: 8/17/2021 - 8/23/2021

### 1. Successes Last Week

* [name=李睦樂]
    * Reproduced the example from the NVIDIA blog
    * ![](https://i.imgur.com/JXf8yaW.png)
* [name=邱頎霖]
    * Dumped the whole TensorRT pipeline from parsing the ONNX format to performing layer fusion: [Link](https://github.com/chilin0525/ML_note/blob/main/Practice/TensorRT/verbose)
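The "same input, more filters" intuition from last week can be checked numerically. A small PyTorch sketch (shapes are arbitrary) showing that two 1x1 convs sharing an input are equivalent to one conv whose filters are the two branches' filters stacked along the output-channel dimension:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 16, 16)

# Two branches with the same input, as in the horizontal-fusion diagram.
conv_a = nn.Conv2d(8, 4, kernel_size=1, bias=False)
conv_b = nn.Conv2d(8, 4, kernel_size=1, bias=False)

# One fused conv holding both branches' filters.
fused = nn.Conv2d(8, 8, kernel_size=1, bias=False)
with torch.no_grad():
    fused.weight.copy_(torch.cat([conv_a.weight, conv_b.weight], dim=0))

branches = torch.cat([conv_a(x), conv_b(x)], dim=1)
assert torch.allclose(fused(x), branches, atol=1e-6)
```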
### 2. Progress and Problems Last Week

* [name=李睦樂]
    * There seems to be no way to visualize a TensorRT engine (an optimized model)
    * From the log and raw files, there does not appear to be any 1x1 horizontal fusion
* [name=邱頎霖]
    * As mentioned above, there is currently no way to visualize a TensorRT engine: [Link](https://forums.developer.nvidia.com/t/visualizing-tensorrt-engine/69032)
    * TensorRT's layer-fusion machinery is not public; the GitHub repo only contains parts of it (the ONNX parser, the Caffe parser, and some plugins). For now we can only learn from the verbose output that for each layer fusion TensorRT has several candidate Tactics, benchmarks them beforehand, and picks the best one.
        * [What is the meaning of Tactic when using trtexec #52](https://github.com/NVIDIA/TensorRT/issues/52)

### 3. Not Resolved Problems

### 4. Goals for next week

* TODO:
    1. Current limitations of TensorRT and PyTorch [name=邱頎霖]
    2. Lazy batching [name=李睦樂]
    3. Measure the order of magnitude of each factor that affects latency (pick 6 models)
        * The execution time of the entire model on different GPUs [name=楊秉宇]
            * Batch benefit
        * Layers on different GPUs [name=邱頎霖]
            * Batch benefit
        * Data movement in the GPU [name=李睦樂]
    4. Settle the overall system architecture, execution model, and implementation approach [name=李睦樂]
* Pretrained models:
    * [ONNX model](https://github.com/onnx/models)
    * [Pytorch pretrained model](https://pytorch.org/vision/stable/models.html)
* Models:

| model | ONNX pretrained model | Pytorch pretrained model |
|:---:|:---:|:---:|
| Resnet50-v1 | [link](https://github.com/onnx/models/blob/master/vision/classification/resnet/model/resnet50-v1-7.onnx) | ```model = torchvision.models.resnet50(pretrained=True, progress=False).eval()``` |
| Resnet101-v1 | [link](https://github.com/onnx/models/blob/master/vision/classification/resnet/model/resnet101-v1-7.onnx) | ```model = torchvision.models.resnet101(pretrained=True, progress=False).eval()``` |
| GoogleNet | [link](https://github.com/onnx/models/blob/master/vision/classification/inception_and_googlenet/googlenet/model/googlenet-9.onnx) | ```model = torchvision.models.googlenet(pretrained=True, progress=False).eval()``` |
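For TODO item 3 (whole-model execution time on different GPUs), a minimal timing-harness sketch using CUDA events, assuming PyTorch and the torchvision models from the table above; the function name, warm-up, and iteration counts are arbitrary choices, not the tool we ended up using:

```python
import torch
import torchvision

def time_inference(model, device, iters=100, warmup=10):
    """Average single-batch inference latency in ms on one GPU."""
    model = model.to(device).eval()
    x = torch.randn(1, 3, 224, 224, device=device)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.no_grad():
        for _ in range(warmup):
            model(x)
        torch.cuda.synchronize(device)
        start.record()
        for _ in range(iters):
            model(x)
        end.record()
        torch.cuda.synchronize(device)
    return start.elapsed_time(end) / iters

model = torchvision.models.resnet50(pretrained=True, progress=False)
print(time_inference(model, "cuda:0"))  # repeat with "cuda:1" to compare GPUs
```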
## Time: 8/23/2021 - 8/30/2021

### 1. Successes Last Week

* [name=李睦樂]
    * Case study [Code](https://github.com/s094392/nsys_pytorch/blob/main/Scheduler.ipynb)
        * Uses ResNet and AlexNet
        * Measured the data-movement overhead required when switching GPUs at each layer boundary
        * ![](https://i.imgur.com/GakMZ4j.png)
        * If AlexNet starts on the slow GPU, moving it to the fast one is faster overall (the decision sketch below spells out the arithmetic)
        * ![](https://i.imgur.com/WQHuzig.png)
        * Data movement
        * ![](https://i.imgur.com/BVuLXYA.png)
        * Result
        * ![](https://i.imgur.com/bEsL0kP.png)
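The migration decision in this case study reduces to simple arithmetic. A hypothetical sketch; the per-layer latency tables, layer names, and overhead value below are toy numbers in the spirit of the AlexNet example, and would in practice come from the profiles above:

```python
def should_move(remaining_layers, lat_current, lat_target, move_overhead):
    """Return True if migrating the rest of the task to the other GPU
    pays off: remaining time on the current device must exceed the
    data-movement overhead plus remaining time on the target device.
    Latencies and overhead are in ms."""
    stay = sum(lat_current[layer] for layer in remaining_layers)
    move = move_overhead + sum(lat_target[layer] for layer in remaining_layers)
    return move < stay

remaining = ["conv4", "conv5", "fc1", "fc2", "fc3"]
slow = {l: 2.0 for l in remaining}   # ms per layer on the slow GPU
fast = {l: 0.8 for l in remaining}   # ms per layer on the fast GPU
print(should_move(remaining, slow, fast, move_overhead=3.0))  # True: 7.0 < 10.0
```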
* [name=邱頎霖]
    * Current limitations of TensorRT and PyTorch
        * TensorRT
            * Besides the optimizations discussed before (layer fusion, fixed precision, etc.), it also optimizes for the specific GPU device, so a TensorRT engine is essentially bound to the device it was built on, whereas models trained with other ML frameworks can still be used across devices
        * PyTorch
            * ONNX to PyTorch is still unsupported, although most other ML frameworks do support it: [issue](https://github.com/pytorch/pytorch/issues/21683)
    * Measuring the order of magnitude of each factor that affects latency
        * Since it is unclear whether we will use the ONNX pretrained models or the PyTorch ones, I measured both for now, though the ONNX ones seem somewhat problematic
        * Measured with Nsight Systems; the per-layer times differ from what the TensorRT verbose output reports. Nsight Systems should be the more accurate of the two
        * model time = HtoD + execution time + DtoD + DtoH

|model name|model time|gpu type|HtoD|execution time|DtoD|DtoH|runtime|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|googlenet|6.055912 ms|['GeForce RTX 2060 (0)']|2.844639 ms|3.185160 ms|0.007648 ms|0.018465 ms|Pytorch|
|resnet50|18.424824 ms|['GeForce RTX 2060 (0)']|12.564461 ms|5.834091 ms|0.007647 ms|0.018625 ms|Pytorch|
|resnet101|30.756969 ms|['GeForce RTX 2060 (0)']|20.075831 ms|10.655089 ms|0.007519 ms|0.018530 ms|Pytorch|
|googlenet-9|7.135825 ms|['GeForce RTX 2060 (0)']|5.186143 ms|1.600915 ms|0.347424 ms|0.001343 ms|TensorRT|
|resnet50-v1-7|22.124697 ms|['GeForce RTX 2060 (0)']|17.387894 ms|3.511690 ms|1.223161 ms|0.001952 ms|TensorRT|
|resnet101-v1-7|47.378596 ms|['GeForce RTX 2060 (0)']|37.223101 ms|6.692185 ms|3.461422 ms|0.001888 ms|TensorRT|
|googlenet|7.135536 ms|['GeForce RTX 2060 (0)']|5.065822 ms|1.710677 ms|0.357725 ms|0.001312 ms|TensorRT|
|resnet50|22.005743 ms|['GeForce RTX 2060 (0)']|16.928080 ms|3.795649 ms|1.280094 ms|0.001920 ms|TensorRT|
|resnet101|46.448275 ms|['GeForce RTX 2060 (0)']|35.944787 ms|6.984081 ms|3.517135 ms|0.002272 ms|TensorRT|
|googlenet|9.082463 ms|['GeForce GTX 1080 Ti (1)']|4.440801 ms|4.616384 ms|0.007360 ms|0.017918 ms|Pytorch|
|resnet50|21.44728 ms|['GeForce GTX 1080 Ti (1)']|15.858027 ms|5.563778 ms|0.007297 ms|0.018178 ms|Pytorch|
|resnet101|38.115108 ms|['GeForce GTX 1080 Ti (1)']|27.639287 ms|10.451146 ms|0.007648 ms|0.017027 ms|Pytorch|
|googlenet-9|7.940634 ms|['GeForce GTX 1080 Ti (1)']|6.486625 ms|1.082441 ms|0.370416 ms|0.001152 ms|TensorRT|
|resnet50-v1-7|22.649471 ms|['GeForce GTX 1080 Ti (1)']|19.537028 ms|2.233684 ms|0.876263 ms|0.002496 ms|TensorRT|
|resnet101-v1-7|49.527455 ms|['GeForce GTX 1080 Ti (1)']|41.803510 ms|4.868540 ms|2.853677 ms|0.001728 ms|TensorRT|
|googlenet|7.846166 ms|['GeForce GTX 1080 Ti (1)']|6.286488 ms|1.175152 ms|0.383374 ms|0.001152 ms|TensorRT|
|resnet50|22.328749 ms|['GeForce GTX 1080 Ti (1)']|18.896681 ms|2.314972 ms|1.115528 ms|0.001568 ms|TensorRT|
|resnet101|48.357397 ms|['GeForce GTX 1080 Ti (1)']|41.180663 ms|4.258505 ms|2.916661 ms|0.001568 ms|TensorRT|

* Nsight Systems details:

| model | pretrained model source | GPU | runtime | layers |
|:---|:---:|:---:|:---:|:---|
| Resnet50 | ONNX | 1080 | TensorRT | [Link](https://github.com/chilin0525/model-profiling/blob/main/inference_time/nsys/resnet50-v1-7-GTX_1080_Ti_gputrace) |
| Resnet50 | ONNX | 2060 | TensorRT | [Link](https://github.com/chilin0525/model-profiling/blob/main/inference_time/nsys/resnet50-v1-7-RTX_2060_gputrace) |
| Resnet50 | Pytorch | 1080 | TensorRT | [Link](https://github.com/chilin0525/model-profiling/blob/main/inference_time/nsys/resnet50-GTX_1080_Ti_gputrace) |
| Resnet50 | Pytorch | 2060 | TensorRT | [Link](https://github.com/chilin0525/model-profiling/blob/main/inference_time/nsys/resnet50-RTX_2060_gputrace) |
| Resnet50 | Pytorch | 1080 | Pytorch | [Link](https://github.com/chilin0525/model-profiling/blob/main/inference_time/nsys/pytorch/resnet50-GTX_1080_Ti_gputrace) |
| Resnet50 | Pytorch | 2060 | Pytorch | [Link](https://github.com/chilin0525/model-profiling/blob/main/inference_time/nsys/pytorch/resnet50-RTX_2060_gputrace) |
| Resnet101 | ONNX | 1080 | TensorRT | [Link](https://github.com/chilin0525/model-profiling/blob/main/inference_time/nsys/resnet101-v1-7-GTX_1080_Ti_gputrace) |
| Resnet101 | ONNX | 2060 | TensorRT | [Link](https://github.com/chilin0525/model-profiling/blob/main/inference_time/nsys/resnet101-v1-7-RTX_2060_gputrace) |
| Resnet101 | Pytorch | 1080 | TensorRT | [Link](https://github.com/chilin0525/model-profiling/blob/main/inference_time/nsys/resnet101-GTX_1080_Ti_gputrace) |
| Resnet101 | Pytorch | 2060 | TensorRT | [Link](https://github.com/chilin0525/model-profiling/blob/main/inference_time/nsys/resnet101-RTX_2060_gputrace) |
| Resnet101 | Pytorch | 1080 | Pytorch | [Link](https://github.com/chilin0525/model-profiling/blob/main/inference_time/nsys/pytorch/resnet101-GTX_1080_Ti_gputrace) |
| Resnet101 | Pytorch | 2060 | Pytorch | [Link](https://github.com/chilin0525/model-profiling/blob/main/inference_time/nsys/pytorch/resnet101-RTX_2060_gputrace) |
| Googlenet | ONNX | 1080 | TensorRT | [Link](https://github.com/chilin0525/model-profiling/blob/main/inference_time/nsys/googlenet-9-GTX_1080_Ti_gputrace) |
| Googlenet | ONNX | 2060 | TensorRT | [Link](https://github.com/chilin0525/model-profiling/blob/main/inference_time/nsys/googlenet-9-RTX_2060_gputrace) |
| Googlenet | Pytorch | 1080 | TensorRT | [Link](https://github.com/chilin0525/model-profiling/blob/main/inference_time/nsys/googlenet-GTX_1080_Ti_gputrace) |
| Googlenet | Pytorch | 2060 | TensorRT | [Link](https://github.com/chilin0525/model-profiling/blob/main/inference_time/nsys/googlenet-9-RTX_2060_gputrace) |
| Googlenet | Pytorch | 1080 | Pytorch | [Link](https://github.com/chilin0525/model-profiling/blob/main/inference_time/nsys/pytorch/googlenet-GTX_1080_Ti_gputrace) |
| Googlenet | Pytorch | 2060 | Pytorch | [Link](https://github.com/chilin0525/model-profiling/blob/main/inference_time/nsys/pytorch/googlenet-RTX_2060_gputrace) |

### 2. Progress and Problems Last Week

1. LazyBatching [name=李睦樂]
    * LazyBatching targets GPU-based inference systems.
    * LazyBatching provides an average 1.4−56× improvement in latency while still achieving competitive system throughput. In terms of QoS, LazyBatching reduces the number of SLA violations by 1.3×.
    * 9.760
    * batch size: 2
        * ![](https://i.imgur.com/nSxg2rQ.png)
        * grid: <<<784, 1, 1>>>
        * block: <<<8, 8, 1>>>
        * Launch Type: Regular
        * Static Shared Memory: 2,304 bytes
        * Dynamic Shared Memory: 0 bytes
        * Registers Per Thread: 63
        * Local Memory Per Thread: 0 bytes
        * Local Memory Total: 17,694,720 bytes
        * Shared Memory executed: 65,536 bytes
        * Shared Memory Bank Size: 4 B
        * Theoretical occupancy: 100 %
    * batch size: 1
        * ![](https://i.imgur.com/hnLHfka.png)
        * grid: <<<392, 1, 1>>>
        * block: <<<8, 8, 1>>>
        * Launch Type: Regular
        * Static Shared Memory: 2,304 bytes
        * Dynamic Shared Memory: 0 bytes
        * Registers Per Thread: 63
        * Local Memory Per Thread: 0 bytes
        * Local Memory Total: 17,694,720 bytes
        * Shared Memory executed: 65,536 bytes
        * Shared Memory Bank Size: 4 B
        * Theoretical occupancy: 100 %
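The batch-size 1 vs. 2 profiles above (the grid doubles from <<<392, 1, 1>>> to <<<784, 1, 1>>> at the same block shape) come from running one kernel over a batched tensor. A minimal PyTorch sketch of the Batchdata-style operation from the issue list; batching then splitting afterwards is an illustration of the idea, not LazyBatching's actual mechanism:

```python
import torch
import torch.nn as nn

def batch_and_forward(layer, x1, x2):
    """Concatenate two requests along the batch dimension, run a single
    forward pass, then split the output back into per-request results."""
    x = torch.cat([x1, x2], dim=0)
    y = layer(x)
    return y[: x1.size(0)], y[x1.size(0):]

layer = nn.Conv2d(3, 16, kernel_size=3, padding=1)
a, b = torch.randn(1, 3, 28, 28), torch.randn(1, 3, 28, 28)
ya, yb = batch_and_forward(layer, a, b)
assert torch.allclose(ya, layer(a), atol=1e-6)
```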
2. Pretrained model issue [name=邱頎霖]
    * The pretrained model downloaded from PyTorch and the pretrained model provided by ONNX differ in accuracy (both were run through TensorRT before inference)
    * Model diagrams:
        * [Pytorch pretrained Resnet50](https://imgur.com/J64ljE1)
        * [ONNX pretrained Resnet50-v2-7](https://imgur.com/pbn1fKY)
    * Comparing the diagrams, the difference is that the ONNX model adds batch normalization to every layer
    * Accuracy difference:
        * ONNX pretrained model:
        ```
        tiger cat | 51.75048%
        tabby, tabby cat | 26.49407%
        Egyptian cat | 17.49883%
        tiger, Panthera tigris | 02.83179%
        lynx, catamount | 00.10809%
        ```
        * Pytorch pretrained model:
        ```
        tabby, tabby cat | 43.14908%
        tiger cat | 29.68258%
        Egyptian cat | 25.96064%
        tiger, Panthera tigris | 00.34236%
        lynx, catamount | 00.30307%
        ```
    * [Test image](https://github.com/chilin0525/ML_note/blob/ed37dc1c9ccb43b9bd20c8441fd586ba753c1847/Practice/TensorRT/cat.png); ```tabby, tabby cat``` should in theory have the highest probability
    * Additionally, the results of the GoogleNet pretrained model provided by ONNX are even worse:
        * Googlenet-9:
        ```
        velvet | 00.21331%
        shower curtain | 00.10161%
        wool, woolen, woollen | 00.10158%
        paper towel | 00.10140%
        plastic bag | 00.10131%
        ```

### 3. Not Resolved Problems

### 4. Goals for next week

* Case study [name=李睦樂]
    * Batching
    * https://github.com/s094392/nsys_pytorch
    * https://docs.nvidia.com/deeplearning/frameworks/pyprof-user-guide/
* Find ONNX pretrained models and measure the execution time of the following six models on different GPUs (GoogleNet, ResNet50, DLRM, BERT, SSD, Transformer). Plot charts with the different models on the x-axis, per GPU (RTX1080, RTX2060) [name=邱頎霖]
    * y-axis: model execution time
    * y-axis: memory HtoD time
    * y-axis: memory DtoD time
    * y-axis: memory DtoH time
    * MLPerf: https://github.com/mlperf/training_results_v0.7/tree/master/NVIDIA/benchmarks
* Measure per-layer execution time: adapt the previously written tool so that, given a model as input, it measures every layer's execution time (a hook-based sketch follows below) [name=邱頎霖]
* Read the CUDA book and practice writing CUDA programs [name=陳柏丞]
    * https://www.sciencedirect.com/science/article/pii/B9780124159921000031
    * https://www.sciencedirect.com/science/article/pii/B9780124159921000043
    * https://www.sciencedirect.com/science/article/pii/B9780124159921000055
    * https://www.sciencedirect.com/science/article/pii/B9780124159921000067
    * https://www.sciencedirect.com/science/article/pii/B9780124159921000080
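For the per-layer timing tool, a hook-based sketch assuming PyTorch; the names are illustrative, the per-layer synchronization adds overhead, and modules invoked more than once keep only their last timing, so the numbers are indicative only:

```python
import torch
import torchvision

def per_layer_times(model, x):
    """Record each leaf module's forward time (ms) with CUDA events."""
    times = {}

    def make_hooks(name):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        def pre_hook(module, inputs):
            start.record()
        def post_hook(module, inputs, output):
            end.record()
            torch.cuda.synchronize()
            times[name] = start.elapsed_time(end)
        return pre_hook, post_hook

    for name, module in model.named_modules():
        if not list(module.children()):  # leaf modules only
            pre, post = make_hooks(name)
            module.register_forward_pre_hook(pre)
            module.register_forward_hook(post)

    with torch.no_grad():
        model(x)
    return times

model = torchvision.models.resnet50(pretrained=True, progress=False).eval().cuda()
times = per_layer_times(model, torch.randn(1, 3, 224, 224, device="cuda"))
```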
## Time: 8/31/2021 - 9/7/2021

### 1. Successes Last Week

* Distributing decision
    * ![](https://i.imgur.com/GKqNS2q.png)
    * ![](https://i.imgur.com/cgSjyLn.png)
    * ![](https://i.imgur.com/VN6Pzqe.png)
* [name=邱頎霖]

<div> <img src="https://raw.githubusercontent.com/chilin0525/tmp/main/GTX1080%20total%20inference%20time.png" width=400px> <img src="https://raw.githubusercontent.com/chilin0525/tmp/main/RTX2060%20total%20inference%20time.png" width=400px> </div>
<div> <img src="https://raw.githubusercontent.com/chilin0525/tmp/main/GTX1080%20total%20HtoD%20time.png" width=400px> <img src="https://raw.githubusercontent.com/chilin0525/tmp/main/RTX2060%20total%20HtoD%20time.png" width=400px> </div>
<div> <img src="https://raw.githubusercontent.com/chilin0525/tmp/main/GTX1080%20total%20DtoD%20time.png" width=400px> <img src="https://raw.githubusercontent.com/chilin0525/tmp/main/RTX2060%20total%20DtoD%20time.png" width=400px> </div>
<div> <img src="https://raw.githubusercontent.com/chilin0525/tmp/main/GTX1080%20total%20DtoH%20time.png" width=400px> <img src="https://raw.githubusercontent.com/chilin0525/tmp/main/RTX2060%20total%20DtoH%20time.png" width=400px> </div>

### 2. Progress and Problems Last Week

### 3. Not Resolved Problems

### 4. Goals for next week

## Time: 8/17/2021 - 8/14/2021

### 1. Successes Last Week

* Updated the current calculator to adopt new models:
    * https://github.com/s094392/nsys_pytorch/blob/main/Scheduler.ipynb
    * ![](https://i.imgur.com/7y8q5mG.png)
    * ![](https://i.imgur.com/tlC2EaZ.png)
    * ![](https://i.imgur.com/Wp0K0vh.png)
    * ![](https://i.imgur.com/5JMbXkw.png)
    * ![](https://i.imgur.com/kHe5hya.png)

### 2. Progress and Problems Last Week

* Some models are not compatible with our scheduler
    * DLRM

### 3. Not Resolved Problems

### 4. Goals for next week

## Time: 9/21/2021 - 9/28/2021

### 1. Successes Last Week

* Scheduling emulator
    * Can simulate the most naive case
    * https://github.com/s094392/Schduling-Emulator

### 2. Progress and Problems Last Week

### 3. Not Resolved Problems

### 4. Goals for next week

## Time: 9/29/2021 - 10/5/2021

### 1. Successes Last Week

![](https://i.imgur.com/66Z3NH8.png)
![](https://i.imgur.com/Axjwx2R.png)
![](https://i.imgur.com/peTTPmn.png)

* Each task is randomly assigned an arrival time

### 2. Progress and Problems Last Week

* How to define the input
    * Task density
    * Task model mix

### 3. Not Resolved Problems

### 4. Goals for next week

* Measure the tail latency of the simplest scoring method

## Time: 10/5/2021 - 10/12/2021

### 1. Successes Last Week

* A Poisson-distribution task generator (a sketch follows below)
    * ![](https://i.imgur.com/2SZP8mu.png)
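A sketch of the Poisson task generator above, assuming exponential inter-arrival times; the rate, horizon, and model list are placeholders, not the generator's actual configuration:

```python
import random

def poisson_tasks(rate, horizon, models):
    """Generate (arrival_time, model) pairs as a Poisson process:
    inter-arrival gaps are exponential with the given rate."""
    t, tasks = 0.0, []
    while True:
        t += random.expovariate(rate)
        if t >= horizon:
            return tasks
        tasks.append((t, random.choice(models)))

tasks = poisson_tasks(rate=0.5, horizon=100.0, models=["resnet50", "googlenet"])
```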
### 2. Progress and Problems Last Week

* Problems with task generation
    * Poisson

### 3. Not Resolved Problems

* Task generation benchmark
    * Most papers reference mlcommons's loadgen
* Optimization target
    * Maybe choose another target for optimization?
    * LazyBatching
        * Task latency
        * ![](https://i.imgur.com/CiPKR1x.png)
        * Throughput
        * ![](https://i.imgur.com/tH4NwI4.png)
        * SLA
        * ![](https://i.imgur.com/1GlXbYx.png)
    * Batchmaker
        * Task latency ![](https://i.imgur.com/m8jFfg3.png)

### 4. Goals for next week

## Time: 10/12/2021 - 10/19/2021

### 1. Successes Last Week

* Tested how GPU heterogeneity affects tail latency (a toy emulator sketch appears at the end of this entry)
    * https://docs.google.com/spreadsheets/d/1ZCeK3G27mUhGno_SOb9i_kC7zYfjZiHh20NyT84tOQk/edit?usp=sharing

|     | rate | FIFO | SJF | non naive |
| --- | --------- | ------------ | ------------ | ------------ |
| 0.3 | 10000 | 0.8061723913 | 0.8754009681 | 0.6463139762 |
|     | 100000 | 0.7866307213 | 0.8660946729 | 0.6474199829 |
|     | 500000 | 0.8193596229 | 0.8848145485 | 0.6591048647 |
|     | 1000000 | 0.8788208809 | 0.9276951176 | 0.6989261087 |
|     | 5000000 | 1.046769152 | 1.045323919 | 0.9676406101 |
|     | 10000000 | 1.031633118 | 1.031656137 | 0.9826849872 |
|     | 100000000 | 1.002242317 | 1.002241937 | 0.998761604 |
| 0.4 | 10000 | 0.8198558876 | 0.8779180202 | 0.6883226032 |
|     | 500000 | 0.8124924449 | 0.855187335 | 0.6954500124 |
|     | 100000 | 0.8234293459 | 0.8733644918 | 0.6875973592 |
|     | 1000000 | 0.8707554766 | 0.9208867572 | 0.7263462004 |
|     | 5000000 | 1.019908423 | 1.018100637 | 0.9714849129 |
|     | 10000000 | 1.030036961 | 1.030057441 | 0.9860182923 |
|     | 100000000 | 1.002998369 | 1.002993124 | 0.9984904732 |
| 0.5 | 10000 | 0.8352633553 | 0.8742449486 | 0.7293785208 |
|     | 500000 | 0.8541765431 | 0.8765256818 | 0.7303339633 |
|     | 100000 | 0.8465337964 | 0.8801886236 | 0.729228531 |
|     | 1000000 | 0.8941199977 | 0.9173977155 | 0.7526006764 |
|     | 5000000 | 1.034508704 | 1.03471046 | 0.9848197705 |
|     | 10000000 | 1.024843986 | 1.024858062 | 0.9905557422 |
|     | 100000000 | 1.003454638 | 1.00344444 | 0.9986672435 |
| 0.6 | 10000 | 0.872901323 | 0.8899879163 | 0.7710480104 |
|     | 500000 | 0.8664373006 | 0.8890117349 | 0.7724844997 |
|     | 100000 | 0.858397395 | 0.8877571076 | 0.7708492166 |
|     | 1000000 | 0.9062267611 | 0.9291451403 | 0.7868422293 |
|     | 5000000 | 1.076288463 | 1.076269476 | 0.9777032109 |
|     | 10000000 | 1.019546667 | 1.020002913 | 0.9937487057 |
|     | 100000000 | 1.002087097 | 1.002073924 | 0.9993362843 |
| 0.7 | 10000 | 0.8978036895 | 0.9060060786 | 0.8104960434 |
|     | 500000 | 0.8998979991 | 0.9132013032 | 0.8117890251 |
|     | 100000 | 0.8770580382 | 0.8917264977 | 0.8094471678 |
|     | 1000000 | 0.9271610025 | 0.9375206327 | 0.8275577685 |
|     | 5000000 | 1.044237734 | 1.045331387 | 0.984818506 |
|     | 10000000 | 1.030545239 | 1.030516729 | 0.9895092521 |
|     | 100000000 | 1.00473294 | 1.004729129 | 0.9989034236 |
| 0.8 | 10000 | 0.9126498964 | 0.9223556137 | 0.8546340821 |
|     | 500000 | 0.9155256317 | 0.9228307988 | 0.8597777686 |
|     | 100000 | 0.9087146473 | 0.9183437962 | 0.8504761221 |
|     | 1000000 | 0.9571009766 | 0.9627958452 | 0.864568962 |
|     | 5000000 | 1.094307273 | 1.093520366 | 0.9867407729 |
|     | 10000000 | 1.028606668 | 1.028998334 | 0.994665423 |
|     | 100000000 | 1.002204983 | 1.002193703 | 0.9996312963 |

![](https://i.imgur.com/v2KGrEk.png)
![](https://i.imgur.com/qRNS7uL.png)
![](https://i.imgur.com/S3xbmxz.png)

* rCUDA survey
    * Compatible with CUDA 9.1
    * ![](https://i.imgur.com/KbjV2Vi.png)
    * ![](https://i.imgur.com/18sNbdD.png)
    * "when using rCUDA it is very simple to migrate GPU-accelerated applications because this middleware intercepts all of the CUDA calls performed by the application and thus the consumption of resources in the GPU can be easily tracked"

### 2. Progress and Problems Last Week

### 3. Not Resolved Problems

### 4. Goals for next week
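For reference, the FIFO and SJF columns above come from the scheduling emulator. The sketch below is a toy single-queue reimplementation under strong assumptions (one GPU, whole-task granularity, service times known in advance); it is not the Schduling-Emulator code, only an illustration of how such numbers can be produced:

```python
import heapq

def simulate(tasks, policy="FIFO"):
    """Return per-task latency on a single server; `tasks` is a list of
    (arrival_time, service_time). SJF always picks the shortest queued task."""
    tasks = sorted(tasks)
    queue, latencies, now, i = [], [], 0.0, 0
    while i < len(tasks) or queue:
        if not queue and now < tasks[i][0]:
            now = tasks[i][0]  # idle until the next arrival
        while i < len(tasks) and tasks[i][0] <= now:
            arrival, service = tasks[i]
            key = service if policy == "SJF" else arrival
            heapq.heappush(queue, (key, arrival, service))
            i += 1
        _, arrival, service = heapq.heappop(queue)
        now += service
        latencies.append(now - arrival)
    return latencies

lat = simulate([(0.0, 5.0), (1.0, 1.0), (2.0, 3.0)], policy="SJF")
tail = sorted(lat)[int(0.99 * len(lat))]  # e.g., take the p99 as tail latency
```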