# [10-26 to 11-02]
## Profiling on Jetson nano:
Tools:
- nvprof on nano
- visual profiler on host m/c
Features:
- Shows all kernel calls
- Count of each kernel call
- Time to execute each kernel call
- GPU activity for kernel calls
- Data transfer activity on GPU
- *CPU profiling not supported on nano
- *System profiling not supported on nano
## Profiling Workflow
1. Run nvprof with the necessary args on the nano
2. Save the log file to disk (.nvvp format)
3. Copy the log file to the host m/c with Nvidia Visual Profiler installed and load it.
4. Understand bottlenecks in GPU compute or memory operations
5. Optimise the CUDA code accordingly.
6. Remote profiling by connecting to the nano from the host does not work: the tool asks for an executable object file to profile, which is standard for C, C++, and CUDA programs but not for a Python script.
## Profiling in CLI
- Profiled the smile-classifier inference code on the test set of the smile dataset.
- Current unoptimised throughput is 49.45 images/sec @ 94.37% accuracy.
```
nvprof -o gpu-log-01.nvvp python3 inference_trt.py
```
```
nvprof -o gpu-log-02.nvvp --analysis-metrics python3 inference_trt.py
```
```
nvprof -o gpu-log-03.nvvp --print-gpu-trace --print-api-trace python3 inference_trt.py
```
CLI Profiling result:


## How to install Nvidia Visual profiler on mac:
- https://developer.nvidia.com/nvidia-cuda-toolkit-developer-tools-mac-hosts
## Visualising log on Visual profiler tool
- Open the .nvvp log file in the tool

- Analysis

## Feedback from Tool:
1. Low Memcpy/Kernel Overlap [0 ns / 135 ms = 0% utilisation]
- The percentage of time when a memcpy is performed in parallel with a kernel is low.
2. Low Kernel Concurrency [0 ns / 10 ms = 0% utilisation]
- The percentage of time when two kernels are executed in parallel is low.
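The 0% memcpy/kernel overlap means every copy blocks every kernel. On CUDA this is fixed with async copies on separate streams; the pattern itself is double buffering. Below is a CPU-only sketch of that pattern (compute on batch i while the copy of batch i+1 is in flight), with `copy` and `compute` as stand-ins for `cudaMemcpyAsync` and the kernel launch — an illustration of the schedule, not actual CUDA code:

```python
import threading

def pipeline(batches, copy, compute):
    """Double-buffered pipeline: overlap the copy of batch i+1
    with the compute of batch i.
    copy() stands in for an async HtoD memcpy,
    compute() for a kernel launch on the already-copied buffer."""
    results = []
    staged = copy(batches[0])                   # prime the first buffer
    for nxt in batches[1:]:
        box = {}
        t = threading.Thread(target=lambda b=nxt: box.update(out=copy(b)))
        t.start()                               # copy next batch "in the background"
        results.append(compute(staged))         # compute current batch meanwhile
        t.join()                                # wait for the copy to land
        staged = box["out"]
    results.append(compute(staged))             # drain the last staged batch
    return results
```

With real CUDA streams the thread is unnecessary: the async memcpy and the kernel are enqueued on different streams and the hardware overlaps them, which is exactly what raises the memcpy/kernel overlap metric above 0%.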
## Current task:
- Working on batch inference.
- Profile the batch inference code.
- Try focused profiling.
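For focused profiling, nvprof can be launched with `--profile-from-start off` so that only an explicitly bracketed region is captured. The bracketing calls are `cudaProfilerStart`/`cudaProfilerStop` from the CUDA runtime; one way to reach them from a Python script is via `ctypes`. The helper name `run_profiled` and the fallback behaviour are my own sketch, not part of nvprof:

```python
import ctypes

def run_profiled(fn, libcudart="libcudart.so"):
    """Bracket fn() with cudaProfilerStart/Stop so that nvprof, launched
    with --profile-from-start off, captures only this region.
    Falls back to a plain call when the CUDA runtime is not loadable
    (e.g. when testing the script off-device)."""
    try:
        rt = ctypes.CDLL(libcudart)
    except OSError:
        return fn()                 # no CUDA runtime; just run unprofiled
    rt.cudaProfilerStart()          # profiler capture begins here
    try:
        return fn()
    finally:
        rt.cudaProfilerStop()       # capture ends even if fn() raises
```

Usage on the nano would then look like `nvprof --profile-from-start off -o focused.nvvp python3 inference_trt.py`, with the hot loop wrapped in `run_profiled` inside the script.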
## Areas for optimization:
1. Accumulate a batch of images (CPU + disk/buffer)
2. Preprocess the batch of images on GPU: NumPy -> CuPy
3. Pass the preprocessed batch to the GPU: DtoD copy, or DtoH copy + HtoD copy
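The NumPy -> CuPy move in step 2 works because CuPy mirrors the NumPy API, so the same preprocessing function can run on CPU or GPU depending on which module is bound to the alias. The normalisation constants and NHWC layout below are assumptions for illustration, not taken from the actual `inference_trt.py`:

```python
import numpy as np
# On the nano, rebind the alias to run the same code on GPU:
#   import cupy as xp
xp = np

def preprocess_batch(images, pixel_mean=127.5):
    """Normalise a uint8 image batch (N, H, W, C) to float32 in [-1, 1]
    and transpose to NCHW. pixel_mean and the layout are assumptions
    about the smile-classifier input, not the project's real values."""
    batch = xp.asarray(images, dtype=xp.float32)
    batch = (batch - pixel_mean) / pixel_mean            # scale to [-1, 1]
    return xp.ascontiguousarray(batch.transpose(0, 3, 1, 2))  # NHWC -> NCHW
```

With `xp = cupy`, the output already lives in device memory, so step 3 becomes a cheap DtoD copy into the inference engine's input buffer instead of an HtoD transfer per batch.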
## Additional plugin for pytorch code mapping:
- Nvidia Apex for focused profiling of PyTorch code.
- https://github.com/NVIDIA/apex
Documentation: https://docs.nvidia.com/cuda/pdf/CUDA_Profiler_Users_Guide.pdf