Nsight Systems
==============

The command line interface
--------------------------

To perform the measurements when the program starts, run it with

````
nsys profile [optional command_switch_options] [application_executable] [optional application_options]
````

This command generates a report in the **.nsys-rep** format, to be opened in the GUI.

:::info
The GUI can be installed on your laptop from [here](https://developer.nvidia.com/nsight-systems)
:::

The **command switch options** define what to measure at runtime. Here we summarize some of the most used:

* ``--trace=cuda,nvtx,cublas,openacc,openmp,mpi,<...>`` defines the APIs to be traced
* ``--cuda-memory-usage=true`` visualizes the GPU memory used (! high overhead)
* ``--delay`` starts the measurement after a delay time (s)
* ``--duration`` stops the measurement after an interval (s)
* ``--env-var`` passes environment variables to the profiling daemon
* ``--sample`` selects how to sample the CPU
* ``--sampling-frequency`` changes the frequency of CPU sampling

For a detailed list, please check the command

````
nsys profile --help
````

or the User Guide.

:::warning
To sample the CPU on Leonardo, ``--backtrace=dwarf`` is needed
:::

The GUI
-------

The Graphical User Interface of Nsight Systems provides:

* a **timeline view** to display the timespan and location of the intercepted events
* an **analysis summary** collecting information on the command line used, environment variables, processes and GPUs traced
* **statistical summaries** for events sampled and traced (flat profiles, top-down, ...)

### The CUDA HARDWARE panel

In the CUDA HW panel the timeline displays events occurring on the accelerators, in particular the amount of time spent computing and moving data:

* ``Memory``: percentage of time spent on the GPU moving data in any direction (H2D, D2H, D2D and memset)
* ``Kernel``: percentage of time spent on the GPU executing kernels (OpenACC, CUDA, ...)
:::warning
Percentages refer to the time spent on the GPU, not the total time of program execution
:::

Different colors are used on the timeline to identify H2D (green), D2H (pink), D2D (brown) data movements and kernels (light blue). This panel also shows the duration of each kernel or data movement on the GPU, per stream.

By moving the cursor over a kernel in the timeline, you can view information related to the kernel, such as the actual grid and block dimensions used on the GPU, the theoretical occupancy, the use of local memory, etc.

![image](https://hackmd.io/_uploads/BJu5RE9xJe.png)

Nvtx
----

With nsys the source code is not automatically instrumented. The **NVTX** library ( ``nvToolsExt.h`` ) provides C-based APIs to annotate events and ranges in the source code. These APIs are intercepted by nsys and can be visualized on the GUI timeline.

### Markers and Ranges

For example, you can use **ranges** to wrap a routine, as in this example:

````
#include "nvToolsExt.h"
...
void init_host_data( int n, double * x )
{
  //start the range here
  nvtxRangePushA("init_host_data");

  //initialize x on host
  ...

  //stop the range here
  nvtxRangePop();
}
...
````

````
use nvtx
...
subroutine init_host_data( n, x )

  !start the range here
  call nvtxStartRange("init_host_data")

  !initialize x on host
  ...

  !stop the range here
  call nvtxEndRange()

end subroutine
````

During the time span of the routine execution, the timeline view will show an event labelled "init_host_data". NVTX ranges can also be nested.

The NVTX library has a limited overhead when the tool is not attached to the application.

There are two kinds of APIs:

* **NVTX Markers**: annotate events occurring at a specific time
* **NVTX Ranges**: annotate the timespan of code regions

The argument of these APIs is either a string or an event. If you use a string, you simply tag the code region with a name. If you use an event, you can also attribute to the code region a color, a domain and other features.
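Once the application is annotated, the ranges are collected at runtime simply by adding ``nvtx`` to the list of traced APIs. As a minimal sketch (the executable name ``./myapp`` and the report name are placeholders):

```shell
# Collect NVTX ranges together with CUDA activity; "./myapp" is a placeholder
nsys profile --trace=cuda,nvtx --output=report_nvtx ./myapp
```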
### Events classification

An *event* can be defined in two ways:

* by a message in ASCII or Unicode
* by a C structure with attributes

In case the event is a C structure, the attributes are defined as follows:

````
//Set to default
nvtxEventAttributes_t eventAttrib = {0};

//Declare version and size
eventAttrib.version = NVTX_VERSION;
eventAttrib.size = NVTX_EVENT_ATTRIB_STRUCT_SIZE;

// Message type and message
//
// ASCII
eventAttrib.messageType = NVTX_MESSAGE_TYPE_ASCII;
eventAttrib.message.ascii = __FUNCTION__ ":ascii";
//
// UNICODE
eventAttrib2.messageType = NVTX_MESSAGE_TYPE_UNICODE;
eventAttrib2.message.unicode = __FUNCTIONW__ L":unicode \u2603 snowman";

// Color type and color
eventAttrib.colorType = NVTX_COLOR_ARGB;
eventAttrib.color = COLOR_YELLOW;
````

:::info
The advantage of using events with attributes is the possibility to define the color of the range (which improves readability) and more properties useful for filtering (e.g. the domain)
:::

### The APIs

**Markers** to annotate events at a specific time are defined with

* ``nvtxMarkA(__FUNCTION__ ":nvtxMarkA")`` for ASCII messages
* ``nvtxMarkW(__FUNCTIONW__ L":nvtxMarkW")`` for UNICODE messages
* ``nvtxMarkEx(&eventAttrib)`` for events with attributes

**Ranges** for nested time ranges occurring on a CPU thread are defined by the following start/stop routines:

* start:
  * ``nvtxRangePushA(__FUNCTION__ ":nvtxRangePushA")``
  * ``nvtxRangePushW(__FUNCTIONW__ L":nvtxRangePushW")``
  * ``nvtxRangePushEx(&eventAttrib)``
* end:
  * ``nvtxRangePop()``

The NVHPC suite provides interfaces for Fortran. The ranges can be called with

* start: ``nvtxStartRange("name",id)``
* end: ``nvtxEndRange()``

where "name" is the range label, and id sets the color as defined in the following code snapshot.
````
module nvtx

  use iso_c_binding
  implicit none

  integer,private :: col(7) = [ Z'0000ff00', Z'000000ff', Z'00ffff00', &
                                Z'00ff00ff', Z'0000ffff', Z'00ff0000', Z'00ffffff']
  character,private,target :: tempName(256)

  type, bind(C):: nvtxEventAttributes
     integer(C_INT16_T):: version=1
     integer(C_INT16_T):: size=48 !
     integer(C_INT):: category=0
     integer(C_INT):: colorType=1 ! NVTX_COLOR_ARGB = 1
     integer(C_INT):: color
     integer(C_INT):: payloadType=0 ! NVTX_PAYLOAD_UNKNOWN = 0
     integer(C_INT):: reserved0
     integer(C_INT64_T):: payload ! union uint,int,double
     integer(C_INT):: messageType=1 ! NVTX_MESSAGE_TYPE_ASCII = 1
     type(C_PTR):: message ! ascii char
  end type

  interface nvtxRangePush
     ! push range with custom label and standard color
     subroutine nvtxRangePushA(name) bind(C, name='nvtxRangePushA')
       use iso_c_binding
       character(kind=C_CHAR) :: name(256)
     end subroutine

     ! push range with custom label and custom color
     subroutine nvtxRangePushEx(event) bind(C, name='nvtxRangePushEx')
       use iso_c_binding
       import:: nvtxEventAttributes
       type(nvtxEventAttributes):: event
     end subroutine
  end interface

  interface nvtxRangePop
     subroutine nvtxRangePop() bind(C, name='nvtxRangePop')
     end subroutine
  end interface

contains

  subroutine nvtxStartRange(name,id)
    character(kind=c_char,len=*) :: name
    integer, optional:: id
    type(nvtxEventAttributes):: event
    character(kind=c_char,len=256) :: trimmed_name
    integer:: i

    trimmed_name=trim(name)//c_null_char

    ! move scalar trimmed_name into character array tempName
    do i=1,LEN(trim(name)) + 1
       tempName(i) = trimmed_name(i:i)
    enddo

    if ( .not. present(id)) then
       call nvtxRangePush(tempName)
    else
       event%color=col(mod(id,7)+1)
       event%message=c_loc(tempName)
       call nvtxRangePushEx(event)
    end if
  end subroutine

  subroutine nvtxEndRange
    call nvtxRangePop
  end subroutine

end module nvtx
````

The above wrapper is available from the HPCSDK suite and can be linked with ``-lnvhpcwrapnvtx``.
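As a sketch, building a Fortran code that uses this wrapper might look like the following; the source and executable names are placeholders, to be adapted to your environment:

```shell
# Hypothetical build line: nvfortran from the NVHPC SDK, linking the NVTX wrapper
nvfortran -o myapp myapp.f90 -lnvhpcwrapnvtx
```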
:::info
NVTX ranges are APIs called from the host, and thus their duration on the timeline view refers to the time the program spends within the range on the host. Note however that **CPU and GPU operations can be asynchronous**. If the range contains operations launched to the GPU, the range is projected onto the GPU, and its length identifies the duration of the offloaded operations on the device.
:::

![image](https://hackmd.io/_uploads/HJ8OdCjekx.png)

Postprocessing
--------------

In the following sections we recap some commands to postprocess the data collected with nsys.

### Statistical summary

The following command generates a summary of the events intercepted:

````
nsys stats report.nsys-rep
````

Different kinds of metrics are processed, according to the event kind:

* ``nvtxsum``
* ``cudaapisum``
* ``gpukernsum``
* ``gpumemtimesum``
* ``gpumemsizesum``

By using the ``--format`` command switch, you can choose the preferred output format. Check ``nsys stats --help`` for a complete list of the switches.

This command is helpful to analyze, for example, the cost of data movements done with OpenACC:

````
$ nsys stats -r openaccsum report_acc_data1.nsys-rep
Generating SQLite file report_acc_data1.sqlite from report_acc_data1.nsys-rep
Exporting 63118 events: [==================================================100%]
Processing [report_acc_data1.sqlite] with [/leonardo/prod/spack/03/install/0.19/linux-rhel8-icelake/gcc-8.5.0/nvhpc-23.1-x5lw6edfmfuot2ipna3wseallzl4oolm/Linux_x86_64/23.1/profilers/Nsight_Systems/host-linux-x64/reports/openaccsum.py]...
 ** OpenACC Summary (openaccsum):

 Time(%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)     Min (ns)    Max (ns)    StdDev (ns)               Name
 -------  ---------------  ---------  ------------  ------------  ----------  -----------  ------------  -------------------------------
    27.9    2,422,224,252         50  48,444,485.0  45,729,730.5  45,576,626  176,454,147  18,476,396.7  Exit Data@jacobi.f90:14
    26.4    2,292,084,948         50  45,841,699.0  45,729,495.0  45,514,774   47,135,248     366,299.5  Exit Data@cfd.f90:144
    20.5    1,779,791,765         50  35,595,835.3  35,474,408.0  35,253,494   36,959,563     390,115.3  Enter Data@cfd.f90:144
    19.8    1,719,850,039         50  34,397,000.8  33,854,316.0  33,666,671   55,305,149   3,046,996.4  Enter Data@jacobi.f90:14
     1.4      123,216,371         50   2,464,327.4   2,328,897.0   2,309,384    8,864,364     923,773.8  Wait@jacobi.f90:21
     1.3      115,610,293         50   2,312,205.9   2,307,225.0   2,283,833    2,386,689      21,258.0  Wait@cfd.f90:150
     0.7       58,974,491        100     589,744.9     584,813.0     497,833      684,812      88,260.9  Wait@cfd.f90:144
     0.7       58,777,669        100     587,776.7     513,375.0      48,488      683,116     100,939.1  Wait@jacobi.f90:14
     0.3       27,516,377          1  27,516,377.0  27,516,377.0  27,516,377   27,516,377           0.0  Device Init@jacobi.f90:14
     0.3       26,569,915         50     531,398.3     530,716.0     523,898      548,907       3,876.9  Compute Construct@jacobi.f90:14
     0.3       26,165,316         50     523,306.3     522,872.0     519,415      530,353       2,344.2  Compute Construct@cfd.f90:144
     0.1        8,457,818      1,000       8,457.8       7,139.5       5,890       72,950       3,862.6  Enqueue Upload@cfd.f90:144
````

The summaries can also be visualized from the GUI.

### Recipes

**Recipes** are a new feature available since HPCSDK v23.5 that allows you to summarize events from multiple reports. It requires Python and Jupyter Notebook. The available recipes are given by

````
usage: nsys recipe [<args>] <recipe name> [<recipe args>]

    -h, --help    Print the command's help menu.
    -q, --quiet   Only display errors.
The following built-in recipes are available:

    cuda_api_sum            – CUDA API Summary
    cuda_api_sync           – CUDA Synchronization APIs
    cuda_gpu_kern_pace      – CUDA GPU Kernel Pacing
    cuda_gpu_kern_sum       – CUDA GPU Kernel Summary
    cuda_gpu_mem_size_sum   – CUDA GPU MemOps Summary (by Size)
    cuda_gpu_mem_time_sum   – CUDA GPU MemOps Summary (by Time)
    cuda_gpu_time_util_map  – CUDA GPU Kernel Time Utilization Heatmap
    cuda_memcpy_async       – CUDA Async Memcpy with Pageable Memory
    cuda_memcpy_sync        – CUDA Synchronous Memcpy
    cuda_memset_sync        – CUDA Synchronous Memset
    dx12_mem_ops            – DX12 Memory Operations
    gpu_gaps                – GPU Gaps
    gpu_metric_util_map     – GPU Metric Utilization Heatmap
    gpu_time_util           – GPU Time Utilization
    mpi_sum                 – MPI Summary
    nccl_sum                – NCCL Summary
    nvtx_gpu_proj_sum       – NVTX GPU Projection Summary
    nvtx_gpu_proj_trace     – NVTX GPU Trace
    nvtx_pace               – NVTX Pacing
    nvtx_sum                – NVTX Range Summary
    osrt_sum                – OS Runtime Summary
````

Filters
-------

In the following, we recap two approaches to limit the duration of the measurement. Filtering can be used, for example, in the following cases:

* the trace is too long (max 5-10 minutes, depending on the amount of events observed);
* we need statistics for a selected phase of the simulation;
* the size of the generated report is too large.

### CUDA Profiler APIs

The CUDA profiler library ( ``cudaProfiler.h`` ) provides APIs to start and stop the measurement. These APIs are:

* ``CUresult cuProfilerStart ( void )``: tells the daemon to start the measurement;
* ``CUresult cuProfilerStop ( void )``: tells the daemon to stop the measurement.

If the profiler is launched with the following switch,

````
nsys profile --capture-range=cudaProfilerApi [...]
````

the daemon will start and stop the measurement when the APIs are intercepted. The use of the CUDA Profiler APIs requires recompiling the source code any time we need to change the target of the measurement.

### NVTX ranges

NVTX ranges can also be used to filter the measurement.
That is, the daemon will start and stop the measurement when the range starts and stops. If the code is already decorated with NVTX ranges, you do not need to re-instrument the application to profile a different code region.

To start and stop the measurement when the program intercepts a range (e.g. 'main_loop'), the profiler is launched with the following switches:

````
nsys profile --trace=osrt,nvtx --capture-range=nvtx --nvtx-capture='main_loop' --env-var=NSYS_NVTX_PROFILER_REGISTER_ONLY=0
````

In case of multiple occurrences of the NVTX range, it is possible to capture a given number of them with the following switches:

````
--capture-range-end=repeat[:N],repeat-shutdown:N
````

Multi-GPU reports
-----------------

Nsight Systems also supports distributed programs with multiple GPUs and MPI ranks. It is possible to generate one report per MPI rank by interposing the CLI between the MPI starter (mpirun/srun) and the program executable:

````
mpirun nsys profile <command-switches> <executable> <executable arguments>
````

The reports will be numbered according to the rank id. When using `mpirun`, it is suggested to use the switch ``--output=report-%q{OMPI_COMM_WORLD_RANK}`` to number the reports according to the Open MPI rank id (with `srun`, the analogous variable is ``SLURM_PROCID``). The GUI will show in each report the metrics of the GPU device(s) visible to the selected MPI rank.

If the MPI ranks are on the same node, it is possible to generate one report containing events from all the MPI ranks and GPUs by prepending the CLI to the starter:

````
nsys profile <command-switches> mpirun <executable> <executable arguments>
````

Multiple reports can be visualized from the GUI in the same timeline with the new "multi-report" feature. Statistics are generated per MPI rank; recipes can be used to summarize them.

The following wrapper can be used to selectively instrument only one MPI rank, e.g.
number 0:

````
#!/bin/bash
if [ $OMPI_COMM_WORLD_RANK -eq 0 ]; then
    nsys profile -e NSYS_MPI_STORE_TEAMS_PER_RANK=1 -t mpi "$@"
else
    "$@"
fi
````

:::warning
**NSYS_MPI_STORE_TEAMS_PER_RANK=1** is needed when profiling a subset of MPI ranks, to store all members of custom MPI communicators per MPI rank. Otherwise, the execution might hang or fail with an MPI error.
:::
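The same wrapper pattern can instead profile every rank while writing one report per rank. In this sketch the report prefix and the traced APIs are arbitrary choices, and a guard is added so that the application still runs unprofiled on nodes where ``nsys`` is not in the PATH:

```shell
#!/bin/bash
# Per-rank wrapper: each MPI rank is profiled into its own report file.
# OMPI_COMM_WORLD_RANK is set by Open MPI; default to 0 outside mpirun.
rank=${OMPI_COMM_WORLD_RANK:-0}
report="report-rank${rank}"
if command -v nsys >/dev/null 2>&1; then
    nsys profile --trace=cuda,nvtx,mpi --output="${report}" "$@"
else
    # fall back to an unprofiled run if nsys is not available
    "$@"
fi
```

Used as ``mpirun -np 4 ./wrapper.sh ./myapp``, this produces ``report-rank0.nsys-rep`` ... ``report-rank3.nsys-rep``, which can then be summarized together with recipes.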