# Omnitrace - Performance Analysis Tools for AMD GPUs, CRAY User Group Tutorial 2024

--------------------------------------------------------------

These exercises are based on the AMD training project at https://github.com/amd/HPCTrainingExamples. You can clone it to your home folder with:

```
git clone https://github.com/amd/HPCTrainingExamples
```

# Rocprof

Setup environment:

```bash
salloc --cpus-per-task=8 --mem=0 --ntasks-per-node=4
module load rocm
```

Download the examples repo and navigate to the `HIPIFY` exercises:

```bash
cd ~/HPCTrainingExamples/HIPIFY/mini-nbody/hip/
```

Update the bash scripts to use `$ROCM_PATH`:

```bash
sed -i 's/\/opt\/rocm/${ROCM_PATH}/g' *.sh
```

Compile all:

```bash
./HIP-nbody-orig.sh
```

or compile and run one case:

```bash
hipcc -I../ -DSHMOO nbody-orig.cpp -o nbody-orig
```

Run `rocprof` on nbody-orig to obtain a hotspots list:

```bash
rocprof --stats nbody-orig 65536
```

Check `results.csv` to find, for each invocation of each kernel, details such as grid size (`grd`), workgroup size (`wgr`), LDS used (`lds`), scratch used if register spilling happened (`scr`), the number of SGPRs and VGPRs used, etc. Note that the grid size is the total number of work-items, not the number of work groups. This output is useful if, for instance, you allocate shared memory dynamically.

```bash
cat results.csv
```

Check the statistics result file, which has one line per kernel, sorted in descending order of duration:

```bash
cat results.stats.csv
```

Using `--basenames on` will show only kernel names, without their parameters:

```bash
rocprof --stats --basenames on nbody-orig 65536
```

Check the statistics result file again; it still has one line per kernel, sorted in descending order of duration, now without the parameters:

```bash
cat results.stats.csv
```

Trace HIP, GPU, and Copy activity with `--hip-trace`. This produces a `results.json` file which can be downloaded to your workstation and viewed in the Perfetto UI. This JSON file contains the timeline trace for the application, showing only GPU, Copy, and HIP API activity.

```bash
rocprof --hip-trace nbody-orig 65536
```

One of the outputs of the `--hip-trace` option is the file `results.hip_stats.csv`, which lists the HIP API calls and their durations, sorted in descending order. This can be useful for finding HIP API calls that may be bottlenecks.

```bash
cat results.hip_stats.csv
```

Profile the HSA API as well by adding the `--hsa-trace` option. This is useful if you are profiling OpenMP target offload code, for instance, as the compiler implements all GPU offloading via the HSA layer.

```bash
rocprof --hip-trace --hsa-trace nbody-orig 65536
```

Check the new file `results.hsa_stats.csv`, which contains the statistics for HSA calls:

```bash
cat results.hsa_stats.csv
```
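If you just want a quick look at the hottest kernels or API calls in these CSV files without leaving the terminal, standard shell tools are enough. A minimal sketch, assuming the comma-separated layout `rocprof` writes (the files are already sorted by duration):

```bash
# Align the CSV columns into a readable table and show the top entries
column -s, -t results.stats.csv | head -n 10
column -s, -t results.hip_stats.csv | head -n 10
```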
On your laptop, download `results.json`:

```bash
scp scp://USER@aac1.amd.com:<PORT>/~/HPCTrainingExamples/HIPIFY/mini-nbody/hip/results.json ./
```

Open a browser and go to [https://ui.perfetto.dev/](https://ui.perfetto.dev/). Click on `Open trace file` in the top left corner and navigate to the `results.json` you just downloaded. Use the WASD keys to navigate the GUI.

![image](https://user-images.githubusercontent.com/109979778/225451481-46ffd521-2453-4caa-8d28-fa4e0f4c4889.png)

Read about the hardware counters available for the GPU on this system (look for the gfx90a section):

```bash
less $ROCM_PATH/lib/rocprofiler/gfx_metrics.xml
```

Create a `rocprof_counters.txt` file with the counters you would like to collect:

```bash
vi rocprof_counters.txt
```

Content for `rocprof_counters.txt`:

```
pmc : Wavefronts VALUInsts
pmc : SALUInsts SFetchInsts GDSInsts
pmc : MemUnitBusy ALUStalledByLDS
```

Execute with the counters we just added:

```bash
rocprof --timestamp on -i rocprof_counters.txt nbody-orig 65536
```

You'll notice that `rocprof` runs three passes, one for each set of counters in that file.

View the contents of `rocprof_counters.csv` for the collected counter values for each invocation of each kernel:

```bash
cat rocprof_counters.csv
```

--------------------------------------------------------------

# Omnitrace

Setup environment:

```bash
module purge
module load omnitrace gcc/13
```

## Basic Omnitrace setup

List the various options and environment settings available for the `omnitrace` category:

```bash
omnitrace-avail --categories omnitrace
```

To add brief descriptions, use the `-bd` option:

```bash
omnitrace-avail -bd --categories omnitrace
```

Create a configuration file. This config file can then be modified to change the options that control what you see in the Omnitrace profile. If you do not provide a path, the file is generated in the current directory as `./omnitrace-config.cfg`.

```bash
omnitrace-avail -G ~/omnitrace.cfg
```

Alternatively, you can create an Omnitrace configuration file with a description for each option. Beware, this is quite verbose:

```bash
omnitrace-avail -G ~/omnitrace_all.cfg --all
```

Declare that you want to use this configuration file. This is only needed if you provided a custom path and/or filename when you created it:

```bash
export OMNITRACE_CONFIG_FILE=~/omnitrace.cfg
```

## Setup Jacobi Example

Go to the Jacobi code in the examples repo:

```bash
cd ~/HPCTrainingExamples/HIP/jacobi
```

Compile the code:

```bash
make
```

Execute the binary to make sure it runs successfully:

- Note: To get rid of `Read -1, expected 4136, errno = 1`, add `--mca pml ucx --mca pml_ucx_tls ib,sm,tcp,self,cuda,rocm` to the `mpirun` command line.

```bash
mpirun -np 1 ./Jacobi_hip -g 1 1
```

## Dynamic Instrumentation

Run the code with `omnitrace-instrument` to get runtime instrumentation. Time it to see the overhead of dyninst loading all its libraries at the beginning; we observed that this command took about 2m30s.

```bash
mpirun -np 1 omnitrace-instrument -- ./Jacobi_hip -g 1 1
```

Check the functions available for instrumentation using the `--print-available functions` option. Note that the `--simulate` option will not execute the binary.

```bash
mpirun -np 1 omnitrace-instrument -v 1 --simulate --print-available functions -- ./Jacobi_hip -g 1 1
```

If you want to instrument only particular functions from that list, use the `-I` option to include only those functions:

```bash
mpirun -np 1 omnitrace-instrument -v 1 -I 'Jacobi_t::Run' 'JacobiIteration' -- ./Jacobi_hip -g 1 1
```

This should show that only those functions are instrumented:

```
<snip>
[omnitrace][exe] 1 instrumented funcs in JacobiIteration.hip
[omnitrace][exe] 1 instrumented funcs in JacobiRun.hip
[omnitrace][exe] 1 instrumented funcs in Jacobi_hip
[omnitrace][exe] 2 instrumented funcs in librocprofiler-register.so.0.3.0
<snip>
```
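For reference, here is one way to time the dynamically instrumented run and keep the launcher output for later inspection. A minimal sketch; the log file name `omnitrace-dyninst.log` is just an example:

```bash
# Time the run and capture stdout/stderr so the instrumentation
# summary can be searched afterwards
time mpirun -np 1 omnitrace-instrument -- ./Jacobi_hip -g 1 1 2>&1 | tee omnitrace-dyninst.log

# Which functions were actually instrumented?
grep "instrumented funcs" omnitrace-dyninst.log
```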
## Binary Rewrite

Create an instrumented binary using the `omnitrace-instrument` command first. Notice that this does not take very long to run:

```bash
omnitrace-instrument -o ./Jacobi_hip.inst -- ./Jacobi_hip
```

Execute the new instrumented binary using the `omnitrace-run` command inside `mpirun`. Time it to see the lower overhead. This is the recommended way to profile MPI applications, as Omnitrace will separate the output files for each rank.

```bash
mpirun -np 1 omnitrace-run -- ./Jacobi_hip.inst -g 1 1
```

To see the list of the instrumented GPU calls, turn on `OMNITRACE_PROFILE` in your `$OMNITRACE_CONFIG_FILE`:

```
OMNITRACE_PROFILE = true
```

Then run the instrumented binary again. This time, you will observe that a few extra files are generated. One of them contains a list of instrumented GPU calls and their durations:

```bash
cat omnitrace-Jacobi_hip.inst-output/<TIMESTAMP>/roctracer-0.txt
```

## Debugging omnitrace-run

If you get errors when you run an instrumented binary, or when you run with dynamic instrumentation, add the options `--monochrome -v 1 --debug` and try again. This will give you more debug information to help figure out where the problem may lie.

```bash
mpirun -np 1 omnitrace-run --monochrome -v 1 --debug -- ./Jacobi_hip.inst -g 1 1
```

## Visualization

Copy `perfetto-trace-0.proto` to your laptop and open the web page [https://ui.perfetto.dev/](https://ui.perfetto.dev/):

```bash
scp scp://USER@aac1.amd.com:<PORT>/~/HPCTrainingExamples/HIP/jacobi/omnitrace-Jacobi_hip.inst-output/<TIMESTAMP>/perfetto-trace-0.proto ./
```

Click `Open trace file` and select the `.proto` file.

![image](https://user-images.githubusercontent.com/109979778/225769857-900aa6dd-1c7a-440f-82ab-872dcc09d73c.png)

## Hardware Counters

See a list of all the counters for all the devices on the node:

```bash
omnitrace-avail --all
```
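The list printed by `omnitrace-avail --all` is long, so plain `grep` is a convenient way to search it for the counters used below. A small sketch; the search patterns are just examples:

```bash
# Search the availability listing for specific hardware counters
omnitrace-avail --all | grep -i VALUUtilization
omnitrace-avail --all | grep -i FetchSize
```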
Declare the counters you want in your configuration file:

```
OMNITRACE_ROCM_EVENTS = VALUUtilization,FetchSize
```

Verify that the setting was picked up:

```bash
grep OMNITRACE_ROCM_EVENTS $OMNITRACE_CONFIG_FILE
```

Run the instrumented binary, and you will observe an output file for each hardware counter specified. You should also see a row for each hardware counter in the Perfetto trace generated by Omnitrace. Note that you do not have to instrument again after making changes to the config file; just running the instrumented binary picks up the changes. Ensure that the `OMNITRACE_CONFIG_FILE` environment variable points to your config file.

```bash
mpirun -np 1 omnitrace-run -- ./Jacobi_hip.inst -g 1 1
```

The output should show something like this:

```
[omnitrace][0][rocprof-device-0-VALUUtilization]> Outputting 'omnitrace-Jacobi_hip.inst-output/<TIMESTAMP>/rocprof-device-0-VALUUtilization-0.json'
[omnitrace][0][rocprof-device-0-VALUUtilization]> Outputting 'omnitrace-Jacobi_hip.inst-output/<TIMESTAMP>/rocprof-device-0-VALUUtilization-0.txt'
[omnitrace][0][rocprof-device-0-FetchSize]> Outputting 'omnitrace-Jacobi_hip.inst-output/<TIMESTAMP>/rocprof-device-0-FetchSize-0.json'
[omnitrace][0][rocprof-device-0-FetchSize]> Outputting 'omnitrace-Jacobi_hip.inst-output/<TIMESTAMP>/rocprof-device-0-FetchSize-0.txt'
```

If you do not want to see the details for every CPU core, modify the config file to select only what you want to see, say CPU cores 0-2 only:

```
OMNITRACE_SAMPLING_CPUS = 0-2
```

Now running the instrumented binary again will show significantly fewer CPU lines in the profile:

```bash
mpirun -np 1 omnitrace-run -- ./Jacobi_hip.inst -g 1 1
```

## Profiling Multiple Ranks

Run the instrumented binary with multiple ranks. You'll find multiple `perfetto-trace-*.proto` files, one for each rank:

```bash
mpirun -np 2 omnitrace-run -- ./Jacobi_hip.inst -g 2 1
```

You can visualize them separately in Perfetto, or concatenate them with `cat` and visualize them in the same Perfetto window:

```bash
cat perfetto-trace-0.proto perfetto-trace-1.proto > allprocesses.proto
```

## Sampling

Set the following in your configuration file:

```
OMNITRACE_USE_SAMPLING = true
OMNITRACE_SAMPLING_FREQ = 100
```

Execute the instrumented binary and visualize the Perfetto trace:

```bash
mpirun -np 1 omnitrace-run -- ./Jacobi_hip.inst -g 1 1
```

Scroll down to the very bottom to see the sampling output. Those traces are annotated with an `(S)` as well.

## Kernel Timings

Open the `wall_clock-0.txt` file:

```bash
cat omnitrace-Jacobi_hip.inst-output/<TIMESTAMP>/wall_clock-0.txt
```

To see the kernel durations aggregated, set the following in your config file or in the environment:

```
OMNITRACE_PROFILE = true
OMNITRACE_FLAT_PROFILE = true
```

Execute the code and check the `wall_clock-0.txt` file again. Instead of updating the config file, you can also set the environment variables on the command line to achieve the same effect:

```bash
OMNITRACE_PROFILE=true OMNITRACE_FLAT_PROFILE=true mpirun -np 1 omnitrace-run -- ./Jacobi_hip.inst -g 1 1
```
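As a quick sanity check that the flat profile captured the GPU kernels, you can search the aggregated report for them by name. A minimal sketch, assuming the Jacobi GPU kernel names contain the word `Kernel`; adjust the pattern to match your kernel names:

```bash
# Look for GPU kernel entries in the aggregated timing report
grep -i kernel omnitrace-Jacobi_hip.inst-output/<TIMESTAMP>/wall_clock-0.txt
```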