# Omnitrace

## Setup

* Download the most suitable version from here: https://github.com/AMDResearch/omnitrace/releases
* Create the installation directory
* ./[omnitrace-1.7.0-opensuse-15.3-ROCm-50200-PAPI-OMPT-Python3.sh](https://github.com/AMDResearch/omnitrace/releases/download/v1.7.0/omnitrace-1.7.0-opensuse-15.3-ROCm-50200-PAPI-OMPT-Python3.sh) --prefix=/opt/omnitrace --exclude-subdir

Full documentation: https://amdresearch.github.io/omnitrace/

## Checking installation

### Set up the environment

`source omnitrace_installation_path/share/omnitrace/setup-env.sh`

### Checking omnitrace

`omnitrace --help`

`omnitrace-avail --help`

*Note:* You may need to set `LD_LIBRARY_PATH`. Do not forget to load the ROCm module.

## Instrumentation modes

* Runtime instrumentation (dynamic binary instrumentation)
  * It can collect a large amount of data and add significant overhead if the user does not restrict what is instrumented
* Sampling instrumentation
  * Use `omnitrace-sample`
* Attaching to a process (`-p <pid>`)
  * Attaches to a process that is already running; it can have the same issues as runtime instrumentation
* Binary rewriting (`-o <name-of-new-exe-or-library>`)
  * It will not instrument the dynamically-linked libraries. Consequently, this mode performs the instrumentation significantly faster and has a much lower overhead when running the instrumented executable
  * Binary rewriting is the recommended mode when the target executable uses process-level parallelism (e.g.
MPI)
  * Binary rewriting with dynamic libraries: https://amdresearch.github.io/omnitrace/instrumenting.html#binary-rewriting-a-library

## Runtime instrumentation

`omnitrace <omnitrace-options> -- <exe> [<exe-options>]`

## Sampling instrumentation

`omnitrace-sample <omnitrace-options> -- <exe> [<exe-options>]`

## Binary Rewrite

`omnitrace <omnitrace-options> -o <name-of-new-exe-or-library> -- <exe-or-library>`

## Application-Example

### Download HIP examples:

`git clone https://github.com/ROCm-Developer-Tools/HIP.git`

`cd HIP/samples/2_Cookbook/0_MatrixTranspose`

Let's look at the main parts of `MatrixTranspose.cpp`:

```
// Device (Kernel) function, it must be void
__global__ void matrixTranspose(float* out, float* in, const int width) {
    int x = blockDim.x * blockIdx.x + threadIdx.x;
    int y = blockDim.y * blockIdx.y + threadIdx.y;

    out[y * width + x] = in[x * width + y];
}
```

```
// allocate the memory on the device side
hipMalloc((void**)&gpuMatrix, NUM * sizeof(float));
hipMalloc((void**)&gpuTransposeMatrix, NUM * sizeof(float));

// Memory transfer from host to device
hipMemcpy(gpuMatrix, Matrix, NUM * sizeof(float), hipMemcpyHostToDevice);
std::cout << "Device name " << devProp.name << std::endl;
hipMemcpy(gpuMatrix, Matrix, NUM * sizeof(float), hipMemcpyHostToDevice);

// Launching kernel from host
hipLaunchKernelGGL(matrixTranspose, dim3(WIDTH / THREADS_PER_BLOCK_X, WIDTH / THREADS_PER_BLOCK_Y),
                   dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y), 0, 0, gpuTransposeMatrix,
                   gpuMatrix, WIDTH);

// Memory transfer from device to host
hipMemcpy(TransposeMatrix, gpuTransposeMatrix, NUM * sizeof(float), hipMemcpyDeviceToHost);

// CPU MatrixTranspose computation
matrixTransposeCPUReference(cpuTransposeMatrix, Matrix, WIDTH);
```

### Reserve 1 GPU

`salloc -N 1 -A project_462000075 -p gpu -n 1 --gpus 1 -t 01:00:00`

`module load rocm/5.0.2`

### Compile and Execute

#### Compile and Execute MatrixTranspose example:

* Comment out the execution of the test on line 41 of the Makefile to avoid executing the
binary: `all: $(EXECUTABLE) test` -> `all: $(EXECUTABLE) #test`
* Compile: `make`
* Execute: `srun -n 1 ./MatrixTranspose`

This fails with the following error:

`srun -n 1 ./MatrixTranspose "hipErrorNoBinaryForGpu: Unable to find code object for all current devices!" srun: error: nid005660: task 0: Aborted srun: launch/slurm: _step_signal: Terminating StepId=1731160.5`

* Solution:
  * We need to compile with the flag `--offload-arch=gfx90a`, so in the Makefile change `CXXFLAGS =-g` to `CXXFLAGS =-g --offload-arch=gfx90a`. You can also add another variable, such as `HIPFLAGS`, and adjust the Makefile accordingly
  * Compile on the compute node: `srun -n 1 make`

It is better to define the flags explicitly. If you compile with the Cray compiler, loading the module `craype-accel-amd-gfx90a` is enough.

* Timing: `srun -n 1 ./MatrixTranspose` now prints `PASSED!`

## Check Omnitrace configuration

`srun -n 1 omnitrace-avail --categories omnitrace --brief --description`

[omnitrace] /proc/sys/kernel/perf_event_paranoid has a value of 3. Disabling PAPI (requires a value <= 1)...

[omnitrace] In order to enable PAPI support, run 'echo N | sudo tee /proc/sys/kernel/perf_event_paranoid' where N is < 2

![](https://i.imgur.com/bOjZ3Di.png)

`srun -n 1 omnitrace-avail --categories omnitrace`

![](https://i.imgur.com/5dRavEx.png)

## Create a configuration file

* All the options with some default settings: `srun -n 1 omnitrace-avail -G omnitrace.cfg`
* All the options with description included: `srun -n 1 omnitrace-avail -G omnitrace.cfg --all`

To define which configuration file will be used, declare: `export OMNITRACE_CONFIG_FILE=/path/name.cfg`

## Execution in Dynamic instrumentation mode

* Load the appropriate modules

```
srun -n 1 omnitrace --help
[omnitrace][exe]
[omnitrace][exe] command :: ''...
[omnitrace][exe] [omnitrace] Usage: /pfs/lustrep4/scratch/project_462000075/markoman/omnitrace17/bin/omnitrace [ --help (count: 0, dtype: bool) --verbose (max: 1, dtype: bool) --error (max: 1, dtype: boolean) --debug (max: 1, dtype: bool) --log (count: 1) --log-file (count: 1) --simulate (max: 1, dtype: bool) --print-format (min: 1, dtype: string) --print-dir (count: 1, dtype: string) --print-available (count: 1) --print-instrumented (count: 1) --print-coverage (count: 1) --print-excluded (count: 1) --print-overlapping (count: 1) --print-instructions (max: 1, dtype: bool) --output (min: 0, dtype: string) --pid (count: 1, dtype: int) --mode (count: 1) --force (max: 1, dtype: bool) --command (count: 1) --prefer (count: 1) --library (count: unlimited) --main-function (count: 1) --load (count: unlimited, dtype: string) --load-instr (count: unlimited, dtype: filepath) --init-functions (count: unlimited, dtype: string) --fini-functions (count: unlimited, dtype: string) --all-functions (max: 1, dtype: bool) --function-include (count: unlimited) --function-exclude (count: unlimited) --function-restrict (count: unlimited) --module-include (count: unlimited) --module-exclude (count: unlimited) --module-restrict (count: unlimited) --label (count: unlimited, dtype: string) --config (min: 1, dtype: string) --default-components (count: unlimited, dtype: string) --env (count: unlimited) --mpi (max: 1, dtype: bool) --instrument-loops (max: 1, dtype: boolean) --min-instructions (count: 1, dtype: int) --min-address-range (count: 1, dtype: int) --min-instructions-loop (count: 1, dtype: int) --min-address-range-loop (count: 1, dtype: int) --coverage (max: 1, dtype: bool) --dynamic-callsites (max: 1, dtype: boolean) --traps (max: 1, dtype: bool) --loop-traps (max: 1, dtype: bool) --allow-overlapping (max: 1, dtype: bool) --parse-all-modules (max: 1, dtype: bool) --batch-size (count: 1, dtype: int) --dyninst-rt (min: 1, dtype: filepath) --dyninst-options (count: unlimited) ] -- <CMD> 
<ARGS> ``` ``` srun -n 1 --gpus 1 omnitrace -- ./MatrixTranspose [omnitrace][exe] [omnitrace][exe] command :: '/pfs/lustrep4/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose'... [omnitrace][exe] [omnitrace][exe] DYNINST_API_RT: /pfs/lustrep4/scratch/project_462000075/markoman/omnitrace17/lib/omnitrace/libdyninstAPI_RT.so.11.0.1 [omnitrace][exe] Finding instrumentation functions... [omnitrace][exe] 2 instrumented funcs in libamd_comgr.so.2.4.50002 [omnitrace][exe] 3 instrumented funcs in libamdhip64.so.5.0.50002 [omnitrace][exe] 15 instrumented funcs in libhsa-runtime64.so.1.5.50002 [omnitrace][exe] 2 instrumented funcs in libtinfo.so.6.1 [omnitrace][exe] 3 instrumented funcs in libz.so.1.2.11 [omnitrace][exe] Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.52/instrumentation/available.json'... Done [omnitrace][exe] Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.52/instrumentation/available.txt'... Done [omnitrace][exe] Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.52/instrumentation/instrumented.json'... Done [omnitrace][exe] Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.52/instrumentation/instrumented.txt'... Done [omnitrace][exe] Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.52/instrumentation/excluded.json'... Done [omnitrace][exe] Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.52/instrumentation/excluded.txt'... Done [omnitrace][exe] Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.52/instrumentation/overlapping.json'... Done [omnitrace][exe] Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.52/instrumentation/overlapping.txt'... Done [omnitrace][exe] Executing... [omnitrace][omnitrace_init_tooling] Instrumentation mode: Trace ______ .___ ___. .__ __. 
__ .___________..______ ___ ______ _______ / __ \ | \/ | | \ | | | | | || _ \ / \ / || ____| | | | | | \ / | | \| | | | `---| |----`| |_) | / ^ \ | ,----'| |__ | | | | | |\/| | | . ` | | | | | | / / /_\ \ | | | __| | `--' | | | | | | |\ | | | | | | |\ \----./ _____ \ | `----.| |____ \______/ |__| |__| |__| \__| |__| |__| | _| `._____/__/ \__\ \______||_______| [omnitrace] /proc/sys/kernel/perf_event_paranoid has a value of 3. Disabling PAPI (requires a value <= 1)... [omnitrace] In order to enable PAPI support, run 'echo N | sudo tee /proc/sys/kernel/perf_event_paranoid' where N is < 2 [706.292] perfetto.cc:55910 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: "" Device name Device name [omnitrace][118151][2042][hip_activity_callback] 1 :: Discarding kernel roctracer activity record which ended before it started :: CopyHostToDevice :: CopyHostToDevice :: cid=7, time_ns=(255706649483053:255706648575215) delta=-907838, device=0, queue=0, pid=0, tid=0, op=COPY PASSED! [omnitrace][118151][0][omnitrace_finalize] [omnitrace][118151][0][omnitrace_finalize] finalizing... [omnitrace][118151][0][omnitrace_finalize] [omnitrace][118151][0][omnitrace_finalize] omnitrace/process/118151 : 0.364673 sec wall_clock, 237.052 MB peak_rss, 230.597 MB page_rss, 0.330000 sec cpu_clock, 90.6 % cpu_util [laps: 1] [omnitrace][118151][0][omnitrace_finalize] omnitrace/process/118151/thread/0 : 0.364280 sec wall_clock, 0.289480 sec thread_cpu_clock, 79.5 % thread_cpu_util, 237.028 MB peak_rss [laps: 1] [omnitrace][118151][0][omnitrace_finalize] [omnitrace][118151][0][omnitrace_finalize] Finalizing perfetto... [omnitrace][118151][perfetto]> Outputting '/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/omnitrace-MatrixTranspose-output/2022-10-16_22.53/perfetto-trace.proto' (57.18 KB / 0.06 MB / 0.00 GB)... 
Done
[omnitrace][118151][cpu_util]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/cpu_util.json'
[omnitrace][118151][cpu_util]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/cpu_util.txt'
[omnitrace][118151][cpu_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/cpu_clock.json'
[omnitrace][118151][cpu_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/cpu_clock.txt'
[omnitrace][118151][page_rss]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/page_rss.json'
[omnitrace][118151][page_rss]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/page_rss.txt'
[omnitrace][118151][roctracer]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/roctracer.json'
[omnitrace][118151][roctracer]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/roctracer.txt'
[omnitrace][118151][wall_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/wall_clock.json'
[omnitrace][118151][wall_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/wall_clock.txt'
[omnitrace][118151][metadata]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/metadata-118151.json' and 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/functions-118151.json'
[omnitrace][118151][0][omnitrace_finalize] Finalized
[706.822] perfetto.cc:57383 Tracing session 1 ended, total sessions:0
[omnitrace][exe] End of omnitrace
```

In this case the profiling overhead is more than 120x.

## Why so much overhead?
```
nm --demangle MatrixTranspose | egrep -i ' (t|u) '
000000000020d080 t _GLOBAL__sub_I_MatrixTranspose.cpp
000000000020c970 T __device_stub__warmup()
000000000020ca40 T matrixTransposeCPUReference(float*, float*, unsigned int)
000000000020c9c0 T __device_stub__matrixTranspose(float*, float*, int)
                 U std::ctype<char>::_M_widen_init() const
                 U std::ostream::put(char)
                 U std::ostream::flush()
                 U std::ios_base::Init::Init()
                 U std::ios_base::Init::~Init()
                 U std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)
                 U std::__throw_bad_cast()
                 U __cxa_atexit
000000000020c930 t __do_global_dtors_aux
                 U __hipPopCallConfiguration
                 U __hipPushCallConfiguration
                 U __hipRegisterFatBinary
                 U __hipRegisterFunction
                 U __hipUnregisterFatBinary
000000000020cfd0 t __hip_module_ctor
000000000020d060 t __hip_module_dtor
000000000020d12e T __libc_csu_fini
000000000020d0ae T __libc_csu_init
                 U __libc_start_main
000000000020d178 t _fini
000000000020d160 t _init
000000000020c890 T _start
000000000020d14e t atexit
000000000020c8c0 t deregister_tm_clones
000000000020c960 t frame_dummy
                 U free
                 U hipFree
                 U hipGetDeviceProperties
                 U hipLaunchKernel
                 U hipMalloc
                 U hipMemcpy
000000000020cb00 T main
                 U malloc
                 U printf
                 U puts
000000000020c8f0 t register_tm_clones
                 U strlen
```

With binary rewriting, only the functions in the text section of the executable (the `T`/`t` symbols above) are instrumented; the undefined (`U`) symbols, which are resolved from dynamically linked libraries, are not.
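To see at a glance which symbols binary rewriting can reach, the `nm` listing above can be reduced to its text-section entries. The following is a minimal sketch and not part of Omnitrace: the helper name `filter_text_symbols` is hypothetical, and it assumes the default `nm` layout of address, type letter, and name.

```shell
#!/bin/sh
# Hypothetical helper (not part of Omnitrace): keep only symbols defined in
# the text section, i.e. lines of the form "<hex address> T|t <name>".
# Undefined symbols ("U") carry no address and are dropped.
filter_text_symbols() {
    awk '/^[0-9a-fA-F]+ [Tt] / {
        sub(/^[0-9a-fA-F]+ [Tt] /, "")   # strip address and type letter
        print
    }'
}

# Demo on three lines copied from the listing above.
printf '%s\n' \
    '000000000020cb00 T main' \
    '                 U hipMalloc' \
    '000000000020d178 t _fini' | filter_text_symbols
```

On the real binary this would be used as `nm --demangle MatrixTranspose | filter_text_symbols`. Reading from standard input keeps the helper independent of how the listing was produced.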
Try: `srun -n 1 --gpus 1 omnitrace -v -1 --simulate --print-available functions -- ./MatrixTranspose`

## Custom including/excluding functions

* To include: `srun -n 1 --gpus 1 omnitrace -v -1 --simulate --print-available functions -I 'function_name1' 'function_name2' -- ./MatrixTranspose`
* To exclude: `srun -n 1 --gpus 1 omnitrace -v -1 --simulate --print-available functions -E 'function_name1' 'function_name2' -- ./MatrixTranspose`

## Decreasing profiling overhead

* We set the verbosity with `-v`, add the flag `--simulate` to activate simulation mode, and use the option `--print-available functions` to print the functions available for instrumentation together with their instruction counts. For this option we need to rewrite the binary with `-o name`.

`srun -n 1 omnitrace -v -1 --simulate --print-available functions -o matrix.inst -- ./MatrixTranspose`

```
[omnitrace][exe]
[omnitrace][exe] command :: '/pfs/lustrep4/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose'...
[omnitrace][exe]
[omnitrace][exe] Resolved 'libomnitrace-rt.so' to '/pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-rt.so.11.0.1'...
[omnitrace][exe] DYNINST_API_RT: /pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-rt.so.11.0.1
[omnitrace][exe] instrumentation target: /pfs/lustrep4/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose
[omnitrace][exe] Opening '/pfs/lustrep4/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose' for binary rewrite... Done
[omnitrace][exe] Getting the address space image, modules, and procedures...
[omnitrace][exe]
[omnitrace][exe] Found 16 functions in 6 modules in instrumentation target
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/available.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/available.txt'...
Done [omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/overlapping.json'... Done [omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/overlapping.txt'... Done [omnitrace][exe] function: 'main' ... found [omnitrace][exe] function: 'omnitrace_user_start_trace' ... not found [omnitrace][exe] function: 'omnitrace_user_stop_trace' ... not found [omnitrace][exe] function: 'MPI_Init' ... not found [omnitrace][exe] function: 'MPI_Init_thread' ... not found [omnitrace][exe] function: 'MPI_Finalize' ... not found [omnitrace][exe] function: 'MPI_Comm_rank' ... not found [omnitrace][exe] function: 'MPI_Comm_size' ... not found [omnitrace][exe] Resolved 'libomnitrace-dl.so' to '/pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-dl.so.1.7.0'... [omnitrace][exe] loading library: '/pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-dl.so.1.7.0'... [omnitrace][exe] Finding instrumentation functions... [omnitrace][exe] function: 'omnitrace_init' ... found [omnitrace][exe] function: 'omnitrace_finalize' ... found [omnitrace][exe] function: 'omnitrace_set_env' ... found [omnitrace][exe] function: 'omnitrace_set_mpi' ... found [omnitrace][exe] function: 'omnitrace_push_trace' ... found [omnitrace][exe] function: 'omnitrace_pop_trace' ... found [omnitrace][exe] function: 'omnitrace_register_source' ... found [omnitrace][exe] function: 'omnitrace_register_coverage' ... found [omnitrace][exe] Resolved 'libomnitrace-dl.so' to '/pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-dl.so.1.7.0'... [omnitrace][exe] Adding main entry snippets... [omnitrace][exe] Adding main exit snippets... [omnitrace][exe] [omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/available.json'... Done [omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/available.txt'... 
Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/instrumented.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/instrumented.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/excluded.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/excluded.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/overlapping.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/overlapping.txt'... Done
[available] ../sysdeps/x86_64/crti.S:
[available]     [_fini][3]
[available]     [_init][7]
[available] ../sysdeps/x86_64/start.S:
[available]     [_start][12]
[available] MatrixTranspose:
[available]     [__device_stub__matrixTranspose][26]
[available]     [__device_stub__warmup][17]
[available]     [__do_global_dtors_aux][9]
[available]     [__hip_module_ctor][33]
[available]     [__hip_module_dtor][8]
[available]     [frame_dummy][4]
[available]     [targ20d132][1]
[available] MatrixTranspose.cpp:
[available]     [_GLOBAL__sub_I_MatrixTranspose.cpp][8]
[available]     [main][278]
[available]     [matrixTransposeCPUReference][51]
[available] atexit.c:
[available]     [atexit][5]
[available] elf-init.c:
[available]     [__libc_csu_fini][3]
[available]     [__libc_csu_init][36]
```

Note: the number in the brackets `[]` is the number of instructions in the function. We care about the functions in `MatrixTranspose.cpp`, where the smallest count is 8 instructions. We can set the minimum number of instructions with the flag `-i 8` (`--min-instructions`), so we can try the command:

`srun -n 1 omnitrace -v -l --simulate --print-instrumented functions -i 8 -o matrix.inst -- ./MatrixTranspose`

```
[omnitrace][exe]
[omnitrace][exe] command :: '/pfs/lustrep4/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose'...
[omnitrace][exe] [omnitrace][exe] Resolved 'libomnitrace-rt.so' to '/pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-rt.so.11.0.1'... [omnitrace][exe] DYNINST_API_RT: /pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-rt.so.11.0.1 [omnitrace][exe] instrumentation target: /pfs/lustrep4/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose [omnitrace][exe] Opening '/pfs/lustrep4/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose' for binary rewrite... Done [omnitrace][exe] Getting the address space image, modules, and procedures... [omnitrace][exe] [omnitrace][exe] Found 16 functions in 6 modules in instrumentation target [omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/available.json'... Done [omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/available.txt'... Done [omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/overlapping.json'... Done [omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/overlapping.txt'... Done [omnitrace][exe] function: 'main' ... found [omnitrace][exe] function: 'omnitrace_user_start_trace' ... not found [omnitrace][exe] function: 'omnitrace_user_stop_trace' ... not found [omnitrace][exe] function: 'MPI_Init' ... not found [omnitrace][exe] function: 'MPI_Init_thread' ... not found [omnitrace][exe] function: 'MPI_Finalize' ... not found [omnitrace][exe] function: 'MPI_Comm_rank' ... not found [omnitrace][exe] function: 'MPI_Comm_size' ... not found [omnitrace][exe] Resolved 'libomnitrace-dl.so' to '/pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-dl.so.1.7.0'... [omnitrace][exe] loading library: '/pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-dl.so.1.7.0'... [omnitrace][exe] Finding instrumentation functions... 
[omnitrace][exe] function: 'omnitrace_init' ... found [omnitrace][exe] function: 'omnitrace_finalize' ... found [omnitrace][exe] function: 'omnitrace_set_env' ... found [omnitrace][exe] function: 'omnitrace_set_mpi' ... found [omnitrace][exe] function: 'omnitrace_push_trace' ... found [omnitrace][exe] function: 'omnitrace_pop_trace' ... found [omnitrace][exe] function: 'omnitrace_register_source' ... found [omnitrace][exe] function: 'omnitrace_register_coverage' ... found [omnitrace][exe] Resolved 'libomnitrace-dl.so' to '/pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-dl.so.1.7.0'... [omnitrace][exe] Adding main entry snippets... [omnitrace][exe] Adding main exit snippets... [omnitrace][exe] [function][Instrumenting] no-constraint :: 'matrixTransposeCPUReference'... [omnitrace][exe] [function][Loop Instrumenting] no-constraint :: 'matrixTransposeCPUReference'... [omnitrace][exe] 1 instrumented funcs in MatrixTranspose.cpp [omnitrace][exe] 1 instrumented loops in MatrixTranspose.cpp [omnitrace][exe] [omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/available.json'... Done [omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/available.txt'... Done [omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/instrumented.json'... Done [omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/instrumented.txt'... Done [omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/excluded.json'... Done [omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/excluded.txt'... Done [omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/overlapping.json'... Done [omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/overlapping.txt'... 
Done [instrumented] MatrixTranspose.cpp: [instrumented] [main][278] [instrumented] [matrixTransposeCPUReference][51] ``` We used the option `--print-instrumented functions` to print the instrumented functions and now only two functions are instrumented. Let's create the `matrix.inst` file: ` srun -n 1 --gpus 1 omnitrace -i 8 -o matrix.inst -- ./MatrixTranspose` If we execute now the new binary, called `matrix.inst`: ``` srun -n 1 --gpus 1 ./matrix.inst [omnitrace][omnitrace_init_tooling] Instrumentation mode: Trace ______ .___ ___. .__ __. __ .___________..______ ___ ______ _______ / __ \ | \/ | | \ | | | | | || _ \ / \ / || ____| | | | | | \ / | | \| | | | `---| |----`| |_) | / ^ \ | ,----'| |__ | | | | | |\/| | | . ` | | | | | | / / /_\ \ | | | __| | `--' | | | | | | |\ | | | | | | |\ \----./ _____ \ | `----.| |____ \______/ |__| |__| |__| \__| |__| |__| | _| `._____/__/ \__\ \______||_______| [omnitrace] /proc/sys/kernel/perf_event_paranoid has a value of 3. Disabling PAPI (requires a value <= 1)... [omnitrace] In order to enable PAPI support, run 'echo N | sudo tee /proc/sys/kernel/perf_event_paranoid' where N is < 2 [364.879] perfetto.cc:55910 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: "" Device name PASSED! [omnitrace][128457][0][omnitrace_finalize] finalizing... [omnitrace][128457][0][omnitrace_finalize] omnitrace/process/128457 : 0.342078 sec wall_clock, 219.956 MB peak_rss, 212.865 MB page_rss, 0.620000 sec cpu_clock, 181.2 % cpu_util [laps: 1] [omnitrace][128457][0][omnitrace_finalize] omnitrace/process/128457/thread/0 : 0.342018 sec wall_clock, 0.300716 sec thread_cpu_clock, 87.9 % thread_cpu_util, 219.956 MB peak_rss [laps: 1] [omnitrace][128457][0][omnitrace_finalize] Finalizing perfetto... 
[omnitrace][128457][perfetto]> Outputting '/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/omnitrace-matrix.inst-output/perfetto-trace.proto' (993.07 KB / 0.99 MB / 0.00 GB)... Done
[omnitrace][128457][roctracer]> Outputting 'omnitrace-matrix.inst-output/roctracer.json'
[omnitrace][128457][roctracer]> Outputting 'omnitrace-matrix.inst-output/roctracer.txt'
[omnitrace][128457][wall_clock]> Outputting 'omnitrace-matrix.inst-output/wall_clock.json'
[omnitrace][128457][wall_clock]> Outputting 'omnitrace-matrix.inst-output/wall_clock.txt'
[omnitrace][128457][manager::finalize][metadata]> Outputting 'omnitrace-matrix.inst-output/metadata.json' and 'omnitrace-matrix.inst-output/functions.json'
[omnitrace][128457][0][omnitrace_finalize] Finalized
[365.282] perfetto.cc:57383 Tracing session 1 ended, total sessions:0
```

Now the overhead is only around 50%, compared to more than 100x before. A new directory called `omnitrace-matrix.inst-output` is created with the performance analysis data. We can read some of the text files, for example `omnitrace-matrix.inst-output/wall_clock.txt`:

![](https://i.imgur.com/ukxnLrp.png)

![](https://i.imgur.com/SgrdUBY.png)

* The file `omnitrace-matrix.inst-output/perfetto-trace.proto` is the one we can visualize with Perfetto (https://ui.perfetto.dev/)

## Visualizing Trace

* Open the link https://ui.perfetto.dev/ in a browser and load the file `omnitrace-matrix.inst-output/perfetto-trace.proto`

You can zoom in with the key W, zoom out with S, move left with A and move right with D.

![](https://i.imgur.com/gjlUMVd.png)

Note: We can observe a really long `hipMemcpy`. The reason is that when the GPU has not been used yet, its first use has significant overhead. To take this cost out of the `hipMemcpy`, we add an empty kernel to the code and execute it first, such as:

```
__global__ void warmup() {}
...
warmup<<<1,64>>>();
```

* Create the new instrumented file and visualize:

![](https://i.imgur.com/4E11EUG.png)

Now the duration of the `hipMemcpy` is as expected.
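Recreating the instrumented file after a source change repeats three commands already shown above: rebuild, rewrite the binary, run the rewritten binary. As a rough sketch, they can be wrapped in a small script. The function names `run` and `reinstrument` and the `DRY_RUN` variable are hypothetical conveniences, not Omnitrace features, and the `srun` options assume the same Slurm allocation used earlier.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Rebuild the sample, rewrite the binary with a minimum of 8 instructions
# per function, then execute the rewritten binary. Stops at the first failure.
# $1: original executable (e.g. ./MatrixTranspose)
# $2: name of the rewritten binary in the current directory (e.g. matrix.inst)
reinstrument() {
    exe=$1
    inst=$2
    run srun -n 1 make &&
    run srun -n 1 --gpus 1 omnitrace -i 8 -o "$inst" -- "$exe" &&
    run srun -n 1 --gpus 1 "./$inst"
}

# Dry run: only prints the three commands, nothing is executed.
(DRY_RUN=1; reinstrument ./MatrixTranspose matrix.inst)
```

Without `DRY_RUN=1` the same call executes the steps, after which `omnitrace-matrix.inst-output/perfetto-trace.proto` can be loaded into Perfetto again.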
## Configuration Files

Omnitrace uses a configuration file in which the user declares the options to use.

* Create a configuration file with the default values: `omnitrace-avail -G omnitrace.cfg`
* Create a configuration file with comments and default values: `omnitrace-avail -G omnitrace.cfg --all`

To select which configuration file to use, declare `export OMNITRACE_CONFIG_FILE=/path/omnitrace.cfg`, or whatever name you have chosen.

### Contents of the configuration file

The default configuration file for v1.7 looks like:

```
OMNITRACE_CONFIG_FILE =
OMNITRACE_USE_PERFETTO = true
OMNITRACE_USE_TIMEMORY = true
OMNITRACE_USE_SAMPLING = false
OMNITRACE_USE_PROCESS_SAMPLING = true
OMNITRACE_USE_ROCTRACER = true
OMNITRACE_USE_ROCM_SMI = true
OMNITRACE_USE_KOKKOSP = false
OMNITRACE_USE_PID = false
OMNITRACE_USE_RCCLP = false
OMNITRACE_USE_ROCPROFILER = true
OMNITRACE_USE_ROCTX = false
OMNITRACE_OUTPUT_PATH = omnitrace-%tag%-output
OMNITRACE_OUTPUT_PREFIX =
OMNITRACE_CRITICAL_TRACE = false
OMNITRACE_PERFETTO_BACKEND = inprocess
OMNITRACE_PERFETTO_BUFFER_SIZE_KB = 1024000
OMNITRACE_PERFETTO_FILL_POLICY = discard
OMNITRACE_PROCESS_SAMPLING_DURATION = -1
OMNITRACE_PROCESS_SAMPLING_FREQ = 0
OMNITRACE_ROCM_EVENTS =
OMNITRACE_SAMPLING_CPUS = 0-3
OMNITRACE_SAMPLING_DELAY = 0.5
OMNITRACE_SAMPLING_DURATION = 0
OMNITRACE_SAMPLING_FREQ = 100
OMNITRACE_SAMPLING_GPUS = all
OMNITRACE_TIME_OUTPUT = false
OMNITRACE_TIMEMORY_COMPONENTS = wall_clock,cpu_clock,page_rss,cpu_util,papi_vector
OMNITRACE_VERBOSE = 0
OMNITRACE_ENABLED = true
OMNITRACE_PAPI_EVENTS =
OMNITRACE_SUPPRESS_CONFIG = false
OMNITRACE_SUPPRESS_PARSING = false
```

## Metrics

* If Omnitrace reports the message `/proc/sys/kernel/perf_event_paranoid has a value of 3. Disabling PAPI (requires a value <= 1)`, you cannot access the hardware counters through PAPI until a system administrator changes the value.

`srun -n 1 --gpus 1 omnitrace-avail --all`

```
...
| GPU                                   |              |                                       |
|---------------------------------------|--------------|---------------------------------------|
| SQ_INSTS_VMEM_WR:device=0             | true         | Number of VMEM write instructions ... |
| SQ_INSTS_VMEM_RD:device=0             | true         | Number of VMEM read instructions i... |
| SQ_INSTS_SALU:device=0                | true         | Number of SALU instructions issued... |
| SQ_INSTS_SMEM:device=0                | true         | Number of SMEM instructions issued... |
| SQ_INSTS_FLAT:device=0                | true         | Number of FLAT instructions issued... |
| SQ_INSTS_FLAT_LDS_ONLY:device=0       | true         | Number of FLAT instructions issued... |
| SQ_INSTS_LDS:device=0                 | true         | Number of LDS instructions issued ... |
| SQ_INSTS_GDS:device=0                 | true         | Number of GDS instructions issued.... |
| SQ_WAIT_INST_LDS:device=0             | true         | Number of wave-cycles spent waitin... |
| SQ_ACTIVE_INST_VALU:device=0          | true         | regspec 71? Number of cycles the S... |
| SQ_INST_CYCLES_SALU:device=0          | true         | Number of cycles needed to execute... |
| SQ_THREAD_CYCLES_VALU:device=0        | true         | Number of thread-cycles used to ex... |
| SQ_LDS_BANK_CONFLICT:device=0         | true         | Number of cycles LDS is stalled by... |
| TCC_HIT[0]:device=0                   | true         | Number of cache hits.                 |
| TCC_HIT[1]:device=0                   | true         | Number of cache hits.                 |
| TCC_HIT[2]:device=0                   | true         | Number of cache hits.                 |
| TCC_HIT[3]:device=0                   | true         | Number of cache hits.                 |
| TCC_HIT[4]:device=0                   | true         | Number of cache hits.                 |
| TCC_HIT[5]:device=0                   | true         | Number of cache hits.                 |
| TCC_HIT[6]:device=0                   | true         | Number of cache hits.                 |
...
```

* Define the metrics to collect in the cfg file, for example:

`OMNITRACE_ROCM_EVENTS = GPUBusy:device=0,Wavefronts:device=0,VALUBusy:device=0,L2CacheHit:device=0,MemUnitBusy:device=0`

Here `device=0` selects GPU 0; adjust the list of metrics and the device IDs to the GPUs you want to measure.

* List of components

```
omnitrace-avail -C
|---------------------------------|---------------|
| COMPONENT                       | AVAILABLE     |
|---------------------------------|---------------|
| allinea_map                     | false         |
| caliper_marker                  | false         |
| caliper_config                  | false         |
| caliper_loop_marker             | false         |
| cpu_clock                       | true          |
| cpu_util                        | true          |
| craypat_counters                | false         |
| craypat_flush_buffer            | false         |
| craypat_heap_stats              | false         |
| craypat_record                  | false         |
| craypat_region                  | false         |
| current_peak_rss                | true          |
| gperftools_cpu_profiler         | false         |
| gperftools_heap_profiler        | false         |
| hip_event                       | false         |
| kernel_mode_time                | true          |
| likwid_marker                   | false         |
| likwid_nvmarker                 | false         |
| malloc_gotcha                   | true          |
| memory_allocations              | true          |
| monotonic_clock                 | true          |
| monotonic_raw_clock             | true          |
| network_stats                   | true          |
| num_io_in                       | true          |
| num_io_out                      | true          |
| num_major_page_faults           | true          |
| num_minor_page_faults           | true          |
| page_rss                        | true          |
| papi_array<12ul>                | true          |
| papi_vector                     | true          |
| peak_rss                        | true          |
| perfetto_trace                  | false         |
| priority_context_switch         | true          |
| process_cpu_clock               | true          |
| process_cpu_util                | true          |
| read_bytes                      | true          |
| read_char                       | true          |
| roctx_marker                    | false         |
| system_clock                    | true          |
| tau_marker                      | false         |
| thread_cpu_clock                | true          |
| thread_cpu_util                 | true          |
| timestamp                       | true          |
| trip_count                      | true          |
| user_clock                      | true          |
| user_mode_time                  | true          |
| virtual_memory                  | true          |
| voluntary_context_switch        | true          |
| vtune_event                     | false         |
| vtune_frame                     | false         |
| vtune_profiler                  | false         |
| wall_clock                      | true          |
| written_bytes                   | true          |
| written_char                    | true          |
| roctracer                       | true          |
| rocprofiler                     | true          |
| sampling_wall_clock             | true          |
| sampling_cpu_clock              | true          |
| 
sampling_percent | true | | sampling_gpu_power | true | | sampling_gpu_temp | true | | sampling_gpu_busy | true | | sampling_gpu_memory_usage | true | |---------------------------------|---------------| ``` ## Execution in Sampling mode ``` omnitrace-sample --help ... Options: -h, -?, --help Shows this page [DEBUG OPTIONS] --monochrome Disable colorized output --debug Debug output -v, --verbose Verbose output [GENERAL OPTIONS] -c, --config Configuration file -o, --output Output path. Accepts 1-2 parameters corresponding to the output path and the output prefix -T, --trace Generate a detailed trace (perfetto output) -P, --profile Generate a call-stack-based profile (conflicts with --flat-profile) -F, --flat-profile Generate a flat profile (conflicts with --profile) -H, --host Enable sampling host-based metrics for the process. E.g. CPU frequency, memory usage, etc. -D, --device Enable sampling device-based metrics for the process. E.g. GPU temperature, memory usage, etc. [TRACING OPTIONS] --trace-file Specify the trace output filename. Relative filepath will be with respect to output path and output prefix. --trace-buffer-size Size limit for the trace output (in KB) --trace-fill-policy [ discard | ring_buffer ] Policy for new data when the buffer size limit is reached: - discard : new data is ignored - ring_buffer : new data overwrites oldest data [PROFILE OPTIONS] --profile-format [ console | json | text ] Data formats for profiling results --profile-diff Generate a diff output b/t the profile collected and an existing profile from another run Accepts 1-2 parameters corresponding to the input path and the input prefix [HOST/DEVICE (PROCESS SAMPLING) OPTIONS] --process-freq Set the default host/device sampling frequency (number of interrupts per second) --process-wait Set the default wait time (i.e. 
                                   delay) before taking first host/device sample (in seconds of realtime)
    --process-duration             Set the duration of the host/device sampling (in seconds of realtime)
    --cpus                         CPU IDs for frequency sampling. Supports integers and/or ranges
    --gpus                         GPU IDs for SMI queries. Supports integers and/or ranges

    [GENERAL SAMPLING OPTIONS]

    -f, --freq                     Set the default sampling frequency (number of interrupts per second)
    -w, --wait                     Set the default wait time (i.e. delay) before taking first sample (in seconds). This delay time is based on the clock of the sampler, i.e., a
                                   delay of 1 second for CPU-clock sampler may not equal 1 second of realtime
    -d, --duration                 Set the duration of the sampling (in seconds of realtime). I.e., it is possible (currently) to set a CPU-clock time delay that exceeds the
                                   real-time duration... resulting in zero samples being taken
    -t, --tids                     Specify the default thread IDs for sampling, where 0 (zero) is the main thread and each thread created by the target application is assigned an
                                   atomically incrementing value.

    [SAMPLING TIMER OPTIONS]

    --cputime                      Sample based on a CPU-clock timer (default). Accepts zero or more arguments:
                                       0. Enables sampling based on CPU-clock timer.
                                       1. Interrupts per second. E.g., 100 == sample every 10 milliseconds of CPU-time.
                                       2. Delay (in seconds of CPU-clock time). I.e., how long each thread should wait before taking first sample.
                                       3+ Thread IDs to target for sampling, starting at 0 (the main thread).
                                          May be specified as index or range, e.g., '0 2-4' will be interpreted as:
                                             sample the main thread (0), do not sample the first child thread but sample the 2nd, 3rd, and 4th child threads
    --realtime                     Sample based on a real-clock timer. Accepts zero or more arguments:
                                       0. Enables sampling based on real-clock timer.
                                       1. Interrupts per second. E.g., 100 == sample every 10 milliseconds of realtime.
                                       2. Delay (in seconds of real-clock time). I.e., how long each thread should wait before taking first sample.
                                       3+ Thread IDs to target for sampling, starting at 0 (the main thread).
                                          May be specified as index or range, e.g., '0 2-4' will be interpreted as:
                                             sample the main thread (0), do not sample the first child thread but sample the 2nd, 3rd, and 4th child threads
                                          When sampling with a real-clock timer, please note that enabling this will cause threads which are typically "idle"
                                          to consume more resources since, while idle, the real-clock time increases (and therefore triggers taking samples)
                                          whereas the CPU-clock time does not.

    [BACKEND OPTIONS]  (These options control region information captured w/o sampling or instrumentation)

    -I, --include [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ]
                                   Include data from these backends
    -E, --exclude [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ]
                                   Exclude data from these backends

    [HARDWARE COUNTER OPTIONS]

    -C, --cpu-events               Set the CPU hardware counter events to record (ref: `omnitrace-avail -H -c CPU`)
    -G, --gpu-events               Set the GPU hardware counter events to record (ref: `omnitrace-avail -H -c GPU`)

    [MISCELLANEOUS OPTIONS]

    -i, --inlines                  Include inline info in output when available
    --hsa-interrupt [ 0 | 1 ]
                                   Set the value of the HSA_ENABLE_INTERRUPT environment variable.
                                   ROCm version 5.2 and older have a bug which will cause a deadlock if a sample is taken while waiting for the signal
                                   that a kernel completed -- which happens when sampling with a real-clock timer. We require this option to be set to
                                   when --realtime is specified to make users aware that, while this may fix the bug, it can have a negative impact on
                                   performance.
                                   Values:
                                       0     avoid triggering the bug, potentially at the cost of reduced performance
                                       1     do not modify how ROCm is notified about kernel completion
```

```
srun -n 1 --gpus 1 omnitrace-sample -- ./MatrixTranspose

[omnitrace][omnitrace_init_tooling] Instrumentation mode: Sampling

      ______   .___  ___. .__   __.  __  .___________..______          ___       ______  _______
     /  __  \  |   \/   | |  \ |  | |  | |           ||   _  \        /   \     /      ||   ____|
    |  |  |  | |  \  /  | |   \|  | |  | `---|  |----`|  |_)  |      /  ^  \   |  ,----'|  |__
    |  |  |  | |  |\/|  | |  . `  | |  |     |  |     |      /      /  /_\  \  |  |     |   __|
    |  `--'  | |  |  |  | |  |\   | |  |     |  |     |  |\  \----./  _____  \ |  `----.|  |____
     \______/  |__|  |__| |__| \__| |__|     |__|     | _| `._____/__/     \__\ \______||_______|

[omnitrace] /proc/sys/kernel/perf_event_paranoid has a value of 3. Disabling PAPI (requires a value <= 1)...
[omnitrace] In order to enable PAPI support, run 'echo N | sudo tee /proc/sys/kernel/perf_event_paranoid' where N is < 2
[609.773] perfetto.cc:55910 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
Device name
Device name
[omnitrace][107567][2042][hip_activity_callback] 1 :: Discarding kernel roctracer activity record which ended before it started :: CopyHostToDevice :: CopyHostToDevice :: cid=7, time_ns=(322609988575074:322609987794326) delta=-780748, device=0, queue=0, pid=0, tid=0, op=COPY
PASSED!
[omnitrace][107567][0][omnitrace_finalize]
[omnitrace][107567][0][omnitrace_finalize] finalizing...
[omnitrace][107567][0][omnitrace_finalize]
[omnitrace][107567][0][omnitrace_finalize] omnitrace/process/107567 : 1.225706 sec wall_clock, 218.244 MB peak_rss, 211.304 MB page_rss, 0.210000 sec cpu_clock, 17.1 % cpu_util [laps: 1]
[omnitrace][107567][0][omnitrace_finalize] omnitrace/process/107567/thread/0 : 1.221656 sec wall_clock, 0.209007 sec thread_cpu_clock, 17.1 % thread_cpu_util, 217.820 MB peak_rss [laps: 2]
[omnitrace][107567][0][omnitrace_finalize]
[omnitrace][107567][0][omnitrace_finalize] Finalizing perfetto...
[omnitrace][107567][perfetto]> Outputting '/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/omnitrace-MatrixTranspose-output/2022-10-10_15.28/perfetto-trace.proto' (23.16 KB / 0.02 MB / 0.00 GB)... Done
[omnitrace][107567][cpu_util]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/cpu_util.json'
[omnitrace][107567][cpu_util]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/cpu_util.txt'
[omnitrace][107567][cpu_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/cpu_clock.json'
[omnitrace][107567][cpu_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/cpu_clock.txt'
[omnitrace][107567][page_rss]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/page_rss.json'
[omnitrace][107567][page_rss]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/page_rss.txt'
[omnitrace][107567][roctracer]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/roctracer.json'
[omnitrace][107567][roctracer]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/roctracer.txt'
[omnitrace][107567][wall_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/wall_clock.json'
[omnitrace][107567][wall_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/wall_clock.txt'
[omnitrace][107567][metadata]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/metadata-107567.json' and 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/functions-107567.json'
[omnitrace][107567][0][omnitrace_finalize] Finalized
[611.040] perfetto.cc:57383 Tracing session 1 ended, total sessions:0
```

![](https://i.imgur.com/l74np4p.png)

```
srun -n 1 --gpus 1 omnitrace-sample -H -D -- ./MatrixTranspose

[omnitrace][omnitrace_init_tooling] Instrumentation mode: Sampling

      ______   .___  ___. .__   __.  __  .___________..______          ___       ______  _______
     /  __  \  |   \/   | |  \ |  | |  | |           ||   _  \        /   \     /      ||   ____|
    |  |  |  | |  \  /  | |   \|  | |  | `---|  |----`|  |_)  |      /  ^  \   |  ,----'|  |__
    |  |  |  | |  |\/|  | |  . `  | |  |     |  |     |      /      /  /_\  \  |  |     |   __|
    |  `--'  | |  |  |  | |  |\   | |  |     |  |     |  |\  \----./  _____  \ |  `----.|  |____
     \______/  |__|  |__| |__| \__| |__|     |__|     | _| `._____/__/     \__\ \______||_______|

[omnitrace] /proc/sys/kernel/perf_event_paranoid has a value of 3. Disabling PAPI (requires a value <= 1)...
[omnitrace] In order to enable PAPI support, run 'echo N | sudo tee /proc/sys/kernel/perf_event_paranoid' where N is < 2
[619.733] perfetto.cc:55910 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
Device name
Device name
[omnitrace][9441][2041][hip_activity_callback] 1 :: Discarding kernel roctracer activity record which ended before it started :: CopyHostToDevice :: CopyHostToDevice :: cid=7, time_ns=(642619960562938:642619959822068) delta=-740870, device=0, queue=0, pid=0, tid=0, op=COPY
PASSED!
[omnitrace][9441][0][omnitrace_finalize]
[omnitrace][9441][0][omnitrace_finalize] finalizing...
[omnitrace][9441][0][omnitrace_finalize]
[omnitrace][9441][0][omnitrace_finalize] omnitrace/process/9441 : 1.234547 sec wall_clock, 218.056 MB peak_rss, 211.100 MB page_rss, 0.280000 sec cpu_clock, 22.7 % cpu_util [laps: 1]
[omnitrace][9441][0][omnitrace_finalize] omnitrace/process/9441/thread/0 : 1.229435 sec wall_clock, 0.211211 sec thread_cpu_clock, 17.2 % thread_cpu_util, 217.644 MB peak_rss [laps: 2]
[omnitrace][9441][0][omnitrace_finalize]
[omnitrace][9441][0][omnitrace_finalize] Finalizing perfetto...
[omnitrace][9441][perfetto]> Outputting '/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/omnitrace-MatrixTranspose-output/2022-10-18_22.52/perfetto-trace.proto' (112.13 KB / 0.11 MB / 0.00 GB)... Done
[omnitrace][9441][cpu_util]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/cpu_util.json'
[omnitrace][9441][cpu_util]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/cpu_util.txt'
[omnitrace][9441][cpu_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/cpu_clock.json'
[omnitrace][9441][cpu_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/cpu_clock.txt'
[omnitrace][9441][page_rss]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/page_rss.json'
[omnitrace][9441][page_rss]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/page_rss.txt'
[omnitrace][9441][roctracer]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/roctracer.json'
[omnitrace][9441][roctracer]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/roctracer.txt'
[omnitrace][9441][wall_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/wall_clock.json'
[omnitrace][9441][wall_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/wall_clock.txt'
[omnitrace][9441][metadata]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/metadata-9441.json' and 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/functions-9441.json'
[omnitrace][9441][0][omnitrace_finalize] Finalized
[621.025] perfetto.cc:57383 Tracing session 1 ended, total sessions:0
```

![](https://i.imgur.com/qupEAjR.png)

Declaring the metrics on the command line:

```
srun -n 1 --gpus 1 omnitrace-sample -TPHD -G "GPUBusy:device=0,Wavefronts:device=0,VALUBusy:device=0,L2CacheHit:device=0,MemUnitBusy:device=0" -- ./MatrixTranspose

[omnitrace][omnitrace_init_tooling] Instrumentation mode: Sampling

      ______   .___  ___. .__   __.  __  .___________..______          ___       ______  _______
     /  __  \  |   \/   | |  \ |  | |  | |           ||   _  \        /   \     /      ||   ____|
    |  |  |  | |  \  /  | |   \|  | |  | `---|  |----`|  |_)  |      /  ^  \   |  ,----'|  |__
    |  |  |  | |  |\/|  | |  . `  | |  |     |  |     |      /      /  /_\  \  |  |     |   __|
    |  `--'  | |  |  |  | |  |\   | |  |     |  |     |  |\  \----./  _____  \ |  `----.|  |____
     \______/  |__|  |__| |__| \__| |__|     |__|     | _| `._____/__/     \__\ \______||_______|

[omnitrace] /proc/sys/kernel/perf_event_paranoid has a value of 3. Disabling PAPI (requires a value <= 1)...
[omnitrace] In order to enable PAPI support, run 'echo N | sudo tee /proc/sys/kernel/perf_event_paranoid' where N is < 2
[681.799] perfetto.cc:55910 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
Device name
Device name
[omnitrace][31487][2041][hip_activity_callback] 1 :: Discarding kernel roctracer activity record which ended before it started :: CopyHostToDevice :: CopyHostToDevice :: cid=7, time_ns=(646682055581903:646682054882615) delta=-699288, device=0, queue=0, pid=0, tid=0, op=COPY
PASSED!
[omnitrace][31487][0][omnitrace_finalize]
[omnitrace][31487][0][omnitrace_finalize] finalizing...
[omnitrace][31487][0][omnitrace_finalize]
[omnitrace][31487][0][omnitrace_finalize] omnitrace/process/31487 : 1.266008 sec wall_clock, 224.308 MB peak_rss, 217.502 MB page_rss, 0.320000 sec cpu_clock, 25.3 % cpu_util [laps: 1]
[omnitrace][31487][0][omnitrace_finalize] omnitrace/process/31487/thread/0 : 1.259163 sec wall_clock, 0.236662 sec thread_cpu_clock, 18.8 % thread_cpu_util, 223.840 MB peak_rss [laps: 2]
[omnitrace][31487][0][omnitrace_finalize]
[omnitrace][31487][0][omnitrace_finalize] Finalizing perfetto...
[omnitrace][31487][perfetto]> Outputting '/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/omnitrace-MatrixTranspose-output/2022-10-19_00.00/perfetto-trace.proto' (218.72 KB / 0.22 MB / 0.00 GB)... Done
[omnitrace][31487][rocprof-device-0-GPUBusy]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-GPUBusy.json'
[omnitrace][31487][rocprof-device-0-GPUBusy]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-GPUBusy.txt'
[omnitrace][31487][rocprof-device-0-Wavefronts]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-Wavefronts.json'
[omnitrace][31487][rocprof-device-0-Wavefronts]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-Wavefronts.txt'
[omnitrace][31487][rocprof-device-0-VALUBusy]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-VALUBusy.json'
[omnitrace][31487][rocprof-device-0-VALUBusy]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-VALUBusy.txt'
[omnitrace][31487][rocprof-device-0-L2CacheHit]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-L2CacheHit.json'
[omnitrace][31487][rocprof-device-0-L2CacheHit]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-L2CacheHit.txt'
[omnitrace][31487][rocprof-device-0-MemUnitBusy]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-MemUnitBusy.json'
[omnitrace][31487][rocprof-device-0-MemUnitBusy]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-MemUnitBusy.txt'
[omnitrace][31487][cpu_util]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/cpu_util.json'
[omnitrace][31487][cpu_util]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/cpu_util.txt'
[omnitrace][31487][cpu_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/cpu_clock.json'
[omnitrace][31487][cpu_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/cpu_clock.txt'
[omnitrace][31487][page_rss]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/page_rss.json'
[omnitrace][31487][page_rss]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/page_rss.txt'
[omnitrace][31487][roctracer]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/roctracer.json'
[omnitrace][31487][roctracer]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/roctracer.txt'
[omnitrace][31487][sampling_gpu_memory_usage]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/sampling_gpu_memory_usage.json'
[omnitrace][31487][sampling_gpu_memory_usage]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/sampling_gpu_memory_usage.txt'
[omnitrace][31487][sampling_gpu_power]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/sampling_gpu_power.json'
[omnitrace][31487][sampling_gpu_power]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/sampling_gpu_power.txt'
[omnitrace][31487][sampling_gpu_temperature]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/sampling_gpu_temperature.json'
[omnitrace][31487][sampling_gpu_temperature]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/sampling_gpu_temperature.txt'
[omnitrace][31487][sampling_gpu_busy_percent]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/sampling_gpu_busy_percent.json'
[omnitrace][31487][sampling_gpu_busy_percent]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/sampling_gpu_busy_percent.txt'
[omnitrace][31487][wall_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/wall_clock.json'
[omnitrace][31487][wall_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/wall_clock.txt'
[omnitrace][31487][metadata]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/metadata-31487.json' and 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/functions-31487.json'
[omnitrace][31487][0][omnitrace_finalize] Finalized
[683.153] perfetto.cc:57383 Tracing session 1 ended, total sessions:0
```

![](https://i.imgur.com/ZVheV26.png)

## Selective Instrumentation