Omnitrace

Setup

  • Download the most suitable version from here:

https://github.com/AMDResearch/omnitrace/releases

Full documentation: https://amdresearch.github.io/omnitrace/

Checking installation

Setup the environment

source omnitrace_installation_path/share/omnitrace/setup-env.sh

Checking omnitrace

omnitrace --help
omnitrace-avail --help

Note: You may need to set LD_LIBRARY_PATH, and do not forget to load the ROCm module.
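A quick way to confirm that the environment is set up is to check that the tools are on PATH before running anything. The following is a minimal sketch; the helper name check_tools is hypothetical and not part of Omnitrace.

```shell
# Report whether each tool is on PATH; returns non-zero if any is missing.
# check_tools is a hypothetical helper, not an Omnitrace command.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# The three executables mentioned in this document.
check_tools omnitrace omnitrace-avail omnitrace-sample ||
    echo "Source setup-env.sh (and check LD_LIBRARY_PATH / the ROCm module) first."
```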

Instrumentation modes

  • Runtime instrumentation (dynamic binary instrumentation)
    • It can instrument a large amount of code, which increases the overhead if the user does not restrict what is instrumented
  • Sampling instrumentation
    • Use the omnitrace-sample executable
  • Attaching to a process (-p )
    • Attaches to a process that is already running; it can have similar overhead issues as runtime instrumentation
  • Binary rewriting (-o )
    • It will not instrument the dynamically-linked libraries. Consequently, this mode performs the instrumentation significantly faster and has a much lower overhead when running the instrumented executable
    • Binary rewriting is the recommended mode when the target executable uses process-level parallelism (e.g. MPI)
    • Binary rewriting with dynamic libraries: https://amdresearch.github.io/omnitrace/instrumenting.html#binary-rewriting-a-library

Runtime instrumentation

omnitrace <omnitrace-options> -- <exe> [<exe-options>]

Sampling instrumentation

omnitrace-sample <omnitrace-options> -- <exe> [<exe-options>]

Binary Rewrite

omnitrace <omnitrace-options> -o <name-of-new-exe-or-library> -- <exe-or-library>

Application-Example

Download HIP examples:

git clone https://github.com/ROCm-Developer-Tools/HIP.git
cd HIP/samples/2_Cookbook/0_MatrixTranspose

Let's see the main content of the example:

// Device (Kernel) function, it must be void
__global__ void matrixTranspose(float* out, float* in, const int width) {
    int x = blockDim.x * blockIdx.x + threadIdx.x;
    int y = blockDim.y * blockIdx.y + threadIdx.y;

    out[y * width + x] = in[x * width + y];
}

    // Allocate the memory on the device side
    hipMalloc((void**)&gpuMatrix, NUM * sizeof(float));
    hipMalloc((void**)&gpuTransposeMatrix, NUM * sizeof(float));

    // Memory transfer from host to device
    hipMemcpy(gpuMatrix, Matrix, NUM * sizeof(float), hipMemcpyHostToDevice);
    std::cout << "Device name " << devProp.name << std::endl;

    // Launching kernel from host
    hipLaunchKernelGGL(matrixTranspose, dim3(WIDTH / THREADS_PER_BLOCK_X, WIDTH / THREADS_PER_BLOCK_Y),
                    dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y), 0, 0, gpuTransposeMatrix,
                    gpuMatrix, WIDTH);

    // Memory transfer from device to host
    hipMemcpy(TransposeMatrix, gpuTransposeMatrix, NUM * sizeof(float), hipMemcpyDeviceToHost);

    // CPU MatrixTranspose computation
    matrixTransposeCPUReference(cpuTransposeMatrix, Matrix, WIDTH);

Reserve 1 GPU

salloc -N 1 -A project_462000075 -p gpu -n 1 --gpus 1 -t 01:00:00
module load rocm/5.0.2

Compile and Execute

Compile and Execute MatrixTranspose example:

  • Comment out the test target on line 41 of the Makefile so that the build does not execute the binary:
    all: $(EXECUTABLE) test -> all: $(EXECUTABLE) #test

  • Compile
    make

  • Execute

srun -n 1 ./MatrixTranspose

Error:

srun -n 1 ./MatrixTranspose
"hipErrorNoBinaryForGpu: Unable to find code object for all current devices!"
srun: error: nid005660: task 0: Aborted
srun: launch/slurm: _step_signal: Terminating StepId=1731160.5

  • Solution:
    • Compile with the flag --offload-arch=gfx90a: in the Makefile, change CXXFLAGS =-g to CXXFLAGS =-g --offload-arch=gfx90a. You can also add other variables, such as HIPFLAGS, and adjust the Makefile accordingly
    • Compile on the compute node: srun -n 1 make

It is better to define the flags explicitly. If you compile with the Cray compiler, loading the module craype-accel-amd-gfx90a is enough.
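The two Makefile edits described above can also be scripted. The sketch below assumes the Makefile contains exactly the lines quoted in this document (CXXFLAGS =-g and all: $(EXECUTABLE) test); the helper name patch_makefile is hypothetical, and it writes to a copy so the original file stays untouched.

```shell
# Write a patched copy of a Makefile:
#   - comment out the "test" prerequisite of the "all" target
#   - append --offload-arch=gfx90a to CXXFLAGS
# patch_makefile is a hypothetical helper; it assumes the line shapes quoted above.
patch_makefile() {
    in="$1"
    out="$2"
    sed -e 's/^\(all:.*\)[[:space:]]test[[:space:]]*$/\1 #test/' \
        -e 's/^\(CXXFLAGS[[:space:]]*=[[:space:]]*-g\)[[:space:]]*$/\1 --offload-arch=gfx90a/' \
        "$in" > "$out"
}

# Demonstration on a two-line stand-in for the real Makefile.
demo=$(mktemp -d)
cat > "$demo/Makefile" <<'EOF'
CXXFLAGS =-g
all: $(EXECUTABLE) test
EOF
patch_makefile "$demo/Makefile" "$demo/Makefile.gfx90a"
cat "$demo/Makefile.gfx90a"
```

If the result looks right for the real Makefile, it could then be built with make -f Makefile.gfx90a on the compute node.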

  • Timing:
    srun -n 1 ./MatrixTranspose
    PASSED!

Check Omnitrace configuration

srun -n 1 omnitrace-avail --categories omnitrace --brief --description
[omnitrace] /proc/sys/kernel/perf_event_paranoid has a value of 3. Disabling PAPI (requires a value <= 1)
[omnitrace] In order to enable PAPI support, run 'echo N | sudo tee /proc/sys/kernel/perf_event_paranoid' where N is < 2
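The message above can be anticipated by reading the kernel setting directly. This is a minimal sketch that only applies the threshold quoted in the message (a value <= 1 is required); the helper name papi_possible is hypothetical.

```shell
# Succeeds when the given perf_event_paranoid level allows PAPI,
# using the "requires a value <= 1" rule from the Omnitrace message.
# papi_possible is a hypothetical helper name.
papi_possible() {
    [ "$1" -le 1 ]
}

paranoid_file=/proc/sys/kernel/perf_event_paranoid
if [ -r "$paranoid_file" ]; then
    level=$(cat "$paranoid_file")
    if papi_possible "$level"; then
        echo "perf_event_paranoid=$level: PAPI counters should be usable"
    else
        echo "perf_event_paranoid=$level: PAPI will be disabled"
    fi
else
    echo "$paranoid_file is not readable on this system"
fi
```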


srun -n 1 omnitrace-avail --categories omnitrace


Create a configuration file

  • All the options with some default settings
    srun -n 1 omnitrace-avail -G omnitrace.cfg

  • All the options with description included
    srun -n 1 omnitrace-avail -G omnitrace.cfg --all

To select which configuration file is used, set:
export OMNITRACE_CONFIG_FILE=/path/name.cfg

Execution in Dynamic instrumentation mode

  • Load the appropriate modules
srun -n 1 omnitrace --help
[omnitrace][exe]
[omnitrace][exe] command :: ''...
[omnitrace][exe]
[omnitrace] Usage: /pfs/lustrep4/scratch/project_462000075/markoman/omnitrace17/bin/omnitrace 
[ --help (count: 0, dtype: bool)
--verbose (max: 1, dtype: bool)
--error (max: 1, dtype: boolean)
--debug (max: 1, dtype: bool)
--log (count: 1)
--log-file (count: 1)
--simulate (max: 1, dtype: bool)
--print-format (min: 1, dtype: string)
--print-dir (count: 1, dtype: string)
--print-available (count: 1)
--print-instrumented (count: 1)
--print-coverage (count: 1)
--print-excluded (count: 1)
--print-overlapping (count: 1)
--print-instructions (max: 1, dtype: bool)
--output (min: 0, dtype: string)
--pid (count: 1, dtype: int)
--mode (count: 1)
--force (max: 1, dtype: bool)
--command (count: 1)
--prefer (count: 1)
--library (count: unlimited)
--main-function (count: 1)
--load (count: unlimited, dtype: string)
--load-instr (count: unlimited, dtype: filepath)
--init-functions (count: unlimited, dtype: string)
--fini-functions (count: unlimited, dtype: string)
--all-functions (max: 1, dtype: bool)
--function-include (count: unlimited)
--function-exclude (count: unlimited)
--function-restrict (count: unlimited)
--module-include (count: unlimited)
--module-exclude (count: unlimited)
--module-restrict (count: unlimited)
--label (count: unlimited, dtype: string)
--config (min: 1, dtype: string)
--default-components (count: unlimited, dtype: string)
--env (count: unlimited)
--mpi (max: 1, dtype: bool)
--instrument-loops (max: 1, dtype: boolean)
--min-instructions (count: 1, dtype: int)
--min-address-range (count: 1, dtype: int)
--min-instructions-loop (count: 1, dtype: int)
--min-address-range-loop (count: 1, dtype: int)
--coverage (max: 1, dtype: bool)
--dynamic-callsites (max: 1, dtype: boolean)
--traps (max: 1, dtype: bool)
--loop-traps (max: 1, dtype: bool)
--allow-overlapping (max: 1, dtype: bool)
--parse-all-modules (max: 1, dtype: bool)
--batch-size (count: 1, dtype: int)
--dyninst-rt (min: 1, dtype: filepath)
--dyninst-options (count: unlimited)
] -- <CMD> <ARGS>

srun -n 1 --gpus 1 omnitrace -- ./MatrixTranspose
[omnitrace][exe]
[omnitrace][exe] command :: '/pfs/lustrep4/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose'...
[omnitrace][exe]
[omnitrace][exe] DYNINST_API_RT: /pfs/lustrep4/scratch/project_462000075/markoman/omnitrace17/lib/omnitrace/libdyninstAPI_RT.so.11.0.1
[omnitrace][exe] Finding instrumentation functions...
[omnitrace][exe]    2 instrumented funcs in libamd_comgr.so.2.4.50002
[omnitrace][exe]    3 instrumented funcs in libamdhip64.so.5.0.50002
[omnitrace][exe]   15 instrumented funcs in libhsa-runtime64.so.1.5.50002
[omnitrace][exe]    2 instrumented funcs in libtinfo.so.6.1
[omnitrace][exe]    3 instrumented funcs in libz.so.1.2.11
[omnitrace][exe] Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.52/instrumentation/available.json'... Done
[omnitrace][exe] Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.52/instrumentation/available.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.52/instrumentation/instrumented.json'... Done
[omnitrace][exe] Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.52/instrumentation/instrumented.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.52/instrumentation/excluded.json'... Done
[omnitrace][exe] Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.52/instrumentation/excluded.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.52/instrumentation/overlapping.json'... Done
[omnitrace][exe] Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.52/instrumentation/overlapping.txt'... Done
[omnitrace][exe] Executing...
[omnitrace][omnitrace_init_tooling] Instrumentation mode: Trace


      ______   .___  ___. .__   __.  __  .___________..______          ___       ______  _______
     /  __  \  |   \/   | |  \ |  | |  | |           ||   _  \        /   \     /      ||   ____|
    |  |  |  | |  \  /  | |   \|  | |  | `---|  |----`|  |_)  |      /  ^  \   |  ,----'|  |__
    |  |  |  | |  |\/|  | |  . `  | |  |     |  |     |      /      /  /_\  \  |  |     |   __|
    |  `--'  | |  |  |  | |  |\   | |  |     |  |     |  |\  \----./  _____  \ |  `----.|  |____
     \______/  |__|  |__| |__| \__| |__|     |__|     | _| `._____/__/     \__\ \______||_______|


[omnitrace] /proc/sys/kernel/perf_event_paranoid has a value of 3. Disabling PAPI (requires a value <= 1)...
[omnitrace] In order to enable PAPI support, run 'echo N | sudo tee /proc/sys/kernel/perf_event_paranoid' where N is < 2
[706.292]       perfetto.cc:55910 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
Device name
Device name
[omnitrace][118151][2042][hip_activity_callback]    1 :: Discarding kernel roctracer activity record which ended before it started :: CopyHostToDevice     :: CopyHostToDevice     :: cid=7, time_ns=(255706649483053:255706648575215) delta=-907838, device=0, queue=0, pid=0, tid=0, op=COPY
PASSED!
[omnitrace][118151][0][omnitrace_finalize]
[omnitrace][118151][0][omnitrace_finalize] finalizing...
[omnitrace][118151][0][omnitrace_finalize]
[omnitrace][118151][0][omnitrace_finalize] omnitrace/process/118151 : 0.364673 sec wall_clock,  237.052 MB peak_rss,  230.597 MB page_rss, 0.330000 sec cpu_clock,   90.6 % cpu_util [laps: 1]
[omnitrace][118151][0][omnitrace_finalize] omnitrace/process/118151/thread/0 : 0.364280 sec wall_clock, 0.289480 sec thread_cpu_clock,   79.5 % thread_cpu_util,  237.028 MB peak_rss [laps: 1]
[omnitrace][118151][0][omnitrace_finalize]
[omnitrace][118151][0][omnitrace_finalize] Finalizing perfetto...
[omnitrace][118151][perfetto]> Outputting '/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/omnitrace-MatrixTranspose-output/2022-10-16_22.53/perfetto-trace.proto' (57.18 KB / 0.06 MB / 0.00 GB)... Done
[omnitrace][118151][cpu_util]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/cpu_util.json'
[omnitrace][118151][cpu_util]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/cpu_util.txt'
[omnitrace][118151][cpu_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/cpu_clock.json'
[omnitrace][118151][cpu_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/cpu_clock.txt'
[omnitrace][118151][page_rss]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/page_rss.json'
[omnitrace][118151][page_rss]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/page_rss.txt'
[omnitrace][118151][roctracer]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/roctracer.json'
[omnitrace][118151][roctracer]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/roctracer.txt'
[omnitrace][118151][wall_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/wall_clock.json'
[omnitrace][118151][wall_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/wall_clock.txt'
[omnitrace][118151][metadata]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/metadata-118151.json' and 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/functions-118151.json'
[omnitrace][118151][0][omnitrace_finalize] Finalized
[706.822]       perfetto.cc:57383 Tracing session 1 ended, total sessions:0
[omnitrace][exe] End of omnitrace

In this case the profiling overhead is more than 120x.
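Overhead figures like this come from comparing the wall-clock time of the plain run with that of the instrumented run. The sketch below shows the arithmetic only; the timings passed in are hypothetical placeholders, not measurements from this example, and the helper name overhead is an assumption.

```shell
# Print the slowdown factor and the percentage overhead for two timings
# given in seconds: overhead <baseline> <instrumented>.
# overhead is a hypothetical helper name.
overhead() {
    awk -v base="$1" -v inst="$2" 'BEGIN {
        if (base <= 0) { print "baseline must be positive" > "/dev/stderr"; exit 1 }
        ratio = inst / base
        printf "%.1fx slowdown (%.0f%% overhead)\n", ratio, (ratio - 1) * 100
    }'
}

# Hypothetical timings, for illustration only.
overhead 0.5 60      # prints: 120.0x slowdown (11900% overhead)
overhead 0.5 0.75    # prints: 1.5x slowdown (50% overhead)
```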

Why so much overhead?

nm --demangle MatrixTranspose | egrep -i ' (t|u) '
000000000020d080 t _GLOBAL__sub_I_MatrixTranspose.cpp
000000000020c970 T __device_stub__warmup()
000000000020ca40 T matrixTransposeCPUReference(float*, float*, unsigned int)
000000000020c9c0 T __device_stub__matrixTranspose(float*, float*, int)
                 U std::ctype<char>::_M_widen_init() const
                 U std::ostream::put(char)
                 U std::ostream::flush()
                 U std::ios_base::Init::Init()
                 U std::ios_base::Init::~Init()
                 U std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)
                 U std::__throw_bad_cast()
                 U __cxa_atexit
000000000020c930 t __do_global_dtors_aux
                 U __hipPopCallConfiguration
                 U __hipPushCallConfiguration
                 U __hipRegisterFatBinary
                 U __hipRegisterFunction
                 U __hipUnregisterFatBinary
000000000020cfd0 t __hip_module_ctor
000000000020d060 t __hip_module_dtor
000000000020d12e T __libc_csu_fini
000000000020d0ae T __libc_csu_init
                 U __libc_start_main
000000000020d178 t _fini
000000000020d160 t _init
000000000020c890 T _start
000000000020d14e t atexit
000000000020c8c0 t deregister_tm_clones
000000000020c960 t frame_dummy
                 U free
                 U hipFree
                 U hipGetDeviceProperties
                 U hipLaunchKernel
                 U hipMalloc
                 U hipMemcpy
000000000020cb00 T main
                 U malloc
                 U printf
                 U puts
000000000020c8f0 t register_tm_clones
                 U strlen

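In binary rewrite mode, only the functions in the text section of the executable (marked T or t above) are instrumented; the undefined symbols (U) are resolved from shared libraries.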
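To see only the symbols defined in the text section, the nm output can be filtered on its type column. This is a minimal sketch; the helper name text_symbols is hypothetical, and the sample lines are copied from the listing above.

```shell
# Keep only symbols whose nm type is T or t (defined in the text section)
# and print their names. Undefined symbols (U) have no address column,
# so their type ends up in the first field and they are skipped.
# text_symbols is a hypothetical helper name.
text_symbols() {
    awk '$2 == "T" || $2 == "t" { $1 = ""; $2 = ""; sub(/^ +/, ""); print }'
}

# Sample lines taken from the nm listing above.
text_symbols <<'EOF'
000000000020ca40 T matrixTransposeCPUReference(float*, float*, unsigned int)
                 U hipMalloc
                 U hipMemcpy
000000000020cb00 T main
000000000020d178 t _fini
EOF
```

On a real binary the same filter would be used as: nm --demangle MatrixTranspose | text_symbols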

Try: srun -n 1 --gpus 1 omnitrace -v -1 --simulate --print-available functions -- ./MatrixTranspose

Custom including/excluding functions

  • To include:

srun -n 1 --gpus 1 omnitrace -v -1 --simulate --print-available functions -I 'function_name1' 'function_name2' -- ./MatrixTranspose

  • To exclude:

srun -n 1 --gpus 1 omnitrace -v -1 --simulate --print-available functions -E 'function_name1' 'function_name2' -- ./MatrixTranspose

Decreasing profiling overhead

  • We set the verbosity with -v -1, add the flag --simulate to activate simulation mode, and add the option --print-available functions to print the functions available for instrumentation together with their instruction counts. Here we use binary rewrite mode by passing -o name.

srun -n 1 omnitrace -v -1 --simulate --print-available functions -o matrix.inst -- ./MatrixTranspose

[omnitrace][exe]
[omnitrace][exe] command :: '/pfs/lustrep4/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose'...
[omnitrace][exe]
[omnitrace][exe] Resolved 'libomnitrace-rt.so' to '/pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-rt.so.11.0.1'...
[omnitrace][exe] DYNINST_API_RT: /pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-rt.so.11.0.1
[omnitrace][exe] instrumentation target: /pfs/lustrep4/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose
[omnitrace][exe] Opening '/pfs/lustrep4/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose' for binary rewrite... Done
[omnitrace][exe] Getting the address space image, modules, and procedures...
[omnitrace][exe]
[omnitrace][exe] Found 16 functions in 6 modules in instrumentation target
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/available.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/available.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/overlapping.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/overlapping.txt'... Done
[omnitrace][exe] function: 'main' ... found
[omnitrace][exe] function: 'omnitrace_user_start_trace' ... not found
[omnitrace][exe] function: 'omnitrace_user_stop_trace' ... not found
[omnitrace][exe] function: 'MPI_Init' ... not found
[omnitrace][exe] function: 'MPI_Init_thread' ... not found
[omnitrace][exe] function: 'MPI_Finalize' ... not found
[omnitrace][exe] function: 'MPI_Comm_rank' ... not found
[omnitrace][exe] function: 'MPI_Comm_size' ... not found
[omnitrace][exe] Resolved 'libomnitrace-dl.so' to '/pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-dl.so.1.7.0'...
[omnitrace][exe] loading library: '/pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-dl.so.1.7.0'...
[omnitrace][exe] Finding instrumentation functions...
[omnitrace][exe] function: 'omnitrace_init' ... found
[omnitrace][exe] function: 'omnitrace_finalize' ... found
[omnitrace][exe] function: 'omnitrace_set_env' ... found
[omnitrace][exe] function: 'omnitrace_set_mpi' ... found
[omnitrace][exe] function: 'omnitrace_push_trace' ... found
[omnitrace][exe] function: 'omnitrace_pop_trace' ... found
[omnitrace][exe] function: 'omnitrace_register_source' ... found
[omnitrace][exe] function: 'omnitrace_register_coverage' ... found
[omnitrace][exe] Resolved 'libomnitrace-dl.so' to '/pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-dl.so.1.7.0'...
[omnitrace][exe] Adding main entry snippets...
[omnitrace][exe] Adding main exit snippets...
[omnitrace][exe]
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/available.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/available.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/instrumented.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/instrumented.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/excluded.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/excluded.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/overlapping.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/overlapping.txt'... Done

[available] ../sysdeps/x86_64/crti.S:
[available]    [_fini][3]
[available]    [_init][7]

[available] ../sysdeps/x86_64/start.S:
[available]    [_start][12]

[available] MatrixTranspose:
[available]    [__device_stub__matrixTranspose][26]
[available]    [__device_stub__warmup][17]
[available]    [__do_global_dtors_aux][9]
[available]    [__hip_module_ctor][33]
[available]    [__hip_module_dtor][8]
[available]    [frame_dummy][4]
[available]    [targ20d132][1]

[available] MatrixTranspose.cpp:
[available]    [_GLOBAL__sub_I_MatrixTranspose.cpp][8]
[available]    [main][278]
[available]    [matrixTransposeCPUReference][51]

[available] atexit.c:
[available]    [atexit][5]

[available] elf-init.c:
[available]    [__libc_csu_fini][3]
[available]    [__libc_csu_init][36]

Note: the number in the brackets [] is the instruction count of the function.
We care about the functions in MatrixTranspose.cpp, where the minimum count is 8. We can set the minimum number of instructions with the flag -i 8, so we can try the command:

srun -n 1 omnitrace -v -l --simulate --print-instrumented functions -i 8 -o matrix.inst -- ./MatrixTranspose

[omnitrace][exe]
[omnitrace][exe] command :: '/pfs/lustrep4/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose'...
[omnitrace][exe]
[omnitrace][exe] Resolved 'libomnitrace-rt.so' to '/pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-rt.so.11.0.1'...
[omnitrace][exe] DYNINST_API_RT: /pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-rt.so.11.0.1
[omnitrace][exe] instrumentation target: /pfs/lustrep4/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose
[omnitrace][exe] Opening '/pfs/lustrep4/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose' for binary rewrite... Done
[omnitrace][exe] Getting the address space image, modules, and procedures...
[omnitrace][exe]
[omnitrace][exe] Found 16 functions in 6 modules in instrumentation target
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/available.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/available.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/overlapping.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/overlapping.txt'... Done
[omnitrace][exe] function: 'main' ... found
[omnitrace][exe] function: 'omnitrace_user_start_trace' ... not found
[omnitrace][exe] function: 'omnitrace_user_stop_trace' ... not found
[omnitrace][exe] function: 'MPI_Init' ... not found
[omnitrace][exe] function: 'MPI_Init_thread' ... not found
[omnitrace][exe] function: 'MPI_Finalize' ... not found
[omnitrace][exe] function: 'MPI_Comm_rank' ... not found
[omnitrace][exe] function: 'MPI_Comm_size' ... not found
[omnitrace][exe] Resolved 'libomnitrace-dl.so' to '/pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-dl.so.1.7.0'...
[omnitrace][exe] loading library: '/pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-dl.so.1.7.0'...
[omnitrace][exe] Finding instrumentation functions...
[omnitrace][exe] function: 'omnitrace_init' ... found
[omnitrace][exe] function: 'omnitrace_finalize' ... found
[omnitrace][exe] function: 'omnitrace_set_env' ... found
[omnitrace][exe] function: 'omnitrace_set_mpi' ... found
[omnitrace][exe] function: 'omnitrace_push_trace' ... found
[omnitrace][exe] function: 'omnitrace_pop_trace' ... found
[omnitrace][exe] function: 'omnitrace_register_source' ... found
[omnitrace][exe] function: 'omnitrace_register_coverage' ... found
[omnitrace][exe] Resolved 'libomnitrace-dl.so' to '/pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-dl.so.1.7.0'...
[omnitrace][exe] Adding main entry snippets...
[omnitrace][exe] Adding main exit snippets...
[omnitrace][exe] [function][Instrumenting] no-constraint :: 'matrixTransposeCPUReference'...
[omnitrace][exe] [function][Loop Instrumenting] no-constraint :: 'matrixTransposeCPUReference'...
[omnitrace][exe]    1 instrumented funcs in MatrixTranspose.cpp
[omnitrace][exe]    1 instrumented loops in MatrixTranspose.cpp
[omnitrace][exe]
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/available.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/available.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/instrumented.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/instrumented.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/excluded.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/excluded.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/overlapping.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/overlapping.txt'... Done

[instrumented] MatrixTranspose.cpp:
[instrumented]    [main][278]
[instrumented]    [matrixTransposeCPUReference][51]

We used the option --print-instrumented functions to print the instrumented functions; now only two functions are instrumented.
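The threshold of 8 passed to -i was read off the [available] listing by hand. A small script can extract the same value, assuming the text layout shown in that listing; the helper name min_count_in_module is hypothetical, and the sample input is copied from the output above.

```shell
# Print the smallest bracketed count among the functions of one module
# in an "[available]" listing read from standard input.
# min_count_in_module is a hypothetical helper name.
min_count_in_module() {
    awk -v module="$1" '
        # Module header, e.g. "[available] MatrixTranspose.cpp:"
        $1 == "[available]" && NF == 2 && $2 ~ /:$/ {
            current = $2
            sub(/:$/, "", current)
            next
        }
        # Function entry, e.g. "[available]    [main][278]"
        $1 == "[available]" && current == module && $2 ~ /^\[.*\]\[[0-9]+\]$/ {
            n = $2
            sub(/^.*\]\[/, "", n)
            sub(/\]$/, "", n)
            n += 0
            if (!seen || n < min) { min = n; seen = 1 }
        }
        END { if (seen) print min }
    '
}

# Sample input copied from the [available] listing.
min_count_in_module MatrixTranspose.cpp <<'EOF'
[available] MatrixTranspose.cpp:
[available]    [_GLOBAL__sub_I_MatrixTranspose.cpp][8]
[available]    [main][278]
[available]    [matrixTransposeCPUReference][51]

[available] atexit.c:
[available]    [atexit][5]
EOF
```

For this input the script prints 8. Note that the threshold is not the only criterion Omnitrace applies, which is why only two of the three functions in MatrixTranspose.cpp end up instrumented.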

Let's create the matrix.inst file:
srun -n 1 --gpus 1 omnitrace -i 8 -o matrix.inst -- ./MatrixTranspose

Now we execute the new binary, matrix.inst:

srun -n 1 --gpus 1 ./matrix.inst
[omnitrace][omnitrace_init_tooling] Instrumentation mode: Trace


      ______   .___  ___. .__   __.  __  .___________..______          ___       ______  _______
     /  __  \  |   \/   | |  \ |  | |  | |           ||   _  \        /   \     /      ||   ____|
    |  |  |  | |  \  /  | |   \|  | |  | `---|  |----`|  |_)  |      /  ^  \   |  ,----'|  |__
    |  |  |  | |  |\/|  | |  . `  | |  |     |  |     |      /      /  /_\  \  |  |     |   __|
    |  `--'  | |  |  |  | |  |\   | |  |     |  |     |  |\  \----./  _____  \ |  `----.|  |____
     \______/  |__|  |__| |__| \__| |__|     |__|     | _| `._____/__/     \__\ \______||_______|


[omnitrace] /proc/sys/kernel/perf_event_paranoid has a value of 3. Disabling PAPI (requires a value <= 1)...
[omnitrace] In order to enable PAPI support, run 'echo N | sudo tee /proc/sys/kernel/perf_event_paranoid' where N is < 2
[364.879]       perfetto.cc:55910 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""

Device name
PASSED!

[omnitrace][128457][0][omnitrace_finalize] finalizing...


[omnitrace][128457][0][omnitrace_finalize] omnitrace/process/128457 : 0.342078 sec wall_clock,  219.956 MB peak_rss,  212.865 MB page_rss, 0.620000 sec cpu_clock,  181.2 % cpu_util [laps: 1]
[omnitrace][128457][0][omnitrace_finalize] omnitrace/process/128457/thread/0 : 0.342018 sec wall_clock, 0.300716 sec thread_cpu_clock,   87.9 % thread_cpu_util,  219.956 MB peak_rss [laps: 1]

[omnitrace][128457][0][omnitrace_finalize] Finalizing perfetto...
[omnitrace][128457][perfetto]> Outputting '/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/omnitrace-matrix.inst-output/perfetto-trace.proto' (993.07 KB / 0.99 MB / 0.00 GB)... Done
[omnitrace][128457][roctracer]> Outputting 'omnitrace-matrix.inst-output/roctracer.json'
[omnitrace][128457][roctracer]> Outputting 'omnitrace-matrix.inst-output/roctracer.txt'
[omnitrace][128457][wall_clock]> Outputting 'omnitrace-matrix.inst-output/wall_clock.json'
[omnitrace][128457][wall_clock]> Outputting 'omnitrace-matrix.inst-output/wall_clock.txt'
[omnitrace][128457][manager::finalize][metadata]> Outputting 'omnitrace-matrix.inst-output/metadata.json' and 'omnitrace-matrix.inst-output/functions.json'
[omnitrace][128457][0][omnitrace_finalize] Finalized
[365.282]       perfetto.cc:57383 Tracing session 1 ended, total sessions:0

Now the overhead is only around 50%, compared to more than 100x before.
A new directory called omnitrace-matrix.inst-output has been created with the performance analysis data. We can read some of the text files, for example:

omnitrace-matrix.inst-output/wall_clock.txt

  • The file omnitrace-matrix.inst-output/perfetto-trace.proto is the one we can visualize with Perfetto (https://ui.perfetto.dev/)

Visualizing Trace

  • Open https://ui.perfetto.dev/ in a browser and load the file omnitrace-matrix.inst-output/perfetto-trace.proto

You can zoom in with the key W, zoom out with S, move left with A and move right with D.


Note: We can observe a very long hipMemcpy. The reason is that when the GPU has not been used yet, its first use carries significant overhead.

We add an empty kernel to the code and execute it, for example:

__global__ void warmup() {}
...
    warmup<<<1,64>>>();

  • Create the new instrumented file and visualize:

Now the duration of the hipMemcpy is as expected.

Configuration Files

Omnitrace uses a configuration file in which the user declares the options to use.

  • Create a configuration file with the default values: omnitrace-avail -G omnitrace.cfg
  • Create a configuration file with comments and default values: omnitrace-avail -G omnitrace.cfg --all

To declare which configuration file to use, set: export OMNITRACE_CONFIG_FILE=/path/omnitrace.cfg (or whatever name you chose).

Contents of the configuration file

The default configuration file for v1.7 looks like:

OMNITRACE_CONFIG_FILE                              =
OMNITRACE_USE_PERFETTO                             = true
OMNITRACE_USE_TIMEMORY                             = true
OMNITRACE_USE_SAMPLING                             = false
OMNITRACE_USE_PROCESS_SAMPLING                     = true
OMNITRACE_USE_ROCTRACER                            = true
OMNITRACE_USE_ROCM_SMI                             = true
OMNITRACE_USE_KOKKOSP                              = false
OMNITRACE_USE_PID                                  = false
OMNITRACE_USE_RCCLP                                = false
OMNITRACE_USE_ROCPROFILER                          = true
OMNITRACE_USE_ROCTX                                = false
OMNITRACE_OUTPUT_PATH                              = omnitrace-%tag%-output
OMNITRACE_OUTPUT_PREFIX                            =
OMNITRACE_CRITICAL_TRACE                           = false
OMNITRACE_PERFETTO_BACKEND                         = inprocess
OMNITRACE_PERFETTO_BUFFER_SIZE_KB                  = 1024000
OMNITRACE_PERFETTO_FILL_POLICY                     = discard
OMNITRACE_PROCESS_SAMPLING_DURATION                = -1
OMNITRACE_PROCESS_SAMPLING_FREQ                    = 0
OMNITRACE_ROCM_EVENTS                              =
OMNITRACE_SAMPLING_CPUS                            = 0-3
OMNITRACE_SAMPLING_DELAY                           = 0.5
OMNITRACE_SAMPLING_DURATION                        = 0
OMNITRACE_SAMPLING_FREQ                            = 100
OMNITRACE_SAMPLING_GPUS                            = all
OMNITRACE_TIME_OUTPUT                              = false
OMNITRACE_TIMEMORY_COMPONENTS                      = wall_clock,cpu_clock,page_rss,cpu_util,papi_vector
OMNITRACE_VERBOSE                                  = 0
OMNITRACE_ENABLED                                  = true
OMNITRACE_PAPI_EVENTS                              =
OMNITRACE_SUPPRESS_CONFIG                          = false
OMNITRACE_SUPPRESS_PARSING                         = false
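Individual options can be changed by editing the KEY = value lines of this file. The sketch below does that from the shell, assuming the KEY = value layout shown above; the helper name set_cfg is hypothetical, and the values chosen are only for illustration.

```shell
# Set KEY to VALUE in a "KEY = value" configuration file, replacing the
# existing line or appending a new one if the key is absent.
# set_cfg is a hypothetical helper, not an Omnitrace command.
set_cfg() {
    key="$1"
    value="$2"
    file="$3"
    tmp="${file}.tmp"
    awk -v key="$key" -v value="$value" '
        $1 == key && $2 == "=" { printf "%-50s = %s\n", key, value; done = 1; next }
        { print }
        END { if (!done) printf "%-50s = %s\n", key, value }
    ' "$file" > "$tmp" && mv "$tmp" "$file"
}

# Demonstration on a two-line stand-in for omnitrace.cfg.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
OMNITRACE_USE_SAMPLING                             = false
OMNITRACE_SAMPLING_FREQ                            = 100
EOF
set_cfg OMNITRACE_USE_SAMPLING true "$cfg"
set_cfg OMNITRACE_ROCM_EVENTS GPUBusy:device=0 "$cfg"
cat "$cfg"
```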

Metrics

  • If Omnitrace reports the message /proc/sys/kernel/perf_event_paranoid has a value of 3. Disabling PAPI (requires a value <= 1), then you cannot access the HW counters until a system administrator changes the value.

srun -n 1 --gpus 1 omnitrace-avail --all

...
|                  GPU                  |              |                                       |
|---------------------------------------|--------------|---------------------------------------|
| SQ_INSTS_VMEM_WR:device=0             |     true     | Number of VMEM write instructions ... |
| SQ_INSTS_VMEM_RD:device=0             |     true     | Number of VMEM read instructions i... |
| SQ_INSTS_SALU:device=0                |     true     | Number of SALU instructions issued... |
| SQ_INSTS_SMEM:device=0                |     true     | Number of SMEM instructions issued... |
| SQ_INSTS_FLAT:device=0                |     true     | Number of FLAT instructions issued... |
| SQ_INSTS_FLAT_LDS_ONLY:device=0       |     true     | Number of FLAT instructions issued... |
| SQ_INSTS_LDS:device=0                 |     true     | Number of LDS instructions issued ... |
| SQ_INSTS_GDS:device=0                 |     true     | Number of GDS instructions issued.... |
| SQ_WAIT_INST_LDS:device=0             |     true     | Number of wave-cycles spent waitin... |
| SQ_ACTIVE_INST_VALU:device=0          |     true     | regspec 71? Number of cycles the S... |
| SQ_INST_CYCLES_SALU:device=0          |     true     | Number of cycles needed to execute... |
| SQ_THREAD_CYCLES_VALU:device=0        |     true     | Number of thread-cycles used to ex... |
| SQ_LDS_BANK_CONFLICT:device=0         |     true     | Number of cycles LDS is stalled by... |
| TCC_HIT[0]:device=0                   |     true     | Number of cache hits.                 |
| TCC_HIT[1]:device=0                   |     true     | Number of cache hits.                 |
| TCC_HIT[2]:device=0                   |     true     | Number of cache hits.                 |
| TCC_HIT[3]:device=0                   |     true     | Number of cache hits.                 |
| TCC_HIT[4]:device=0                   |     true     | Number of cache hits.                 |
| TCC_HIT[5]:device=0                   |     true     | Number of cache hits.                 |
| TCC_HIT[6]:device=0                   |     true     | Number of cache hits.                 |
...
  • Define the metrics in the configuration file, for example:

OMNITRACE_ROCM_EVENTS = GPUBusy:device=0,Wavefronts:device=0,VALUBusy:device=0,L2CacheHit:device=0,MemUnitBusy:device=0

The device=0 suffix selects GPU 0. Adjust the metric names and the device IDs to match the metrics you need and the GPUs you want to collect them from.
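
When the same metrics are wanted on several GPUs, the list becomes long to type by hand. The following is a minimal sketch that builds the value; the function name rocm_events and its argument convention are hypothetical, and only the metric:device=N syntax comes from the example above:

```shell
# Hypothetical helper: rocm_events "<comma-separated metrics>" <device id>...
# Prints metric:device=N for every metric/device pair, comma-separated.
# The body runs in a subshell so that IFS and "set -f" do not leak out.
rocm_events() (
    metrics=$1
    shift
    set -f      # names such as TCC_HIT[0] must not be glob-expanded
    IFS=,       # split the metric list on commas
    out=""
    for dev in "$@"; do
        for metric in $metrics; do
            out="${out:+$out,}${metric}:device=${dev}"
        done
    done
    printf '%s\n' "$out"
)

# Line that could be pasted into the configuration file for GPUs 0 and 1.
echo "OMNITRACE_ROCM_EVENTS = $(rocm_events "GPUBusy,Wavefronts" 0 1)"
```

Whether a given metric is available on a given device still has to be confirmed with omnitrace-avail.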

  • List the available components:
omnitrace-avail -C

|---------------------------------|---------------|
|            COMPONENT            |   AVAILABLE   |
|---------------------------------|---------------|
| allinea_map                     |     false     |
| caliper_marker                  |     false     |
| caliper_config                  |     false     |
| caliper_loop_marker             |     false     |
| cpu_clock                       |     true      |
| cpu_util                        |     true      |
| craypat_counters                |     false     |
| craypat_flush_buffer            |     false     |
| craypat_heap_stats              |     false     |
| craypat_record                  |     false     |
| craypat_region                  |     false     |
| current_peak_rss                |     true      |
| gperftools_cpu_profiler         |     false     |
| gperftools_heap_profiler        |     false     |
| hip_event                       |     false     |
| kernel_mode_time                |     true      |
| likwid_marker                   |     false     |
| likwid_nvmarker                 |     false     |
| malloc_gotcha                   |     true      |
| memory_allocations              |     true      |
| monotonic_clock                 |     true      |
| monotonic_raw_clock             |     true      |
| network_stats                   |     true      |
| num_io_in                       |     true      |
| num_io_out                      |     true      |
| num_major_page_faults           |     true      |
| num_minor_page_faults           |     true      |
| page_rss                        |     true      |
| papi_array<12ul>                |     true      |
| papi_vector                     |     true      |
| peak_rss                        |     true      |
| perfetto_trace                  |     false     |
| priority_context_switch         |     true      |
| process_cpu_clock               |     true      |
| process_cpu_util                |     true      |
| read_bytes                      |     true      |
| read_char                       |     true      |
| roctx_marker                    |     false     |
| system_clock                    |     true      |
| tau_marker                      |     false     |
| thread_cpu_clock                |     true      |
| thread_cpu_util                 |     true      |
| timestamp                       |     true      |
| trip_count                      |     true      |
| user_clock                      |     true      |
| user_mode_time                  |     true      |
| virtual_memory                  |     true      |
| voluntary_context_switch        |     true      |
| vtune_event                     |     false     |
| vtune_frame                     |     false     |
| vtune_profiler                  |     false     |
| wall_clock                      |     true      |
| written_bytes                   |     true      |
| written_char                    |     true      |
| roctracer                       |     true      |
| rocprofiler                     |     true      |
| sampling_wall_clock             |     true      |
| sampling_cpu_clock              |     true      |
| sampling_percent                |     true      |
| sampling_gpu_power              |     true      |
| sampling_gpu_temp               |     true      |
| sampling_gpu_busy               |     true      |
| sampling_gpu_memory_usage       |     true      |
|---------------------------------|---------------|

Execution in Sampling mode

omnitrace-sample --help
...

Options:
    -h, -?, --help                 Shows this page

    [DEBUG OPTIONS]

    --monochrome                   Disable colorized output
    --debug                        Debug output
    -v, --verbose                  Verbose output

    [GENERAL OPTIONS]

    -c, --config                   Configuration file
    -o, --output                   Output path. Accepts 1-2 parameters corresponding to the output path and the output prefix
    -T, --trace                    Generate a detailed trace (perfetto output)
    -P, --profile                  Generate a call-stack-based profile (conflicts with --flat-profile)
    -F, --flat-profile             Generate a flat profile (conflicts with --profile)
    -H, --host                     Enable sampling host-based metrics for the process. E.g. CPU frequency, memory usage, etc.
    -D, --device                   Enable sampling device-based metrics for the process. E.g. GPU temperature, memory usage, etc.

    [TRACING OPTIONS]

    --trace-file                   Specify the trace output filename. Relative filepath will be with respect to output path and output prefix.
    --trace-buffer-size            Size limit for the trace output (in KB)
    --trace-fill-policy [ discard | ring_buffer ]

                                   Policy for new data when the buffer size limit is reached:
                                       - discard     : new data is ignored
                                       - ring_buffer : new data overwrites oldest data

    [PROFILE OPTIONS]

    --profile-format [ console | json | text ]
                                   Data formats for profiling results
    --profile-diff                 Generate a diff output b/t the profile collected and an existing profile from another run Accepts 1-2 parameters
                                   corresponding to the input path and the input prefix

    [HOST/DEVICE (PROCESS SAMPLING) OPTIONS]


    --process-freq                 Set the default host/device sampling frequency (number of interrupts per second)
    --process-wait                 Set the default wait time (i.e. delay) before taking first host/device sample (in seconds of realtime)
    --process-duration             Set the duration of the host/device sampling (in seconds of realtime)
    --cpus                         CPU IDs for frequency sampling. Supports integers and/or ranges
    --gpus                         GPU IDs for SMI queries. Supports integers and/or ranges

    [GENERAL SAMPLING OPTIONS]

    -f, --freq                     Set the default sampling frequency (number of interrupts per second)
    -w, --wait                     Set the default wait time (i.e. delay) before taking first sample (in seconds). This delay time is based on the clock
                                   of the sampler, i.e., a delay of 1 second for CPU-clock sampler may not equal 1 second of realtime
    -d, --duration                 Set the duration of the sampling (in seconds of realtime). I.e., it is possible (currently) to set a CPU-clock time
                                   delay that exceeds the real-time duration... resulting in zero samples being taken
    -t, --tids                     Specify the default thread IDs for sampling, where 0 (zero) is the main thread and each thread created by the target
                                   application is assigned an atomically incrementing value.

    [SAMPLING TIMER OPTIONS]

    --cputime                      Sample based on a CPU-clock timer (default). Accepts zero or more arguments:
                                       0. Enables sampling based on CPU-clock timer.
                                       1. Interrupts per second. E.g., 100 == sample every 10 milliseconds of CPU-time.
                                       2. Delay (in seconds of CPU-clock time). I.e., how long each thread should wait before taking first sample.
                                       3+ Thread IDs to target for sampling, starting at 0 (the main thread).
                                          May be specified as index or range, e.g., '0 2-4' will be interpreted as:
                                             sample the main thread (0), do not sample the first child thread but sample the 2nd, 3rd, and 4th child threads
    --realtime                     Sample based on a real-clock timer. Accepts zero or more arguments:
                                       0. Enables sampling based on real-clock timer.
                                       1. Interrupts per second. E.g., 100 == sample every 10 milliseconds of realtime.
                                       2. Delay (in seconds of real-clock time). I.e., how long each thread should wait before taking first sample.
                                       3+ Thread IDs to target for sampling, starting at 0 (the main thread).
                                          May be specified as index or range, e.g., '0 2-4' will be interpreted as:
                                             sample the main thread (0), do not sample the first child thread but sample the 2nd, 3rd, and 4th child threads
                                          When sampling with a real-clock timer, please note that enabling this will cause threads which are typically "idle"
                                          to consume more resources since, while idle, the real-clock time increases (and therefore triggers taking samples)
                                          whereas the CPU-clock time does not.

    [BACKEND OPTIONS]  (These options control region information captured w/o sampling or instrumentation)

    -I, --include [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ]
                                   Include data from these backends
    -E, --exclude [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ]
                                   Exclude data from these backends

    [HARDWARE COUNTER OPTIONS]

    -C, --cpu-events               Set the CPU hardware counter events to record (ref: `omnitrace-avail -H -c CPU`)
    -G, --gpu-events               Set the GPU hardware counter events to record (ref: `omnitrace-avail -H -c GPU`)

    [MISCELLANEOUS OPTIONS]

    -i, --inlines                  Include inline info in output when available
    --hsa-interrupt [ 0 | 1 ]      Set the value of the HSA_ENABLE_INTERRUPT environment variable.
                                     ROCm version 5.2 and older have a bug which will cause a deadlock if a sample is taken while waiting for the signal
                                     that a kernel completed -- which happens when sampling with a real-clock timer. We require this option to be set to
                                     when --realtime is specified to make users aware that, while this may fix the bug, it can have a negative impact on
                                     performance.
                                     Values:
                                       0     avoid triggering the bug, potentially at the cost of reduced performance
                                       1     do not modify how ROCm is notified about kernel completion
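
Two details of the help text above are easy to misread: how the interrupt frequency maps to a sampling interval, and how a thread ID specification such as '0 2-4' is interpreted. The sketch below only mirrors the wording of the help text; the function names are hypothetical, and this is not a description of how omnitrace-sample parses its arguments internally:

```shell
# Hypothetical helper: sampling interval in milliseconds for a given
# number of interrupts per second (integer arithmetic, rounds down).
interval_ms() {
    echo $((1000 / $1))
}

# Hypothetical helper: expand indices and ranges ("0 2-4") into thread IDs.
# The body runs in a subshell so its variables stay local.
expand_tids() (
    out=""
    for spec in "$@"; do
        case $spec in
            *-*) first=${spec%-*}; last=${spec#*-} ;;   # range, e.g. 2-4
            *)   first=$spec; last=$spec ;;             # single index
        esac
        while [ "$first" -le "$last" ]; do
            out="${out:+$out }$first"
            first=$((first + 1))
        done
    done
    printf '%s\n' "$out"
)

interval_ms 100      # prints 10
expand_tids 0 2-4    # prints 0 2 3 4 (the first child thread, 1, is skipped)
```

With the default OMNITRACE_SAMPLING_FREQ of 100 shown in the configuration listing, this corresponds to one sample every 10 milliseconds of the sampler's clock.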


srun -n 1 --gpus 1 omnitrace-sample -- ./MatrixTranspose
[omnitrace][omnitrace_init_tooling] Instrumentation mode: Sampling


      ______   .___  ___. .__   __.  __  .___________..______          ___       ______  _______
     /  __  \  |   \/   | |  \ |  | |  | |           ||   _  \        /   \     /      ||   ____|
    |  |  |  | |  \  /  | |   \|  | |  | `---|  |----`|  |_)  |      /  ^  \   |  ,----'|  |__
    |  |  |  | |  |\/|  | |  . `  | |  |     |  |     |      /      /  /_\  \  |  |     |   __|
    |  `--'  | |  |  |  | |  |\   | |  |     |  |     |  |\  \----./  _____  \ |  `----.|  |____
     \______/  |__|  |__| |__| \__| |__|     |__|     | _| `._____/__/     \__\ \______||_______|


[omnitrace] /proc/sys/kernel/perf_event_paranoid has a value of 3. Disabling PAPI (requires a value <= 1)...
[omnitrace] In order to enable PAPI support, run 'echo N | sudo tee /proc/sys/kernel/perf_event_paranoid' where N is < 2
[609.773]       perfetto.cc:55910 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
Device name
Device name
[omnitrace][107567][2042][hip_activity_callback]    1 :: Discarding kernel roctracer activity record which ended before it started :: CopyHostToDevice     :: CopyHostToDevice     :: cid=7, time_ns=(322609988575074:322609987794326) delta=-780748, device=0, queue=0, pid=0, tid=0, op=COPY
PASSED!
[omnitrace][107567][0][omnitrace_finalize]
[omnitrace][107567][0][omnitrace_finalize] finalizing...
[omnitrace][107567][0][omnitrace_finalize]
[omnitrace][107567][0][omnitrace_finalize] omnitrace/process/107567 : 1.225706 sec wall_clock,  218.244 MB peak_rss,  211.304 MB page_rss, 0.210000 sec cpu_clock,   17.1 % cpu_util [laps: 1]
[omnitrace][107567][0][omnitrace_finalize] omnitrace/process/107567/thread/0 : 1.221656 sec wall_clock, 0.209007 sec thread_cpu_clock,   17.1 % thread_cpu_util,  217.820 MB peak_rss [laps: 2]
[omnitrace][107567][0][omnitrace_finalize]
[omnitrace][107567][0][omnitrace_finalize] Finalizing perfetto...
[omnitrace][107567][perfetto]> Outputting '/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/omnitrace-MatrixTranspose-output/2022-10-10_15.28/perfetto-trace.proto' (23.16 KB / 0.02 MB / 0.00 GB)... Done
[omnitrace][107567][cpu_util]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/cpu_util.json'
[omnitrace][107567][cpu_util]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/cpu_util.txt'
[omnitrace][107567][cpu_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/cpu_clock.json'
[omnitrace][107567][cpu_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/cpu_clock.txt'
[omnitrace][107567][page_rss]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/page_rss.json'
[omnitrace][107567][page_rss]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/page_rss.txt'
[omnitrace][107567][roctracer]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/roctracer.json'
[omnitrace][107567][roctracer]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/roctracer.txt'
[omnitrace][107567][wall_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/wall_clock.json'
[omnitrace][107567][wall_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/wall_clock.txt'
[omnitrace][107567][metadata]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/metadata-107567.json' and 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/functions-107567.json'
[omnitrace][107567][0][omnitrace_finalize] Finalized
[611.040]       perfetto.cc:57383 Tracing session 1 ended, total sessions:0

srun -n 1 --gpus 1 omnitrace-sample -H -D -- ./MatrixTranspose
[omnitrace][omnitrace_init_tooling] Instrumentation mode: Sampling


      ______   .___  ___. .__   __.  __  .___________..______          ___       ______  _______
     /  __  \  |   \/   | |  \ |  | |  | |           ||   _  \        /   \     /      ||   ____|
    |  |  |  | |  \  /  | |   \|  | |  | `---|  |----`|  |_)  |      /  ^  \   |  ,----'|  |__
    |  |  |  | |  |\/|  | |  . `  | |  |     |  |     |      /      /  /_\  \  |  |     |   __|
    |  `--'  | |  |  |  | |  |\   | |  |     |  |     |  |\  \----./  _____  \ |  `----.|  |____
     \______/  |__|  |__| |__| \__| |__|     |__|     | _| `._____/__/     \__\ \______||_______|


[omnitrace] /proc/sys/kernel/perf_event_paranoid has a value of 3. Disabling PAPI (requires a value <= 1)...
[omnitrace] In order to enable PAPI support, run 'echo N | sudo tee /proc/sys/kernel/perf_event_paranoid' where N is < 2
[619.733]       perfetto.cc:55910 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
Device name
Device name
[omnitrace][9441][2041][hip_activity_callback]    1 :: Discarding kernel roctracer activity record which ended before it started :: CopyHostToDevice     :: CopyHostToDevice     :: cid=7, time_ns=(642619960562938:642619959822068) delta=-740870, device=0, queue=0, pid=0, tid=0, op=COPY
PASSED!
[omnitrace][9441][0][omnitrace_finalize]
[omnitrace][9441][0][omnitrace_finalize] finalizing...
[omnitrace][9441][0][omnitrace_finalize]
[omnitrace][9441][0][omnitrace_finalize] omnitrace/process/9441 : 1.234547 sec wall_clock,  218.056 MB peak_rss,  211.100 MB page_rss, 0.280000 sec cpu_clock,   22.7 % cpu_util [laps: 1]
[omnitrace][9441][0][omnitrace_finalize] omnitrace/process/9441/thread/0 : 1.229435 sec wall_clock, 0.211211 sec thread_cpu_clock,   17.2 % thread_cpu_util,  217.644 MB peak_rss [laps: 2]
[omnitrace][9441][0][omnitrace_finalize]
[omnitrace][9441][0][omnitrace_finalize] Finalizing perfetto...
[omnitrace][9441][perfetto]> Outputting '/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/omnitrace-MatrixTranspose-output/2022-10-18_22.52/perfetto-trace.proto' (112.13 KB / 0.11 MB / 0.00 GB)... Done
[omnitrace][9441][cpu_util]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/cpu_util.json'
[omnitrace][9441][cpu_util]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/cpu_util.txt'
[omnitrace][9441][cpu_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/cpu_clock.json'
[omnitrace][9441][cpu_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/cpu_clock.txt'
[omnitrace][9441][page_rss]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/page_rss.json'
[omnitrace][9441][page_rss]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/page_rss.txt'
[omnitrace][9441][roctracer]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/roctracer.json'
[omnitrace][9441][roctracer]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/roctracer.txt'
[omnitrace][9441][wall_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/wall_clock.json'
[omnitrace][9441][wall_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/wall_clock.txt'
[omnitrace][9441][metadata]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/metadata-9441.json' and 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/functions-9441.json'
[omnitrace][9441][0][omnitrace_finalize] Finalized
[621.025]       perfetto.cc:57383 Tracing session 1 ended, total sessions:0

Declaring the metrics on the command line

srun -n 1 --gpus 1 omnitrace-sample -TPHD -G "GPUBusy:device=0,Wavefronts:device=0,VALUBusy:device=0,L2CacheHit:device=0,MemUnitBusy:device=0" -- ./MatrixTranspose
[omnitrace][omnitrace_init_tooling] Instrumentation mode: Sampling


      ______   .___  ___. .__   __.  __  .___________..______          ___       ______  _______
     /  __  \  |   \/   | |  \ |  | |  | |           ||   _  \        /   \     /      ||   ____|
    |  |  |  | |  \  /  | |   \|  | |  | `---|  |----`|  |_)  |      /  ^  \   |  ,----'|  |__
    |  |  |  | |  |\/|  | |  . `  | |  |     |  |     |      /      /  /_\  \  |  |     |   __|
    |  `--'  | |  |  |  | |  |\   | |  |     |  |     |  |\  \----./  _____  \ |  `----.|  |____
     \______/  |__|  |__| |__| \__| |__|     |__|     | _| `._____/__/     \__\ \______||_______|


[omnitrace] /proc/sys/kernel/perf_event_paranoid has a value of 3. Disabling PAPI (requires a value <= 1)...
[omnitrace] In order to enable PAPI support, run 'echo N | sudo tee /proc/sys/kernel/perf_event_paranoid' where N is < 2
[681.799]       perfetto.cc:55910 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
Device name
Device name
[omnitrace][31487][2041][hip_activity_callback]    1 :: Discarding kernel roctracer activity record which ended before it started :: CopyHostToDevice     :: CopyHostToDevice     :: cid=7, time_ns=(646682055581903:646682054882615) delta=-699288, device=0, queue=0, pid=0, tid=0, op=COPY
PASSED!
[omnitrace][31487][0][omnitrace_finalize]
[omnitrace][31487][0][omnitrace_finalize] finalizing...
[omnitrace][31487][0][omnitrace_finalize]
[omnitrace][31487][0][omnitrace_finalize] omnitrace/process/31487 : 1.266008 sec wall_clock,  224.308 MB peak_rss,  217.502 MB page_rss, 0.320000 sec cpu_clock,   25.3 % cpu_util [laps: 1]
[omnitrace][31487][0][omnitrace_finalize] omnitrace/process/31487/thread/0 : 1.259163 sec wall_clock, 0.236662 sec thread_cpu_clock,   18.8 % thread_cpu_util,  223.840 MB peak_rss [laps: 2]
[omnitrace][31487][0][omnitrace_finalize]
[omnitrace][31487][0][omnitrace_finalize] Finalizing perfetto...
[omnitrace][31487][perfetto]> Outputting '/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/omnitrace-MatrixTranspose-output/2022-10-19_00.00/perfetto-trace.proto' (218.72 KB / 0.22 MB / 0.00 GB)... Done
[omnitrace][31487][rocprof-device-0-GPUBusy]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-GPUBusy.json'
[omnitrace][31487][rocprof-device-0-GPUBusy]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-GPUBusy.txt'
[omnitrace][31487][rocprof-device-0-Wavefronts]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-Wavefronts.json'
[omnitrace][31487][rocprof-device-0-Wavefronts]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-Wavefronts.txt'
[omnitrace][31487][rocprof-device-0-VALUBusy]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-VALUBusy.json'
[omnitrace][31487][rocprof-device-0-VALUBusy]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-VALUBusy.txt'
[omnitrace][31487][rocprof-device-0-L2CacheHit]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-L2CacheHit.json'
[omnitrace][31487][rocprof-device-0-L2CacheHit]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-L2CacheHit.txt'
[omnitrace][31487][rocprof-device-0-MemUnitBusy]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-MemUnitBusy.json'
[omnitrace][31487][rocprof-device-0-MemUnitBusy]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-MemUnitBusy.txt'
[omnitrace][31487][cpu_util]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/cpu_util.json'
[omnitrace][31487][cpu_util]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/cpu_util.txt'
[omnitrace][31487][cpu_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/cpu_clock.json'
[omnitrace][31487][cpu_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/cpu_clock.txt'
[omnitrace][31487][page_rss]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/page_rss.json'
[omnitrace][31487][page_rss]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/page_rss.txt'
[omnitrace][31487][roctracer]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/roctracer.json'
[omnitrace][31487][roctracer]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/roctracer.txt'
[omnitrace][31487][sampling_gpu_memory_usage]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/sampling_gpu_memory_usage.json'
[omnitrace][31487][sampling_gpu_memory_usage]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/sampling_gpu_memory_usage.txt'
[omnitrace][31487][sampling_gpu_power]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/sampling_gpu_power.json'
[omnitrace][31487][sampling_gpu_power]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/sampling_gpu_power.txt'
[omnitrace][31487][sampling_gpu_temperature]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/sampling_gpu_temperature.json'
[omnitrace][31487][sampling_gpu_temperature]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/sampling_gpu_temperature.txt'
[omnitrace][31487][sampling_gpu_busy_percent]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/sampling_gpu_busy_percent.json'
[omnitrace][31487][sampling_gpu_busy_percent]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/sampling_gpu_busy_percent.txt'
[omnitrace][31487][wall_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/wall_clock.json'
[omnitrace][31487][wall_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/wall_clock.txt'
[omnitrace][31487][metadata]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/metadata-31487.json' and 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/functions-31487.json'
[omnitrace][31487][0][omnitrace_finalize] Finalized
[683.153]       perfetto.cc:57383 Tracing session 1 ended, total sessions:0
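
Each run above writes its files into a new timestamped directory under omnitrace-MatrixTranspose-output. The following is a minimal sketch for locating the most recent one; the helper name latest_run is hypothetical, and it assumes the YYYY-MM-DD_HH.MM directory naming seen in the logs, which sorts chronologically as plain text:

```shell
# Hypothetical helper: print the newest timestamped sub-directory of the
# given output directory. Relies on the shell returning glob matches in
# sorted order, so the last match is the latest timestamp.
latest_run() (
    cd "$1" 2>/dev/null || exit 1
    latest=""
    for dir in */; do
        [ -d "$dir" ] && latest=${dir%/}
    done
    [ -n "$latest" ] && printf '%s/%s\n' "$1" "$latest"
)

# List the files of the latest run, if any output exists in this directory.
run_dir=$(latest_run omnitrace-MatrixTranspose-output) \
    && ls "$run_dir" \
    || echo "no Omnitrace output directory found here"
```

The listing should contain the perfetto-trace.proto file together with the .json and .txt files named in the log of the corresponding run.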

Selective Instrumentation