# Omnitrace
## Setup
* Download the most suitable version from here:
https://github.com/AMDResearch/omnitrace/releases
* Create the installation directory
* ./[omnitrace-1.7.0-opensuse-15.3-ROCm-50200-PAPI-OMPT-Python3.sh](https://github.com/AMDResearch/omnitrace/releases/download/v1.7.0/omnitrace-1.7.0-opensuse-15.3-ROCm-50200-PAPI-OMPT-Python3.sh) --prefix=/opt/omnitrace --exclude-subdir
Full documentation: https://amdresearch.github.io/omnitrace/
## Checking installation
### Setup the environment
`source omnitrace_installation_path/share/omnitrace/setup-env.sh`
### Checking omnitrace
`omnitrace --help`
`omnitrace-avail --help`
*Note:* You may need to set `LD_LIBRARY_PATH`, and do not forget to load the ROCm module.
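A quick way to confirm that the environment is set up is to check which Omnitrace executables are visible on `PATH`. The following is a minimal sketch; the function name `check_omnitrace_tools` is made up for illustration, and it only inspects the shell environment without running Omnitrace itself.

```shell
# Report which Omnitrace executables are visible on PATH.
# check_omnitrace_tools is a hypothetical helper, not part of Omnitrace.
check_omnitrace_tools() {
    missing=0
    for tool in omnitrace omnitrace-avail omnitrace-sample; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool -> $(command -v "$tool")"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    # LD_LIBRARY_PATH may need to include the Omnitrace and ROCm library directories.
    echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-<unset>}"
    return "$missing"
}

check_omnitrace_tools || echo "Source setup-env.sh and load the ROCm module, then retry."
```

If any tool is reported as missing, source `setup-env.sh` again and check that the ROCm module is loaded.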
## Instrumentation modes
* Runtime instrumentation (dynamic binary instrumentation)
* It can instrument a large amount of code and add significant overhead if the user does not restrict what is instrumented
* Sampling instrumentation
* Use the omnitrace-sample
* Attaching to a process (-p )
* Attaching to an already running process can have similar issues as above
* Binary rewriting (-o )
* It will not instrument the dynamically-linked libraries. Consequently, this mode performs the instrumentation significantly faster and has a much lower overhead when running the instrumented executable
* Binary rewriting is the recommended mode when the target executable uses process-level parallelism (e.g. MPI)
* Binary rewriting with dynamic libraries: https://amdresearch.github.io/omnitrace/instrumenting.html#binary-rewriting-a-library
## Runtime instrumentation
`omnitrace <omnitrace-options> -- <exe> [<exe-options>]`
## Sampling instrumentation
`omnitrace-sample <omnitrace-options> -- <exe> [<exe-options>]`
## Binary Rewrite
`omnitrace <omnitrace-options> -o <name-of-new-exe-or-library> -- <exe-or-library>`
## Application-Example
### Download HIP examples:
`git clone https://github.com/ROCm-Developer-Tools/HIP.git`
`cd HIP/samples/2_Cookbook/0_MatrixTranspose`
Let's look at the main content of the example:
```
// Device (Kernel) function, it must be void
__global__ void matrixTranspose(float* out, float* in, const int width) {
int x = blockDim.x * blockIdx.x + threadIdx.x;
int y = blockDim.y * blockIdx.y + threadIdx.y;
out[y * width + x] = in[x * width + y];
}
```
```
// allocate the memory on the device side
hipMalloc((void**)&gpuMatrix, NUM * sizeof(float));
hipMalloc((void**)&gpuTransposeMatrix, NUM * sizeof(float));
// Memory transfer from host to device
hipMemcpy(gpuMatrix, Matrix, NUM * sizeof(float), hipMemcpyHostToDevice);
std::cout << "Device name " << devProp.name << std::endl;
hipMemcpy(gpuMatrix, Matrix, NUM * sizeof(float), hipMemcpyHostToDevice);
// Launching kernel from host
hipLaunchKernelGGL(matrixTranspose, dim3(WIDTH / THREADS_PER_BLOCK_X, WIDTH / THREADS_PER_BLOCK_Y),
dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y), 0, 0, gpuTransposeMatrix,
gpuMatrix, WIDTH);
// Memory transfer from device to host
hipMemcpy(TransposeMatrix, gpuTransposeMatrix, NUM * sizeof(float), hipMemcpyDeviceToHost);
// CPU MatrixTranspose computation
matrixTransposeCPUReference(cpuTransposeMatrix, Matrix, WIDTH);
```
### Reserve 1 GPU
`salloc -N 1 -A project_462000075 -p gpu -n 1 --gpus 1 -t 01:00:00`
`module load rocm/5.0.2`
### Compile and Execute
#### Compile and Execute MatrixTranspose example:
* Comment out the `test` dependency on line 41 of the Makefile so that `make` does not execute the binary:
`all: $(EXECUTABLE) test` -> `all: $(EXECUTABLE) #test`
* Compile
`make`
* Execute
`srun -n 1 ./MatrixTranspose `
Error: `srun -n 1 ./MatrixTranspose
"hipErrorNoBinaryForGpu: Unable to find code object for all current devices!"
srun: error: nid005660: task 0: Aborted
srun: launch/slurm: _step_signal: Terminating StepId=1731160.5`
* Solution:
* We need to compile with the flag `--offload-arch=gfx90a`, so in the Makefile change `CXXFLAGS =-g` to `CXXFLAGS =-g --offload-arch=gfx90a`. You can also add other variables, such as `HIPFLAGS`, and adjust the Makefile accordingly
* Compile on the compute node `srun -n 1 make`
It is better to define the flags explicitly. If you compile with the Cray compiler, loading the module `craype-accel-amd-gfx90a` is enough.
* Timing:
`srun -n 1 ./MatrixTranspose
PASSED!`
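The two Makefile changes described above (disabling the `test` dependency and adding `--offload-arch=gfx90a`) could also be applied with `sed`. This is only a sketch: it works on a throwaway two-line excerpt rather than the real Makefile, and it assumes the lines look exactly as quoted in the text.

```shell
# Work on a throwaway excerpt so the real Makefile stays untouched.
demo_dir=$(mktemp -d)
cat > "$demo_dir/Makefile" <<'EOF'
CXXFLAGS =-g
all: $(EXECUTABLE) test
EOF

# 1) Add the gfx90a offload target; 2) stop `make` from running the test target.
sed -e 's/^CXXFLAGS =-g$/CXXFLAGS =-g --offload-arch=gfx90a/' \
    -e 's/^all: \$(EXECUTABLE) test$/all: $(EXECUTABLE) #test/' \
    "$demo_dir/Makefile" > "$demo_dir/Makefile.new"

cat "$demo_dir/Makefile.new"
```

Once the result looks right, the same two `sed` expressions can be run against the real Makefile, ideally after making a backup copy.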
## Check Omnitrace configuration
`srun -n 1 omnitrace-avail --categories omnitrace --brief --description`
[omnitrace] /proc/sys/kernel/perf_event_paranoid has a value of 3. Disabling PAPI (requires a value <= 1)...
[omnitrace] In order to enable PAPI support, run 'echo N | sudo tee /proc/sys/kernel/perf_event_paranoid' where N is < 2

`srun -n 1 omnitrace-avail --categories omnitrace`
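The PAPI message shown above depends on the kernel setting in `/proc/sys/kernel/perf_event_paranoid`. The sketch below reads that value and reports whether it satisfies the `<= 1` requirement quoted in the message; `papi_status` is a hypothetical helper name, and the check should be run on the compute node (for example through `srun`), since login and compute nodes may differ.

```shell
# Decide whether PAPI counters are usable for a given perf_event_paranoid value.
# The "<= 1" threshold is taken from the Omnitrace message quoted above.
papi_status() {
    if [ "$1" -le 1 ]; then
        echo "PAPI usable (perf_event_paranoid=$1)"
    else
        echo "PAPI disabled (perf_event_paranoid=$1, needs <= 1)"
    fi
}

paranoid_file=/proc/sys/kernel/perf_event_paranoid
if [ -r "$paranoid_file" ]; then
    papi_status "$(cat "$paranoid_file")"
else
    echo "cannot read $paranoid_file on this system"
fi
```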

## Create a configuration file
* All the options with some default settings
`srun -n 1 omnitrace-avail -G omnitrace.cfg`
* All the options with description included
`srun -n 1 omnitrace-avail -G omnitrace.cfg --all`
In order to define which configuration file will be used, declare:
`export OMNITRACE_CONFIG_FILE=/path/name.cfg`
## Execution in Dynamic instrumentation mode
* Load the appropriate modules
```
srun -n 1 omnitrace --help
[omnitrace][exe]
[omnitrace][exe] command :: ''...
[omnitrace][exe]
[omnitrace] Usage: /pfs/lustrep4/scratch/project_462000075/markoman/omnitrace17/bin/omnitrace
[ --help (count: 0, dtype: bool)
--verbose (max: 1, dtype: bool)
--error (max: 1, dtype: boolean)
--debug (max: 1, dtype: bool)
--log (count: 1)
--log-file (count: 1)
--simulate (max: 1, dtype: bool)
--print-format (min: 1, dtype: string)
--print-dir (count: 1, dtype: string)
--print-available (count: 1)
--print-instrumented (count: 1)
--print-coverage (count: 1)
--print-excluded (count: 1)
--print-overlapping (count: 1)
--print-instructions (max: 1, dtype: bool)
--output (min: 0, dtype: string)
--pid (count: 1, dtype: int)
--mode (count: 1)
--force (max: 1, dtype: bool)
--command (count: 1)
--prefer (count: 1)
--library (count: unlimited)
--main-function (count: 1)
--load (count: unlimited, dtype: string)
--load-instr (count: unlimited, dtype: filepath)
--init-functions (count: unlimited, dtype: string)
--fini-functions (count: unlimited, dtype: string)
--all-functions (max: 1, dtype: bool)
--function-include (count: unlimited)
--function-exclude (count: unlimited)
--function-restrict (count: unlimited)
--module-include (count: unlimited)
--module-exclude (count: unlimited)
--module-restrict (count: unlimited)
--label (count: unlimited, dtype: string)
--config (min: 1, dtype: string)
--default-components (count: unlimited, dtype: string)
--env (count: unlimited)
--mpi (max: 1, dtype: bool)
--instrument-loops (max: 1, dtype: boolean)
--min-instructions (count: 1, dtype: int)
--min-address-range (count: 1, dtype: int)
--min-instructions-loop (count: 1, dtype: int)
--min-address-range-loop (count: 1, dtype: int)
--coverage (max: 1, dtype: bool)
--dynamic-callsites (max: 1, dtype: boolean)
--traps (max: 1, dtype: bool)
--loop-traps (max: 1, dtype: bool)
--allow-overlapping (max: 1, dtype: bool)
--parse-all-modules (max: 1, dtype: bool)
--batch-size (count: 1, dtype: int)
--dyninst-rt (min: 1, dtype: filepath)
--dyninst-options (count: unlimited)
] -- <CMD> <ARGS>
```
```
srun -n 1 --gpus 1 omnitrace -- ./MatrixTranspose
[omnitrace][exe]
[omnitrace][exe] command :: '/pfs/lustrep4/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose'...
[omnitrace][exe]
[omnitrace][exe] DYNINST_API_RT: /pfs/lustrep4/scratch/project_462000075/markoman/omnitrace17/lib/omnitrace/libdyninstAPI_RT.so.11.0.1
[omnitrace][exe] Finding instrumentation functions...
[omnitrace][exe] 2 instrumented funcs in libamd_comgr.so.2.4.50002
[omnitrace][exe] 3 instrumented funcs in libamdhip64.so.5.0.50002
[omnitrace][exe] 15 instrumented funcs in libhsa-runtime64.so.1.5.50002
[omnitrace][exe] 2 instrumented funcs in libtinfo.so.6.1
[omnitrace][exe] 3 instrumented funcs in libz.so.1.2.11
[omnitrace][exe] Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.52/instrumentation/available.json'... Done
[omnitrace][exe] Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.52/instrumentation/available.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.52/instrumentation/instrumented.json'... Done
[omnitrace][exe] Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.52/instrumentation/instrumented.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.52/instrumentation/excluded.json'... Done
[omnitrace][exe] Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.52/instrumentation/excluded.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.52/instrumentation/overlapping.json'... Done
[omnitrace][exe] Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.52/instrumentation/overlapping.txt'... Done
[omnitrace][exe] Executing...
[omnitrace][omnitrace_init_tooling] Instrumentation mode: Trace
______ .___ ___. .__ __. __ .___________..______ ___ ______ _______
/ __ \ | \/ | | \ | | | | | || _ \ / \ / || ____|
| | | | | \ / | | \| | | | `---| |----`| |_) | / ^ \ | ,----'| |__
| | | | | |\/| | | . ` | | | | | | / / /_\ \ | | | __|
| `--' | | | | | | |\ | | | | | | |\ \----./ _____ \ | `----.| |____
\______/ |__| |__| |__| \__| |__| |__| | _| `._____/__/ \__\ \______||_______|
[omnitrace] /proc/sys/kernel/perf_event_paranoid has a value of 3. Disabling PAPI (requires a value <= 1)...
[omnitrace] In order to enable PAPI support, run 'echo N | sudo tee /proc/sys/kernel/perf_event_paranoid' where N is < 2
[706.292] perfetto.cc:55910 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
Device name
Device name
[omnitrace][118151][2042][hip_activity_callback] 1 :: Discarding kernel roctracer activity record which ended before it started :: CopyHostToDevice :: CopyHostToDevice :: cid=7, time_ns=(255706649483053:255706648575215) delta=-907838, device=0, queue=0, pid=0, tid=0, op=COPY
PASSED!
[omnitrace][118151][0][omnitrace_finalize]
[omnitrace][118151][0][omnitrace_finalize] finalizing...
[omnitrace][118151][0][omnitrace_finalize]
[omnitrace][118151][0][omnitrace_finalize] omnitrace/process/118151 : 0.364673 sec wall_clock, 237.052 MB peak_rss, 230.597 MB page_rss, 0.330000 sec cpu_clock, 90.6 % cpu_util [laps: 1]
[omnitrace][118151][0][omnitrace_finalize] omnitrace/process/118151/thread/0 : 0.364280 sec wall_clock, 0.289480 sec thread_cpu_clock, 79.5 % thread_cpu_util, 237.028 MB peak_rss [laps: 1]
[omnitrace][118151][0][omnitrace_finalize]
[omnitrace][118151][0][omnitrace_finalize] Finalizing perfetto...
[omnitrace][118151][perfetto]> Outputting '/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/omnitrace-MatrixTranspose-output/2022-10-16_22.53/perfetto-trace.proto' (57.18 KB / 0.06 MB / 0.00 GB)... Done
[omnitrace][118151][cpu_util]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/cpu_util.json'
[omnitrace][118151][cpu_util]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/cpu_util.txt'
[omnitrace][118151][cpu_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/cpu_clock.json'
[omnitrace][118151][cpu_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/cpu_clock.txt'
[omnitrace][118151][page_rss]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/page_rss.json'
[omnitrace][118151][page_rss]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/page_rss.txt'
[omnitrace][118151][roctracer]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/roctracer.json'
[omnitrace][118151][roctracer]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/roctracer.txt'
[omnitrace][118151][wall_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/wall_clock.json'
[omnitrace][118151][wall_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/wall_clock.txt'
[omnitrace][118151][metadata]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/metadata-118151.json' and 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/functions-118151.json'
[omnitrace][118151][0][omnitrace_finalize] Finalized
[706.822] perfetto.cc:57383 Tracing session 1 ended, total sessions:0
[omnitrace][exe] End of omnitrace
```
In this case the profiling overhead is more than 120x.
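Overhead figures like this come from comparing the run time of the plain executable with the run time under Omnitrace. A minimal sketch of the arithmetic follows; the helper names and the timings passed to them are hypothetical and only illustrate the calculation, they are not measurements from this example.

```shell
# overhead_factor BASE INSTR: how many times slower the instrumented run is.
overhead_factor() {
    awk -v base="$1" -v instr="$2" 'BEGIN { printf "%.1fx\n", instr / base }'
}

# overhead_percent BASE INSTR: extra run time as a percentage of the baseline.
overhead_percent() {
    awk -v base="$1" -v instr="$2" 'BEGIN { printf "%.0f%%\n", (instr - base) / base * 100 }'
}

# Hypothetical wall-clock times in seconds, for illustration only.
overhead_factor 0.5 60     # prints 120.0x
overhead_percent 2.0 3.0   # prints 50%
```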
## Why so much overhead?
```
nm --demangle MatrixTranspose | egrep -i ' (t|u) '
000000000020d080 t _GLOBAL__sub_I_MatrixTranspose.cpp
000000000020c970 T __device_stub__warmup()
000000000020ca40 T matrixTransposeCPUReference(float*, float*, unsigned int)
000000000020c9c0 T __device_stub__matrixTranspose(float*, float*, int)
U std::ctype<char>::_M_widen_init() const
U std::ostream::put(char)
U std::ostream::flush()
U std::ios_base::Init::Init()
U std::ios_base::Init::~Init()
U std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)
U std::__throw_bad_cast()
U __cxa_atexit
000000000020c930 t __do_global_dtors_aux
U __hipPopCallConfiguration
U __hipPushCallConfiguration
U __hipRegisterFatBinary
U __hipRegisterFunction
U __hipUnregisterFatBinary
000000000020cfd0 t __hip_module_ctor
000000000020d060 t __hip_module_dtor
000000000020d12e T __libc_csu_fini
000000000020d0ae T __libc_csu_init
U __libc_start_main
000000000020d178 t _fini
000000000020d160 t _init
000000000020c890 T _start
000000000020d14e t atexit
000000000020c8c0 t deregister_tm_clones
000000000020c960 t frame_dummy
U free
U hipFree
U hipGetDeviceProperties
U hipLaunchKernel
U hipMalloc
U hipMemcpy
000000000020cb00 T main
U malloc
U printf
U puts
000000000020c8f0 t register_tm_clones
U strlen
```
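With binary rewriting, only the functions in the text section of the executable (the symbols marked `T` or `t` above) are instrumented; the undefined symbols (`U`) live in dynamically-linked libraries.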
Try: `srun -n 1 --gpus 1 omnitrace -v -1 --simulate --print-available functions -- ./MatrixTranspose`
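To get a feeling for how the symbols of an executable split between its own text section and external libraries, the `nm` output can be summarized with `awk`. The sketch below uses a few lines copied from the listing above instead of calling `nm`, so it runs anywhere; `count_symbols` is a hypothetical helper name.

```shell
# A few lines copied from the nm listing above.
nm_excerpt=$(cat <<'EOF'
000000000020ca40 T matrixTransposeCPUReference(float*, float*, unsigned int)
000000000020c9c0 T __device_stub__matrixTranspose(float*, float*, int)
000000000020cb00 T main
000000000020c960 t frame_dummy
                 U hipMalloc
                 U hipMemcpy
                 U hipLaunchKernel
EOF
)

# Count symbols defined in the text section (T/t) versus undefined ones (U).
count_symbols() {
    awk '
        $1 == "U"              { undefined++; next }
        $2 == "T" || $2 == "t" { text++ }
        END { printf "text=%d undefined=%d\n", text, undefined }'
}

echo "$nm_excerpt" | count_symbols   # prints text=4 undefined=3
```

On a real binary the same filter could be fed with `nm --demangle MatrixTranspose | count_symbols`.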
## Custom including/excluding functions
* To include:
`srun -n 1 --gpus 1 omnitrace -v -1 --simulate --print-available functions -I 'function_name1' 'function_name2' -- ./MatrixTranspose`
* To exclude:
`srun -n 1 --gpus 1 omnitrace -v -1 --simulate --print-available functions -E 'function_name1' 'function_name2' -- ./MatrixTranspose`
## Decreasing profiling overhead
* We set the verbosity with `-v`, add the flag `--simulate` to activate simulation mode, and use the option `--print-available functions` to print the functions available for instrumentation together with their instruction counts. For this option we need to rewrite the binary with `-o name`.
`srun -n 1 omnitrace -v -1 --simulate --print-available functions -o matrix.inst -- ./MatrixTranspose`
```
[omnitrace][exe]
[omnitrace][exe] command :: '/pfs/lustrep4/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose'...
[omnitrace][exe]
[omnitrace][exe] Resolved 'libomnitrace-rt.so' to '/pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-rt.so.11.0.1'...
[omnitrace][exe] DYNINST_API_RT: /pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-rt.so.11.0.1
[omnitrace][exe] instrumentation target: /pfs/lustrep4/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose
[omnitrace][exe] Opening '/pfs/lustrep4/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose' for binary rewrite... Done
[omnitrace][exe] Getting the address space image, modules, and procedures...
[omnitrace][exe]
[omnitrace][exe] Found 16 functions in 6 modules in instrumentation target
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/available.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/available.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/overlapping.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/overlapping.txt'... Done
[omnitrace][exe] function: 'main' ... found
[omnitrace][exe] function: 'omnitrace_user_start_trace' ... not found
[omnitrace][exe] function: 'omnitrace_user_stop_trace' ... not found
[omnitrace][exe] function: 'MPI_Init' ... not found
[omnitrace][exe] function: 'MPI_Init_thread' ... not found
[omnitrace][exe] function: 'MPI_Finalize' ... not found
[omnitrace][exe] function: 'MPI_Comm_rank' ... not found
[omnitrace][exe] function: 'MPI_Comm_size' ... not found
[omnitrace][exe] Resolved 'libomnitrace-dl.so' to '/pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-dl.so.1.7.0'...
[omnitrace][exe] loading library: '/pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-dl.so.1.7.0'...
[omnitrace][exe] Finding instrumentation functions...
[omnitrace][exe] function: 'omnitrace_init' ... found
[omnitrace][exe] function: 'omnitrace_finalize' ... found
[omnitrace][exe] function: 'omnitrace_set_env' ... found
[omnitrace][exe] function: 'omnitrace_set_mpi' ... found
[omnitrace][exe] function: 'omnitrace_push_trace' ... found
[omnitrace][exe] function: 'omnitrace_pop_trace' ... found
[omnitrace][exe] function: 'omnitrace_register_source' ... found
[omnitrace][exe] function: 'omnitrace_register_coverage' ... found
[omnitrace][exe] Resolved 'libomnitrace-dl.so' to '/pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-dl.so.1.7.0'...
[omnitrace][exe] Adding main entry snippets...
[omnitrace][exe] Adding main exit snippets...
[omnitrace][exe]
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/available.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/available.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/instrumented.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/instrumented.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/excluded.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/excluded.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/overlapping.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/overlapping.txt'... Done
[available] ../sysdeps/x86_64/crti.S:
[available] [_fini][3]
[available] [_init][7]
[available] ../sysdeps/x86_64/start.S:
[available] [_start][12]
[available] MatrixTranspose:
[available] [__device_stub__matrixTranspose][26]
[available] [__device_stub__warmup][17]
[available] [__do_global_dtors_aux][9]
[available] [__hip_module_ctor][33]
[available] [__hip_module_dtor][8]
[available] [frame_dummy][4]
[available] [targ20d132][1]
[available] MatrixTranspose.cpp:
[available] [_GLOBAL__sub_I_MatrixTranspose.cpp][8]
[available] [main][278]
[available] [matrixTransposeCPUReference][51]
[available] atexit.c:
[available] [atexit][5]
[available] elf-init.c:
[available] [__libc_csu_fini][3]
[available] [__libc_csu_init][36]
```
Note: the number in the brackets `[]` is the number of instructions in the function.
We care about the functions in `MatrixTranspose.cpp`, where the minimum is 8 instructions. We can set the minimum number of instructions with the flag `-i 8`, so we can try the command:
`srun -n 1 omnitrace -v -l --simulate --print-instrumented functions -i 8 -o matrix.inst -- ./MatrixTranspose`
```
[omnitrace][exe]
[omnitrace][exe] command :: '/pfs/lustrep4/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose'...
[omnitrace][exe]
[omnitrace][exe] Resolved 'libomnitrace-rt.so' to '/pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-rt.so.11.0.1'...
[omnitrace][exe] DYNINST_API_RT: /pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-rt.so.11.0.1
[omnitrace][exe] instrumentation target: /pfs/lustrep4/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose
[omnitrace][exe] Opening '/pfs/lustrep4/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose' for binary rewrite... Done
[omnitrace][exe] Getting the address space image, modules, and procedures...
[omnitrace][exe]
[omnitrace][exe] Found 16 functions in 6 modules in instrumentation target
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/available.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/available.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/overlapping.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/overlapping.txt'... Done
[omnitrace][exe] function: 'main' ... found
[omnitrace][exe] function: 'omnitrace_user_start_trace' ... not found
[omnitrace][exe] function: 'omnitrace_user_stop_trace' ... not found
[omnitrace][exe] function: 'MPI_Init' ... not found
[omnitrace][exe] function: 'MPI_Init_thread' ... not found
[omnitrace][exe] function: 'MPI_Finalize' ... not found
[omnitrace][exe] function: 'MPI_Comm_rank' ... not found
[omnitrace][exe] function: 'MPI_Comm_size' ... not found
[omnitrace][exe] Resolved 'libomnitrace-dl.so' to '/pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-dl.so.1.7.0'...
[omnitrace][exe] loading library: '/pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-dl.so.1.7.0'...
[omnitrace][exe] Finding instrumentation functions...
[omnitrace][exe] function: 'omnitrace_init' ... found
[omnitrace][exe] function: 'omnitrace_finalize' ... found
[omnitrace][exe] function: 'omnitrace_set_env' ... found
[omnitrace][exe] function: 'omnitrace_set_mpi' ... found
[omnitrace][exe] function: 'omnitrace_push_trace' ... found
[omnitrace][exe] function: 'omnitrace_pop_trace' ... found
[omnitrace][exe] function: 'omnitrace_register_source' ... found
[omnitrace][exe] function: 'omnitrace_register_coverage' ... found
[omnitrace][exe] Resolved 'libomnitrace-dl.so' to '/pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-dl.so.1.7.0'...
[omnitrace][exe] Adding main entry snippets...
[omnitrace][exe] Adding main exit snippets...
[omnitrace][exe] [function][Instrumenting] no-constraint :: 'matrixTransposeCPUReference'...
[omnitrace][exe] [function][Loop Instrumenting] no-constraint :: 'matrixTransposeCPUReference'...
[omnitrace][exe] 1 instrumented funcs in MatrixTranspose.cpp
[omnitrace][exe] 1 instrumented loops in MatrixTranspose.cpp
[omnitrace][exe]
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/available.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/available.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/instrumented.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/instrumented.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/excluded.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/excluded.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/overlapping.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/overlapping.txt'... Done
[instrumented] MatrixTranspose.cpp:
[instrumented] [main][278]
[instrumented] [matrixTransposeCPUReference][51]
```
We used the option `--print-instrumented functions` to print the instrumented functions and now only two functions are instrumented.
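Before choosing a value for `-i`, it can help to preview which entries of the `[available]` listing reach a given threshold. The sketch below filters the `MatrixTranspose.cpp` entries copied from the output above by their bracketed count; `at_least` is a hypothetical helper, and it only applies the numeric threshold, whereas Omnitrace makes its own decision about what to instrument, so the `[instrumented]` list it reports can differ.

```shell
# Entries for MatrixTranspose.cpp copied from the [available] listing above.
available=$(cat <<'EOF'
[_GLOBAL__sub_I_MatrixTranspose.cpp][8]
[main][278]
[matrixTransposeCPUReference][51]
EOF
)

# at_least N: print the functions whose bracketed count is >= N.
# Splitting on "[" and "]" puts the name in field 2 and the count in field 4.
at_least() {
    echo "$available" | awk -F'[][]' -v min="$1" '$4 + 0 >= min + 0 { print $2 }'
}

at_least 9
```

With a threshold of 9 this prints `main` and `matrixTransposeCPUReference`.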
Let's create the `matrix.inst` file:
`srun -n 1 --gpus 1 omnitrace -i 8 -o matrix.inst -- ./MatrixTranspose`
If we execute now the new binary, called `matrix.inst`:
```
srun -n 1 --gpus 1 ./matrix.inst
[omnitrace][omnitrace_init_tooling] Instrumentation mode: Trace
______ .___ ___. .__ __. __ .___________..______ ___ ______ _______
/ __ \ | \/ | | \ | | | | | || _ \ / \ / || ____|
| | | | | \ / | | \| | | | `---| |----`| |_) | / ^ \ | ,----'| |__
| | | | | |\/| | | . ` | | | | | | / / /_\ \ | | | __|
| `--' | | | | | | |\ | | | | | | |\ \----./ _____ \ | `----.| |____
\______/ |__| |__| |__| \__| |__| |__| | _| `._____/__/ \__\ \______||_______|
[omnitrace] /proc/sys/kernel/perf_event_paranoid has a value of 3. Disabling PAPI (requires a value <= 1)...
[omnitrace] In order to enable PAPI support, run 'echo N | sudo tee /proc/sys/kernel/perf_event_paranoid' where N is < 2
[364.879] perfetto.cc:55910 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
Device name
PASSED!
[omnitrace][128457][0][omnitrace_finalize] finalizing...
[omnitrace][128457][0][omnitrace_finalize] omnitrace/process/128457 : 0.342078 sec wall_clock, 219.956 MB peak_rss, 212.865 MB page_rss, 0.620000 sec cpu_clock, 181.2 % cpu_util [laps: 1]
[omnitrace][128457][0][omnitrace_finalize] omnitrace/process/128457/thread/0 : 0.342018 sec wall_clock, 0.300716 sec thread_cpu_clock, 87.9 % thread_cpu_util, 219.956 MB peak_rss [laps: 1]
[omnitrace][128457][0][omnitrace_finalize] Finalizing perfetto...
[omnitrace][128457][perfetto]> Outputting '/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/omnitrace-matrix.inst-output/perfetto-trace.proto' (993.07 KB / 0.99 MB / 0.00 GB)... Done
[omnitrace][128457][roctracer]> Outputting 'omnitrace-matrix.inst-output/roctracer.json'
[omnitrace][128457][roctracer]> Outputting 'omnitrace-matrix.inst-output/roctracer.txt'
[omnitrace][128457][wall_clock]> Outputting 'omnitrace-matrix.inst-output/wall_clock.json'
[omnitrace][128457][wall_clock]> Outputting 'omnitrace-matrix.inst-output/wall_clock.txt'
[omnitrace][128457][manager::finalize][metadata]> Outputting 'omnitrace-matrix.inst-output/metadata.json' and 'omnitrace-matrix.inst-output/functions.json'
[omnitrace][128457][0][omnitrace_finalize] Finalized
[365.282] perfetto.cc:57383 Tracing session 1 ended, total sessions:0
```
Now the overhead is only around 50%, compared to more than 100x before.
A new directory called `omnitrace-matrix.inst-output` is created with the performance analysis data. We can read some of the text files, for example:
`omnitrace-matrix.inst-output/wall_clock.txt`


* The file omnitrace-matrix.inst-output/perfetto-trace.proto is the one we can visualize with Perfetto (https://ui.perfetto.dev/)
## Visualizing Trace
* Open https://ui.perfetto.dev/ in a browser and load the file `omnitrace-matrix.inst-output/perfetto-trace.proto`
You can zoom in with the W key, zoom out with S, move left with A and move right with D.

Note: We can observe a really long `hipMemcpy`. The reason is that when the GPU has not been used yet, its first use carries a significant overhead.
To avoid this, we add an empty kernel to the code and execute it first, such as:
```
__global__ void warmup() {}
...
warmup<<<1,64>>>();
```
* Create the new instrumented file and visualize:

Now the duration of the `hipMemcpy` is as expected.
## Configuration Files
Omnitrace uses a configuration file in which the user declares the options to use.
* Create a configuration file with the default values: `omnitrace-avail -G omnitrace.cfg`
* Create a configuration file with comments and default values: `omnitrace-avail -G omnitrace.cfg --all`
To select which configuration file is used, set `export OMNITRACE_CONFIG_FILE=/path/omnitrace.cfg`, replacing the path and file name with your own.
### Contents of the configuration file
The default configuration file for v1.7 looks like:
```
OMNITRACE_CONFIG_FILE =
OMNITRACE_USE_PERFETTO = true
OMNITRACE_USE_TIMEMORY = true
OMNITRACE_USE_SAMPLING = false
OMNITRACE_USE_PROCESS_SAMPLING = true
OMNITRACE_USE_ROCTRACER = true
OMNITRACE_USE_ROCM_SMI = true
OMNITRACE_USE_KOKKOSP = false
OMNITRACE_USE_PID = false
OMNITRACE_USE_RCCLP = false
OMNITRACE_USE_ROCPROFILER = true
OMNITRACE_USE_ROCTX = false
OMNITRACE_OUTPUT_PATH = omnitrace-%tag%-output
OMNITRACE_OUTPUT_PREFIX =
OMNITRACE_CRITICAL_TRACE = false
OMNITRACE_PERFETTO_BACKEND = inprocess
OMNITRACE_PERFETTO_BUFFER_SIZE_KB = 1024000
OMNITRACE_PERFETTO_FILL_POLICY = discard
OMNITRACE_PROCESS_SAMPLING_DURATION = -1
OMNITRACE_PROCESS_SAMPLING_FREQ = 0
OMNITRACE_ROCM_EVENTS =
OMNITRACE_SAMPLING_CPUS = 0-3
OMNITRACE_SAMPLING_DELAY = 0.5
OMNITRACE_SAMPLING_DURATION = 0
OMNITRACE_SAMPLING_FREQ = 100
OMNITRACE_SAMPLING_GPUS = all
OMNITRACE_TIME_OUTPUT = false
OMNITRACE_TIMEMORY_COMPONENTS = wall_clock,cpu_clock,page_rss,cpu_util,papi_vector
OMNITRACE_VERBOSE = 0
OMNITRACE_ENABLED = true
OMNITRACE_PAPI_EVENTS =
OMNITRACE_SUPPRESS_CONFIG = false
OMNITRACE_SUPPRESS_PARSING = false
```
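In practice a much shorter file is often enough. The sketch below writes a reduced configuration and points `OMNITRACE_CONFIG_FILE` at it. It assumes that Omnitrace accepts a file listing only a subset of the settings and uses defaults for the rest; the keys are taken from the listing above, and `papi_vector` is left out of the components purely as an example, since PAPI is disabled on this system.

```shell
cfg_dir=$(mktemp -d)
cfg_file="$cfg_dir/omnitrace.cfg"

# Keys and values come from the default listing above; only the component
# list is shortened (papi_vector removed) as an illustration.
cat > "$cfg_file" <<'EOF'
OMNITRACE_USE_PERFETTO = true
OMNITRACE_USE_TIMEMORY = true
OMNITRACE_USE_ROCTRACER = true
OMNITRACE_TIMEMORY_COMPONENTS = wall_clock,cpu_clock,page_rss,cpu_util
EOF

export OMNITRACE_CONFIG_FILE="$cfg_file"
echo "Using configuration: $OMNITRACE_CONFIG_FILE"
```

Any Omnitrace command started from the same shell afterwards would then pick up this file through the environment variable.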
## Metrics
* If Omnitrace reports the message `/proc/sys/kernel/perf_event_paranoid has a value of 3. Disabling PAPI (requires a value <= 1)`, then you cannot access the hardware counters until a system administrator changes the value.
`srun -n 1 --gpus 1 omnitrace-avail --all`
```
...
| GPU | | |
|---------------------------------------|--------------|---------------------------------------|
| SQ_INSTS_VMEM_WR:device=0 | true | Number of VMEM write instructions ... |
| SQ_INSTS_VMEM_RD:device=0 | true | Number of VMEM read instructions i... |
| SQ_INSTS_SALU:device=0 | true | Number of SALU instructions issued... |
| SQ_INSTS_SMEM:device=0 | true | Number of SMEM instructions issued... |
| SQ_INSTS_FLAT:device=0 | true | Number of FLAT instructions issued... |
| SQ_INSTS_FLAT_LDS_ONLY:device=0 | true | Number of FLAT instructions issued... |
| SQ_INSTS_LDS:device=0 | true | Number of LDS instructions issued ... |
| SQ_INSTS_GDS:device=0 | true | Number of GDS instructions issued.... |
| SQ_WAIT_INST_LDS:device=0 | true | Number of wave-cycles spent waitin... |
| SQ_ACTIVE_INST_VALU:device=0 | true | regspec 71? Number of cycles the S... |
| SQ_INST_CYCLES_SALU:device=0 | true | Number of cycles needed to execute... |
| SQ_THREAD_CYCLES_VALU:device=0 | true | Number of thread-cycles used to ex... |
| SQ_LDS_BANK_CONFLICT:device=0 | true | Number of cycles LDS is stalled by... |
| TCC_HIT[0]:device=0 | true | Number of cache hits. |
| TCC_HIT[1]:device=0 | true | Number of cache hits. |
| TCC_HIT[2]:device=0 | true | Number of cache hits. |
| TCC_HIT[3]:device=0 | true | Number of cache hits. |
| TCC_HIT[4]:device=0 | true | Number of cache hits. |
| TCC_HIT[5]:device=0 | true | Number of cache hits. |
| TCC_HIT[6]:device=0 | true | Number of cache hits. |
...
```
* Define the metrics in the cfg file, for example:
`OMNITRACE_ROCM_EVENTS = GPUBusy:device=0,Wavefronts:device=0,VALUBusy:device=0,L2CacheHit:device=0,MemUnitBusy:device=0`
The `0` refers to GPU 0; adjust the metrics and the device IDs to the GPUs you want to monitor.
* List of components
```
omnitrace-avail -C
|---------------------------------|---------------|
| COMPONENT | AVAILABLE |
|---------------------------------|---------------|
| allinea_map | false |
| caliper_marker | false |
| caliper_config | false |
| caliper_loop_marker | false |
| cpu_clock | true |
| cpu_util | true |
| craypat_counters | false |
| craypat_flush_buffer | false |
| craypat_heap_stats | false |
| craypat_record | false |
| craypat_region | false |
| current_peak_rss | true |
| gperftools_cpu_profiler | false |
| gperftools_heap_profiler | false |
| hip_event | false |
| kernel_mode_time | true |
| likwid_marker | false |
| likwid_nvmarker | false |
| malloc_gotcha | true |
| memory_allocations | true |
| monotonic_clock | true |
| monotonic_raw_clock | true |
| network_stats | true |
| num_io_in | true |
| num_io_out | true |
| num_major_page_faults | true |
| num_minor_page_faults | true |
| page_rss | true |
| papi_array<12ul> | true |
| papi_vector | true |
| peak_rss | true |
| perfetto_trace | false |
| priority_context_switch | true |
| process_cpu_clock | true |
| process_cpu_util | true |
| read_bytes | true |
| read_char | true |
| roctx_marker | false |
| system_clock | true |
| tau_marker | false |
| thread_cpu_clock | true |
| thread_cpu_util | true |
| timestamp | true |
| trip_count | true |
| user_clock | true |
| user_mode_time | true |
| virtual_memory | true |
| voluntary_context_switch | true |
| vtune_event | false |
| vtune_frame | false |
| vtune_profiler | false |
| wall_clock | true |
| written_bytes | true |
| written_char | true |
| roctracer | true |
| rocprofiler | true |
| sampling_wall_clock | true |
| sampling_cpu_clock | true |
| sampling_percent | true |
| sampling_gpu_power | true |
| sampling_gpu_temp | true |
| sampling_gpu_busy | true |
| sampling_gpu_memory_usage | true |
|---------------------------------|---------------|
```
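
As a minimal sketch of the configuration step above, the following script builds the event list for a chosen set of GPUs and writes it to a configuration file. The file name `omnitrace-events.cfg`, the `devices` list, and the selection of metrics are illustrative assumptions; the `-c` option is the one listed in the `omnitrace-sample --help` output. Check `omnitrace-avail -H -c GPU` for the counters that are actually available on each device.

```shell
#!/bin/sh
# Hypothetical choices: adjust the metrics and device IDs to your system.
metrics="GPUBusy Wavefronts VALUBusy L2CacheHit MemUnitBusy"
devices="0 1"
cfg="omnitrace-events.cfg"   # illustrative file name

# Build a comma-separated list such as GPUBusy:device=0,Wavefronts:device=0,...
events=""
for d in $devices; do
    for m in $metrics; do
        events="${events:+$events,}${m}:device=${d}"
    done
done

# Write the setting in the same "KEY = value" form used above.
printf 'OMNITRACE_ROCM_EVENTS = %s\n' "$events" > "$cfg"
cat "$cfg"

# Only suggest the run command if omnitrace-sample is on the PATH.
if command -v omnitrace-sample >/dev/null 2>&1; then
    echo "Run: omnitrace-sample -c $cfg -- ./MatrixTranspose"
fi
```

Generating the list in a loop keeps the metric names in one place, which helps when the same counters are wanted on several GPUs.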
## Execution in Sampling mode
```
omnitrace-sample --help
...
Options:
-h, -?, --help Shows this page
[DEBUG OPTIONS]
--monochrome Disable colorized output
--debug Debug output
-v, --verbose Verbose output
[GENERAL OPTIONS]
-c, --config Configuration file
-o, --output Output path. Accepts 1-2 parameters corresponding to the output path and the output prefix
-T, --trace Generate a detailed trace (perfetto output)
-P, --profile Generate a call-stack-based profile (conflicts with --flat-profile)
-F, --flat-profile Generate a flat profile (conflicts with --profile)
-H, --host Enable sampling host-based metrics for the process. E.g. CPU frequency, memory usage, etc.
-D, --device Enable sampling device-based metrics for the process. E.g. GPU temperature, memory usage, etc.
[TRACING OPTIONS]
--trace-file Specify the trace output filename. Relative filepath will be with respect to output path and output prefix.
--trace-buffer-size Size limit for the trace output (in KB)
--trace-fill-policy [ discard | ring_buffer ]
Policy for new data when the buffer size limit is reached:
- discard : new data is ignored
- ring_buffer : new data overwrites oldest data
[PROFILE OPTIONS]
--profile-format [ console | json | text ]
Data formats for profiling results
--profile-diff Generate a diff output b/t the profile collected and an existing profile from another run Accepts 1-2 parameters
corresponding to the input path and the input prefix
[HOST/DEVICE (PROCESS SAMPLING) OPTIONS]
--process-freq Set the default host/device sampling frequency (number of interrupts per second)
--process-wait Set the default wait time (i.e. delay) before taking first host/device sample (in seconds of realtime)
--process-duration Set the duration of the host/device sampling (in seconds of realtime)
--cpus CPU IDs for frequency sampling. Supports integers and/or ranges
--gpus GPU IDs for SMI queries. Supports integers and/or ranges
[GENERAL SAMPLING OPTIONS]
-f, --freq Set the default sampling frequency (number of interrupts per second)
-w, --wait Set the default wait time (i.e. delay) before taking first sample (in seconds). This delay time is based on the clock
of the sampler, i.e., a delay of 1 second for CPU-clock sampler may not equal 1 second of realtime
-d, --duration Set the duration of the sampling (in seconds of realtime). I.e., it is possible (currently) to set a CPU-clock time
delay that exceeds the real-time duration... resulting in zero samples being taken
-t, --tids Specify the default thread IDs for sampling, where 0 (zero) is the main thread and each thread created by the target
application is assigned an atomically incrementing value.
[SAMPLING TIMER OPTIONS]
--cputime Sample based on a CPU-clock timer (default). Accepts zero or more arguments:
0. Enables sampling based on CPU-clock timer.
1. Interrupts per second. E.g., 100 == sample every 10 milliseconds of CPU-time.
2. Delay (in seconds of CPU-clock time). I.e., how long each thread should wait before taking first sample.
3+ Thread IDs to target for sampling, starting at 0 (the main thread).
May be specified as index or range, e.g., '0 2-4' will be interpreted as:
sample the main thread (0), do not sample the first child thread but sample the 2nd, 3rd, and 4th child threads
--realtime Sample based on a real-clock timer. Accepts zero or more arguments:
0. Enables sampling based on real-clock timer.
1. Interrupts per second. E.g., 100 == sample every 10 milliseconds of realtime.
2. Delay (in seconds of real-clock time). I.e., how long each thread should wait before taking first sample.
3+ Thread IDs to target for sampling, starting at 0 (the main thread).
May be specified as index or range, e.g., '0 2-4' will be interpreted as:
sample the main thread (0), do not sample the first child thread but sample the 2nd, 3rd, and 4th child threads
When sampling with a real-clock timer, please note that enabling this will cause threads which are typically "idle"
to consume more resources since, while idle, the real-clock time increases (and therefore triggers taking samples)
whereas the CPU-clock time does not.
[BACKEND OPTIONS] (These options control region information captured w/o sampling or instrumentation)
-I, --include [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ]
Include data from these backends
-E, --exclude [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ]
Exclude data from these backends
[HARDWARE COUNTER OPTIONS]
-C, --cpu-events Set the CPU hardware counter events to record (ref: `omnitrace-avail -H -c CPU`)
-G, --gpu-events Set the GPU hardware counter events to record (ref: `omnitrace-avail -H -c GPU`)
[MISCELLANEOUS OPTIONS]
-i, --inlines Include inline info in output when available
--hsa-interrupt [ 0 | 1 ] Set the value of the HSA_ENABLE_INTERRUPT environment variable.
ROCm version 5.2 and older have a bug which will cause a deadlock if a sample is taken while waiting for the signal
that a kernel completed -- which happens when sampling with a real-clock timer. We require this option to be set to
when --realtime is specified to make users aware that, while this may fix the bug, it can have a negative impact on
performance.
Values:
0 avoid triggering the bug, potentially at the cost of reduced performance
1 do not modify how ROCm is notified about kernel completion
```
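
The frequency, delay, and duration options interact, so a quick calculation can help when choosing values. The sketch below converts a sampling frequency into the interval between samples and estimates an upper bound on the number of samples per thread. The values of `freq`, `wait`, and `duration` are illustrative assumptions, not recommended defaults, and the command is only printed, not executed.

```shell
#!/bin/sh
freq=100       # interrupts per second (-f)
wait=1         # delay before the first sample, in seconds (-w)
duration=10    # sampling duration, in seconds of realtime (-d)

# 100 interrupts per second corresponds to one sample every 10 ms.
interval_ms=$((1000 / freq))

# Upper bound only: the delay and idle threads reduce the real count.
max_samples=$((freq * duration))

echo "interval: ${interval_ms} ms, at most ${max_samples} samples per thread"
echo "omnitrace-sample -f $freq -w $wait -d $duration -- ./MatrixTranspose"
```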
```
srun -n 1 --gpus 1 omnitrace-sample -- ./MatrixTranspose
[omnitrace][omnitrace_init_tooling] Instrumentation mode: Sampling
______ .___ ___. .__ __. __ .___________..______ ___ ______ _______
/ __ \ | \/ | | \ | | | | | || _ \ / \ / || ____|
| | | | | \ / | | \| | | | `---| |----`| |_) | / ^ \ | ,----'| |__
| | | | | |\/| | | . ` | | | | | | / / /_\ \ | | | __|
| `--' | | | | | | |\ | | | | | | |\ \----./ _____ \ | `----.| |____
\______/ |__| |__| |__| \__| |__| |__| | _| `._____/__/ \__\ \______||_______|
[omnitrace] /proc/sys/kernel/perf_event_paranoid has a value of 3. Disabling PAPI (requires a value <= 1)...
[omnitrace] In order to enable PAPI support, run 'echo N | sudo tee /proc/sys/kernel/perf_event_paranoid' where N is < 2
[609.773] perfetto.cc:55910 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
Device name
Device name
[omnitrace][107567][2042][hip_activity_callback] 1 :: Discarding kernel roctracer activity record which ended before it started :: CopyHostToDevice :: CopyHostToDevice :: cid=7, time_ns=(322609988575074:322609987794326) delta=-780748, device=0, queue=0, pid=0, tid=0, op=COPY
PASSED!
[omnitrace][107567][0][omnitrace_finalize]
[omnitrace][107567][0][omnitrace_finalize] finalizing...
[omnitrace][107567][0][omnitrace_finalize]
[omnitrace][107567][0][omnitrace_finalize] omnitrace/process/107567 : 1.225706 sec wall_clock, 218.244 MB peak_rss, 211.304 MB page_rss, 0.210000 sec cpu_clock, 17.1 % cpu_util [laps: 1]
[omnitrace][107567][0][omnitrace_finalize] omnitrace/process/107567/thread/0 : 1.221656 sec wall_clock, 0.209007 sec thread_cpu_clock, 17.1 % thread_cpu_util, 217.820 MB peak_rss [laps: 2]
[omnitrace][107567][0][omnitrace_finalize]
[omnitrace][107567][0][omnitrace_finalize] Finalizing perfetto...
[omnitrace][107567][perfetto]> Outputting '/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/omnitrace-MatrixTranspose-output/2022-10-10_15.28/perfetto-trace.proto' (23.16 KB / 0.02 MB / 0.00 GB)... Done
[omnitrace][107567][cpu_util]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/cpu_util.json'
[omnitrace][107567][cpu_util]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/cpu_util.txt'
[omnitrace][107567][cpu_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/cpu_clock.json'
[omnitrace][107567][cpu_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/cpu_clock.txt'
[omnitrace][107567][page_rss]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/page_rss.json'
[omnitrace][107567][page_rss]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/page_rss.txt'
[omnitrace][107567][roctracer]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/roctracer.json'
[omnitrace][107567][roctracer]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/roctracer.txt'
[omnitrace][107567][wall_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/wall_clock.json'
[omnitrace][107567][wall_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/wall_clock.txt'
[omnitrace][107567][metadata]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/metadata-107567.json' and 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/functions-107567.json'
[omnitrace][107567][0][omnitrace_finalize] Finalized
[611.040] perfetto.cc:57383 Tracing session 1 ended, total sessions:0
```
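
The log above reports that PAPI is disabled because `/proc/sys/kernel/perf_event_paranoid` has a value of 3. A small check like the following, run before profiling, shows whether CPU hardware counters will be available. It is a sketch that only encodes the threshold stated in the log (a value <= 1 is required); the helper name `papi_allowed` is hypothetical.

```shell
#!/bin/sh
# Succeeds when the given perf_event_paranoid value permits PAPI
# (the omnitrace log states that a value <= 1 is required).
papi_allowed() {
    [ "$1" -le 1 ]
}

paranoid_file=/proc/sys/kernel/perf_event_paranoid
if [ -r "$paranoid_file" ]; then
    level=$(cat "$paranoid_file")
    if papi_allowed "$level"; then
        echo "perf_event_paranoid=$level: PAPI counters should be usable"
    else
        echo "perf_event_paranoid=$level: omnitrace will disable PAPI"
    fi
else
    echo "$paranoid_file not readable; cannot check PAPI support"
fi
```

Changing the value requires root privileges, as the log message indicates, so on a shared cluster this is usually a request for the system administrators.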

```
srun -n 1 --gpus 1 omnitrace-sample -H -D -- ./MatrixTranspose
[omnitrace][omnitrace_init_tooling] Instrumentation mode: Sampling
______ .___ ___. .__ __. __ .___________..______ ___ ______ _______
/ __ \ | \/ | | \ | | | | | || _ \ / \ / || ____|
| | | | | \ / | | \| | | | `---| |----`| |_) | / ^ \ | ,----'| |__
| | | | | |\/| | | . ` | | | | | | / / /_\ \ | | | __|
| `--' | | | | | | |\ | | | | | | |\ \----./ _____ \ | `----.| |____
\______/ |__| |__| |__| \__| |__| |__| | _| `._____/__/ \__\ \______||_______|
[omnitrace] /proc/sys/kernel/perf_event_paranoid has a value of 3. Disabling PAPI (requires a value <= 1)...
[omnitrace] In order to enable PAPI support, run 'echo N | sudo tee /proc/sys/kernel/perf_event_paranoid' where N is < 2
[619.733] perfetto.cc:55910 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
Device name
Device name
[omnitrace][9441][2041][hip_activity_callback] 1 :: Discarding kernel roctracer activity record which ended before it started :: CopyHostToDevice :: CopyHostToDevice :: cid=7, time_ns=(642619960562938:642619959822068) delta=-740870, device=0, queue=0, pid=0, tid=0, op=COPY
PASSED!
[omnitrace][9441][0][omnitrace_finalize]
[omnitrace][9441][0][omnitrace_finalize] finalizing...
[omnitrace][9441][0][omnitrace_finalize]
[omnitrace][9441][0][omnitrace_finalize] omnitrace/process/9441 : 1.234547 sec wall_clock, 218.056 MB peak_rss, 211.100 MB page_rss, 0.280000 sec cpu_clock, 22.7 % cpu_util [laps: 1]
[omnitrace][9441][0][omnitrace_finalize] omnitrace/process/9441/thread/0 : 1.229435 sec wall_clock, 0.211211 sec thread_cpu_clock, 17.2 % thread_cpu_util, 217.644 MB peak_rss [laps: 2]
[omnitrace][9441][0][omnitrace_finalize]
[omnitrace][9441][0][omnitrace_finalize] Finalizing perfetto...
[omnitrace][9441][perfetto]> Outputting '/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/omnitrace-MatrixTranspose-output/2022-10-18_22.52/perfetto-trace.proto' (112.13 KB / 0.11 MB / 0.00 GB)... Done
[omnitrace][9441][cpu_util]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/cpu_util.json'
[omnitrace][9441][cpu_util]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/cpu_util.txt'
[omnitrace][9441][cpu_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/cpu_clock.json'
[omnitrace][9441][cpu_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/cpu_clock.txt'
[omnitrace][9441][page_rss]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/page_rss.json'
[omnitrace][9441][page_rss]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/page_rss.txt'
[omnitrace][9441][roctracer]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/roctracer.json'
[omnitrace][9441][roctracer]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/roctracer.txt'
[omnitrace][9441][wall_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/wall_clock.json'
[omnitrace][9441][wall_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/wall_clock.txt'
[omnitrace][9441][metadata]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/metadata-9441.json' and 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/functions-9441.json'
[omnitrace][9441][0][omnitrace_finalize] Finalized
[621.025] perfetto.cc:57383 Tracing session 1 ended, total sessions:0
```
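
Each run writes its results to a new timestamped directory under `omnitrace-MatrixTranspose-output`, as the logs above show. The following sketch finds the most recent run directory and lists its files; the helper name `latest_run_dir` is hypothetical, and it assumes the directory names keep the `YYYY-MM-DD_HH.MM` form seen above, which sorts chronologically as plain text.

```shell
#!/bin/sh
# Print the most recent timestamped run directory under the given
# output directory, or nothing if there is none.
latest_run_dir() {
    for d in "$1"/*/; do
        [ -d "$d" ] && printf '%s\n' "${d%/}"
    done | sort | tail -n 1
}

out_base="omnitrace-MatrixTranspose-output"
run_dir=$(latest_run_dir "$out_base")
if [ -n "$run_dir" ]; then
    echo "Latest run: $run_dir"
    ls "$run_dir"
else
    echo "No runs found under $out_base"
fi
```

The `perfetto-trace.proto` file in that directory can then be loaded into the Perfetto UI for visualization, while the `.txt` files can be read directly.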

The metrics can also be declared directly on the command line with `-G`:
```
srun -n 1 --gpus 1 omnitrace-sample -TPHD -G "GPUBusy:device=0,Wavefronts:device=0,VALUBusy:device=0,L2CacheHit:device=0,MemUnitBusy:device=0" -- ./MatrixTranspose
[omnitrace][omnitrace_init_tooling] Instrumentation mode: Sampling
______ .___ ___. .__ __. __ .___________..______ ___ ______ _______
/ __ \ | \/ | | \ | | | | | || _ \ / \ / || ____|
| | | | | \ / | | \| | | | `---| |----`| |_) | / ^ \ | ,----'| |__
| | | | | |\/| | | . ` | | | | | | / / /_\ \ | | | __|
| `--' | | | | | | |\ | | | | | | |\ \----./ _____ \ | `----.| |____
\______/ |__| |__| |__| \__| |__| |__| | _| `._____/__/ \__\ \______||_______|
[omnitrace] /proc/sys/kernel/perf_event_paranoid has a value of 3. Disabling PAPI (requires a value <= 1)...
[omnitrace] In order to enable PAPI support, run 'echo N | sudo tee /proc/sys/kernel/perf_event_paranoid' where N is < 2
[681.799] perfetto.cc:55910 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
Device name
Device name
[omnitrace][31487][2041][hip_activity_callback] 1 :: Discarding kernel roctracer activity record which ended before it started :: CopyHostToDevice :: CopyHostToDevice :: cid=7, time_ns=(646682055581903:646682054882615) delta=-699288, device=0, queue=0, pid=0, tid=0, op=COPY
PASSED!
[omnitrace][31487][0][omnitrace_finalize]
[omnitrace][31487][0][omnitrace_finalize] finalizing...
[omnitrace][31487][0][omnitrace_finalize]
[omnitrace][31487][0][omnitrace_finalize] omnitrace/process/31487 : 1.266008 sec wall_clock, 224.308 MB peak_rss, 217.502 MB page_rss, 0.320000 sec cpu_clock, 25.3 % cpu_util [laps: 1]
[omnitrace][31487][0][omnitrace_finalize] omnitrace/process/31487/thread/0 : 1.259163 sec wall_clock, 0.236662 sec thread_cpu_clock, 18.8 % thread_cpu_util, 223.840 MB peak_rss [laps: 2]
[omnitrace][31487][0][omnitrace_finalize]
[omnitrace][31487][0][omnitrace_finalize] Finalizing perfetto...
[omnitrace][31487][perfetto]> Outputting '/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/omnitrace-MatrixTranspose-output/2022-10-19_00.00/perfetto-trace.proto' (218.72 KB / 0.22 MB / 0.00 GB)... Done
[omnitrace][31487][rocprof-device-0-GPUBusy]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-GPUBusy.json'
[omnitrace][31487][rocprof-device-0-GPUBusy]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-GPUBusy.txt'
[omnitrace][31487][rocprof-device-0-Wavefronts]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-Wavefronts.json'
[omnitrace][31487][rocprof-device-0-Wavefronts]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-Wavefronts.txt'
[omnitrace][31487][rocprof-device-0-VALUBusy]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-VALUBusy.json'
[omnitrace][31487][rocprof-device-0-VALUBusy]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-VALUBusy.txt'
[omnitrace][31487][rocprof-device-0-L2CacheHit]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-L2CacheHit.json'
[omnitrace][31487][rocprof-device-0-L2CacheHit]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-L2CacheHit.txt'
[omnitrace][31487][rocprof-device-0-MemUnitBusy]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-MemUnitBusy.json'
[omnitrace][31487][rocprof-device-0-MemUnitBusy]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-MemUnitBusy.txt'
[omnitrace][31487][cpu_util]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/cpu_util.json'
[omnitrace][31487][cpu_util]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/cpu_util.txt'
[omnitrace][31487][cpu_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/cpu_clock.json'
[omnitrace][31487][cpu_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/cpu_clock.txt'
[omnitrace][31487][page_rss]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/page_rss.json'
[omnitrace][31487][page_rss]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/page_rss.txt'
[omnitrace][31487][roctracer]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/roctracer.json'
[omnitrace][31487][roctracer]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/roctracer.txt'
[omnitrace][31487][sampling_gpu_memory_usage]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/sampling_gpu_memory_usage.json'
[omnitrace][31487][sampling_gpu_memory_usage]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/sampling_gpu_memory_usage.txt'
[omnitrace][31487][sampling_gpu_power]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/sampling_gpu_power.json'
[omnitrace][31487][sampling_gpu_power]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/sampling_gpu_power.txt'
[omnitrace][31487][sampling_gpu_temperature]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/sampling_gpu_temperature.json'
[omnitrace][31487][sampling_gpu_temperature]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/sampling_gpu_temperature.txt'
[omnitrace][31487][sampling_gpu_busy_percent]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/sampling_gpu_busy_percent.json'
[omnitrace][31487][sampling_gpu_busy_percent]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/sampling_gpu_busy_percent.txt'
[omnitrace][31487][wall_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/wall_clock.json'
[omnitrace][31487][wall_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/wall_clock.txt'
[omnitrace][31487][metadata]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/metadata-31487.json' and 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/functions-31487.json'
[omnitrace][31487][0][omnitrace_finalize] Finalized
[683.153] perfetto.cc:57383 Tracing session 1 ended, total sessions:0
```
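
In the log above, every requested GPU metric produces a pair of files named `rocprof-device-<id>-<metric>.json` and `.txt`. The sketch below derives the expected file names from the event list and reports which ones are present, which is a quick way to confirm that all counters were collected. The helper `event_to_stem` and the `RUN_DIR` variable are hypothetical, and the naming pattern is inferred from this one log rather than from the omnitrace documentation.

```shell
#!/bin/sh
events="GPUBusy:device=0,Wavefronts:device=0,VALUBusy:device=0,L2CacheHit:device=0,MemUnitBusy:device=0"

# Map "Metric:device=N" to "rocprof-device-N-Metric", the stem seen in the log.
event_to_stem() {
    metric=${1%%:*}
    device=${1##*device=}
    printf 'rocprof-device-%s-%s\n' "$device" "$metric"
}

# Hypothetical: set RUN_DIR to a timestamped output directory.
run_dir=${RUN_DIR:-.}

old_ifs=$IFS
IFS=,
for e in $events; do
    stem=$(event_to_stem "$e")
    if [ -f "$run_dir/$stem.txt" ]; then
        echo "found   $stem.txt"
    else
        echo "missing $stem.txt"
    fi
done
IFS=$old_ifs
```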

## Selective Instrumentation