https://github.com/AMDResearch/omnitrace/releases
Full documentation: https://amdresearch.github.io/omnitrace/
source omnitrace_installation_path/share/omnitrace/setup-env.sh
omnitrace --help
omnitrace-avail --help
Note: You may need to set LD_LIBRARY_PATH; do not forget to load the ROCm module.
omnitrace <omnitrace-options> -- <exe> [<exe-options>]
omnitrace-sample <omnitrace-options> -- <exe> [<exe-options>]
omnitrace <omnitrace-options> -o <name-of-new-exe-or-library> -- <exe-or-library>
git clone https://github.com/ROCm-Developer-Tools/HIP.git
cd HIP/samples/2_Cookbook/0_MatrixTranspose
Let's look at the main content of the example source file:
// Device (Kernel) function, it must be void
__global__ void matrixTranspose(float* out, float* in, const int width) {
int x = blockDim.x * blockIdx.x + threadIdx.x;
int y = blockDim.y * blockIdx.y + threadIdx.y;
out[y * width + x] = in[x * width + y];
}
// allocate the memory on the device side
hipMalloc((void**)&gpuMatrix, NUM * sizeof(float));
hipMalloc((void**)&gpuTransposeMatrix, NUM * sizeof(float));
// Memory transfer from host to device
hipMemcpy(gpuMatrix, Matrix, NUM * sizeof(float), hipMemcpyHostToDevice);
std::cout << "Device name " << devProp.name << std::endl;
// Launching kernel from host
hipLaunchKernelGGL(matrixTranspose, dim3(WIDTH / THREADS_PER_BLOCK_X, WIDTH / THREADS_PER_BLOCK_Y),
dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y), 0, 0, gpuTransposeMatrix,
gpuMatrix, WIDTH);
// Memory transfer from device to host
hipMemcpy(TransposeMatrix, gpuTransposeMatrix, NUM * sizeof(float), hipMemcpyDeviceToHost);
// CPU MatrixTranspose computation
matrixTransposeCPUReference(cpuTransposeMatrix, Matrix, WIDTH);
salloc -N 1 -A project_462000075 -p gpu -n 1 --gpus 1 -t 01:00:00
module load rocm/5.0.2
Comment out the execution of test on line 41 of the Makefile to avoid executing the binary during the build:
all: $(EXECUTABLE) test
-> all: $(EXECUTABLE) #test
Compile
make
Execute
srun -n 1 ./MatrixTranspose
Error:
srun -n 1 ./MatrixTranspose
"hipErrorNoBinaryForGpu: Unable to find code object for all current devices!"
srun: error: nid005660: task 0: Aborted
srun: launch/slurm: _step_signal: Terminating StepId=1731160.5
The error means the binary contains no code object for the GPU on the node. Compile with the flag --offload-arch=gfx90a: in the Makefile change CXXFLAGS =-g
to CXXFLAGS =-g --offload-arch=gfx90a. You can also add another variable such as HIPFLAGS and adjust the Makefile accordingly. Then rebuild:
srun -n 1 make
It is better to define the flags explicitly. If you compile with the Cray compiler, loading the module craype-accel-amd-gfx90a is enough.
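The two Makefile changes described so far (commenting out the test target and adding the offload flag) can be scripted. The sketch below works on a throwaway two-line stand-in, not on the real sample Makefile, so the exact line contents are an assumption you should check against your copy before reusing the sed expressions.

```shell
# Stand-in Makefile with only the two lines discussed in the text (assumed contents).
workdir=$(mktemp -d)
printf 'CXXFLAGS =-g\nall: $(EXECUTABLE) test\n' > "$workdir/Makefile"

# 1) add the gfx90a offload architecture to CXXFLAGS
# 2) comment out the automatic test run in the "all" target
sed -e 's/^CXXFLAGS =-g$/CXXFLAGS =-g --offload-arch=gfx90a/' \
    -e 's/^all: \$(EXECUTABLE) test$/all: $(EXECUTABLE) #test/' \
    "$workdir/Makefile" > "$workdir/Makefile.new"

cat "$workdir/Makefile.new"
```

Writing to Makefile.new instead of editing in place keeps the original untouched, so the result can be compared with diff before it is adopted.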
srun -n 1 ./MatrixTranspose
PASSED!
srun -n 1 omnitrace-avail --categories omnitrace --brief --description
[omnitrace] /proc/sys/kernel/perf_event_paranoid has a value of 3. Disabling PAPI (requires a value <= 1)…
[omnitrace] In order to enable PAPI support, run 'echo N | sudo tee /proc/sys/kernel/perf_event_paranoid' where N is < 2
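The PAPI message can be anticipated before running any Omnitrace tool by reading the kernel setting directly. This is a minimal sketch; the helper name papi_available is made up for illustration, and the threshold (a value of 1 or lower) is taken from the message above.

```shell
# Hypothetical helper: succeeds when the given perf_event_paranoid value allows PAPI.
papi_available() {
    [ "$1" -le 1 ]
}

paranoid_file=/proc/sys/kernel/perf_event_paranoid
if [ -r "$paranoid_file" ]; then
    level=$(cat "$paranoid_file")
    if papi_available "$level"; then
        echo "perf_event_paranoid=$level: PAPI counters can be used"
    else
        echo "perf_event_paranoid=$level: PAPI will be disabled"
    fi
else
    echo "$paranoid_file is not readable on this system"
fi
```

On a compute node the check should be launched through srun, since login and compute nodes may be configured differently.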
srun -n 1 omnitrace-avail --categories omnitrace
All the options with some default settings
srun -n 1 omnitrace-avail -G omnitrace.cfg
All the options with description included
srun -n 1 omnitrace-avail -G omnitrace.cfg --all
In order to define which configuration file will be used, declare:
export OMNITRACE_CONFIG_FILE=/path/name.cfg
srun -n 1 omnitrace --help
[omnitrace][exe]
[omnitrace][exe] command :: ''...
[omnitrace][exe]
[omnitrace] Usage: /pfs/lustrep4/scratch/project_462000075/markoman/omnitrace17/bin/omnitrace
[ --help (count: 0, dtype: bool)
--verbose (max: 1, dtype: bool)
--error (max: 1, dtype: boolean)
--debug (max: 1, dtype: bool)
--log (count: 1)
--log-file (count: 1)
--simulate (max: 1, dtype: bool)
--print-format (min: 1, dtype: string)
--print-dir (count: 1, dtype: string)
--print-available (count: 1)
--print-instrumented (count: 1)
--print-coverage (count: 1)
--print-excluded (count: 1)
--print-overlapping (count: 1)
--print-instructions (max: 1, dtype: bool)
--output (min: 0, dtype: string)
--pid (count: 1, dtype: int)
--mode (count: 1)
--force (max: 1, dtype: bool)
--command (count: 1)
--prefer (count: 1)
--library (count: unlimited)
--main-function (count: 1)
--load (count: unlimited, dtype: string)
--load-instr (count: unlimited, dtype: filepath)
--init-functions (count: unlimited, dtype: string)
--fini-functions (count: unlimited, dtype: string)
--all-functions (max: 1, dtype: bool)
--function-include (count: unlimited)
--function-exclude (count: unlimited)
--function-restrict (count: unlimited)
--module-include (count: unlimited)
--module-exclude (count: unlimited)
--module-restrict (count: unlimited)
--label (count: unlimited, dtype: string)
--config (min: 1, dtype: string)
--default-components (count: unlimited, dtype: string)
--env (count: unlimited)
--mpi (max: 1, dtype: bool)
--instrument-loops (max: 1, dtype: boolean)
--min-instructions (count: 1, dtype: int)
--min-address-range (count: 1, dtype: int)
--min-instructions-loop (count: 1, dtype: int)
--min-address-range-loop (count: 1, dtype: int)
--coverage (max: 1, dtype: bool)
--dynamic-callsites (max: 1, dtype: boolean)
--traps (max: 1, dtype: bool)
--loop-traps (max: 1, dtype: bool)
--allow-overlapping (max: 1, dtype: bool)
--parse-all-modules (max: 1, dtype: bool)
--batch-size (count: 1, dtype: int)
--dyninst-rt (min: 1, dtype: filepath)
--dyninst-options (count: unlimited)
] -- <CMD> <ARGS>
srun -n 1 --gpus 1 omnitrace -- ./MatrixTranspose
[omnitrace][exe]
[omnitrace][exe] command :: '/pfs/lustrep4/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose'...
[omnitrace][exe]
[omnitrace][exe] DYNINST_API_RT: /pfs/lustrep4/scratch/project_462000075/markoman/omnitrace17/lib/omnitrace/libdyninstAPI_RT.so.11.0.1
[omnitrace][exe] Finding instrumentation functions...
[omnitrace][exe] 2 instrumented funcs in libamd_comgr.so.2.4.50002
[omnitrace][exe] 3 instrumented funcs in libamdhip64.so.5.0.50002
[omnitrace][exe] 15 instrumented funcs in libhsa-runtime64.so.1.5.50002
[omnitrace][exe] 2 instrumented funcs in libtinfo.so.6.1
[omnitrace][exe] 3 instrumented funcs in libz.so.1.2.11
[omnitrace][exe] Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.52/instrumentation/available.json'... Done
[omnitrace][exe] Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.52/instrumentation/available.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.52/instrumentation/instrumented.json'... Done
[omnitrace][exe] Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.52/instrumentation/instrumented.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.52/instrumentation/excluded.json'... Done
[omnitrace][exe] Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.52/instrumentation/excluded.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.52/instrumentation/overlapping.json'... Done
[omnitrace][exe] Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.52/instrumentation/overlapping.txt'... Done
[omnitrace][exe] Executing...
[omnitrace][omnitrace_init_tooling] Instrumentation mode: Trace
______ .___ ___. .__ __. __ .___________..______ ___ ______ _______
/ __ \ | \/ | | \ | | | | | || _ \ / \ / || ____|
| | | | | \ / | | \| | | | `---| |----`| |_) | / ^ \ | ,----'| |__
| | | | | |\/| | | . ` | | | | | | / / /_\ \ | | | __|
| `--' | | | | | | |\ | | | | | | |\ \----./ _____ \ | `----.| |____
\______/ |__| |__| |__| \__| |__| |__| | _| `._____/__/ \__\ \______||_______|
[omnitrace] /proc/sys/kernel/perf_event_paranoid has a value of 3. Disabling PAPI (requires a value <= 1)...
[omnitrace] In order to enable PAPI support, run 'echo N | sudo tee /proc/sys/kernel/perf_event_paranoid' where N is < 2
[706.292] perfetto.cc:55910 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
Device name
Device name
[omnitrace][118151][2042][hip_activity_callback] 1 :: Discarding kernel roctracer activity record which ended before it started :: CopyHostToDevice :: CopyHostToDevice :: cid=7, time_ns=(255706649483053:255706648575215) delta=-907838, device=0, queue=0, pid=0, tid=0, op=COPY
PASSED!
[omnitrace][118151][0][omnitrace_finalize]
[omnitrace][118151][0][omnitrace_finalize] finalizing...
[omnitrace][118151][0][omnitrace_finalize]
[omnitrace][118151][0][omnitrace_finalize] omnitrace/process/118151 : 0.364673 sec wall_clock, 237.052 MB peak_rss, 230.597 MB page_rss, 0.330000 sec cpu_clock, 90.6 % cpu_util [laps: 1]
[omnitrace][118151][0][omnitrace_finalize] omnitrace/process/118151/thread/0 : 0.364280 sec wall_clock, 0.289480 sec thread_cpu_clock, 79.5 % thread_cpu_util, 237.028 MB peak_rss [laps: 1]
[omnitrace][118151][0][omnitrace_finalize]
[omnitrace][118151][0][omnitrace_finalize] Finalizing perfetto...
[omnitrace][118151][perfetto]> Outputting '/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/omnitrace-MatrixTranspose-output/2022-10-16_22.53/perfetto-trace.proto' (57.18 KB / 0.06 MB / 0.00 GB)... Done
[omnitrace][118151][cpu_util]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/cpu_util.json'
[omnitrace][118151][cpu_util]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/cpu_util.txt'
[omnitrace][118151][cpu_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/cpu_clock.json'
[omnitrace][118151][cpu_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/cpu_clock.txt'
[omnitrace][118151][page_rss]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/page_rss.json'
[omnitrace][118151][page_rss]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/page_rss.txt'
[omnitrace][118151][roctracer]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/roctracer.json'
[omnitrace][118151][roctracer]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/roctracer.txt'
[omnitrace][118151][wall_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/wall_clock.json'
[omnitrace][118151][wall_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/wall_clock.txt'
[omnitrace][118151][metadata]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/metadata-118151.json' and 'omnitrace-MatrixTranspose-output/2022-10-16_22.53/functions-118151.json'
[omnitrace][118151][0][omnitrace_finalize] Finalized
[706.822] perfetto.cc:57383 Tracing session 1 ended, total sessions:0
[omnitrace][exe] End of omnitrace
In this case the profiling overhead is more than 120x.
nm --demangle MatrixTranspose | egrep -i ' (t|u) '
000000000020d080 t _GLOBAL__sub_I_MatrixTranspose.cpp
000000000020c970 T __device_stub__warmup()
000000000020ca40 T matrixTransposeCPUReference(float*, float*, unsigned int)
000000000020c9c0 T __device_stub__matrixTranspose(float*, float*, int)
U std::ctype<char>::_M_widen_init() const
U std::ostream::put(char)
U std::ostream::flush()
U std::ios_base::Init::Init()
U std::ios_base::Init::~Init()
U std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)
U std::__throw_bad_cast()
U __cxa_atexit
000000000020c930 t __do_global_dtors_aux
U __hipPopCallConfiguration
U __hipPushCallConfiguration
U __hipRegisterFatBinary
U __hipRegisterFunction
U __hipUnregisterFatBinary
000000000020cfd0 t __hip_module_ctor
000000000020d060 t __hip_module_dtor
000000000020d12e T __libc_csu_fini
000000000020d0ae T __libc_csu_init
U __libc_start_main
000000000020d178 t _fini
000000000020d160 t _init
000000000020c890 T _start
000000000020d14e t atexit
000000000020c8c0 t deregister_tm_clones
000000000020c960 t frame_dummy
U free
U hipFree
U hipGetDeviceProperties
U hipLaunchKernel
U hipMalloc
U hipMemcpy
000000000020cb00 T main
U malloc
U printf
U puts
000000000020c8f0 t register_tm_clones
U strlen
With binary instrumentation, only the functions in the text section (symbol type t or T in the nm output) are instrumented.
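To see only the candidates for instrumentation, the nm listing can be reduced to the symbols whose type column is t or T. The sketch below runs on a few lines copied from the listing above instead of calling nm, and the function name text_symbols is made up for illustration.

```shell
# A few lines copied from the nm output above (addresses, type, demangled name).
nm_sample='000000000020ca40 T matrixTransposeCPUReference(float*, float*, unsigned int)
                 U hipMalloc
000000000020cb00 T main
000000000020d14e t atexit'

# Hypothetical helper: keep text-section symbols and print only their names.
text_symbols() {
    awk '$2 == "T" || $2 == "t" {
        name = $0
        sub(/^[0-9a-f]+ [Tt] /, "", name)   # drop the address and type columns
        print name
    }'
}

printf '%s\n' "$nm_sample" | text_symbols
```

With a real binary the same filter can be fed by nm --demangle MatrixTranspose; undefined symbols such as hipMalloc are dropped because their type is U.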
Try: srun -n 1 --gpus 1 omnitrace -v -1 --simulate --print-available functions -- ./MatrixTranspose
srun -n 1 --gpus 1 omnitrace -v -1 --simulate --print-available functions -I 'function_name1' 'function_name2' -- ./MatrixTranspose
srun -n 1 --gpus 1 omnitrace -v -1 --simulate --print-available functions -E 'function_name1' 'function_name2' -- ./MatrixTranspose
The option --simulate activates simulation mode, and the option --print-available functions prints the functions available for instrumentation together with their instruction counts. To use this option here, we rewrite the binary with -o name:
srun -n 1 omnitrace -v -1 --simulate --print-available functions -o matrix.inst -- ./MatrixTranspose
[omnitrace][exe]
[omnitrace][exe] command :: '/pfs/lustrep4/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose'...
[omnitrace][exe]
[omnitrace][exe] Resolved 'libomnitrace-rt.so' to '/pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-rt.so.11.0.1'...
[omnitrace][exe] DYNINST_API_RT: /pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-rt.so.11.0.1
[omnitrace][exe] instrumentation target: /pfs/lustrep4/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose
[omnitrace][exe] Opening '/pfs/lustrep4/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose' for binary rewrite... Done
[omnitrace][exe] Getting the address space image, modules, and procedures...
[omnitrace][exe]
[omnitrace][exe] Found 16 functions in 6 modules in instrumentation target
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/available.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/available.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/overlapping.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/overlapping.txt'... Done
[omnitrace][exe] function: 'main' ... found
[omnitrace][exe] function: 'omnitrace_user_start_trace' ... not found
[omnitrace][exe] function: 'omnitrace_user_stop_trace' ... not found
[omnitrace][exe] function: 'MPI_Init' ... not found
[omnitrace][exe] function: 'MPI_Init_thread' ... not found
[omnitrace][exe] function: 'MPI_Finalize' ... not found
[omnitrace][exe] function: 'MPI_Comm_rank' ... not found
[omnitrace][exe] function: 'MPI_Comm_size' ... not found
[omnitrace][exe] Resolved 'libomnitrace-dl.so' to '/pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-dl.so.1.7.0'...
[omnitrace][exe] loading library: '/pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-dl.so.1.7.0'...
[omnitrace][exe] Finding instrumentation functions...
[omnitrace][exe] function: 'omnitrace_init' ... found
[omnitrace][exe] function: 'omnitrace_finalize' ... found
[omnitrace][exe] function: 'omnitrace_set_env' ... found
[omnitrace][exe] function: 'omnitrace_set_mpi' ... found
[omnitrace][exe] function: 'omnitrace_push_trace' ... found
[omnitrace][exe] function: 'omnitrace_pop_trace' ... found
[omnitrace][exe] function: 'omnitrace_register_source' ... found
[omnitrace][exe] function: 'omnitrace_register_coverage' ... found
[omnitrace][exe] Resolved 'libomnitrace-dl.so' to '/pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-dl.so.1.7.0'...
[omnitrace][exe] Adding main entry snippets...
[omnitrace][exe] Adding main exit snippets...
[omnitrace][exe]
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/available.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/available.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/instrumented.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/instrumented.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/excluded.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/excluded.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/overlapping.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/overlapping.txt'... Done
[available] ../sysdeps/x86_64/crti.S:
[available] [_fini][3]
[available] [_init][7]
[available] ../sysdeps/x86_64/start.S:
[available] [_start][12]
[available] MatrixTranspose:
[available] [__device_stub__matrixTranspose][26]
[available] [__device_stub__warmup][17]
[available] [__do_global_dtors_aux][9]
[available] [__hip_module_ctor][33]
[available] [__hip_module_dtor][8]
[available] [frame_dummy][4]
[available] [targ20d132][1]
[available] MatrixTranspose.cpp:
[available] [_GLOBAL__sub_I_MatrixTranspose.cpp][8]
[available] [main][278]
[available] [matrixTransposeCPUReference][51]
[available] atexit.c:
[available] [atexit][5]
[available] elf-init.c:
[available] [__libc_csu_fini][3]
[available] [__libc_csu_init][36]
Note: the number in the brackets [] is the number of instructions in the function.
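Before choosing a threshold, it can help to preview which functions have a bracketed count at or above a given value. The sketch below parses a few lines copied from the [available] listing; the helper name filter_by_count is made up, and it assumes a function is kept when its count is at least the threshold. Omnitrace applies further exclusion rules of its own, so this is only an upper bound on what ends up instrumented.

```shell
# Lines copied from the [available] listing above.
available='[available] MatrixTranspose.cpp:
[available] [_GLOBAL__sub_I_MatrixTranspose.cpp][8]
[available] [main][278]
[available] [matrixTransposeCPUReference][51]
[available] [frame_dummy][4]'

# Hypothetical helper: print "count name" for entries whose count is >= $1.
filter_by_count() {
    sed -n 's/^\[available] *\[\(.*\)]\[\([0-9]*\)]$/\2 \1/p' |
        awk -v min="$1" '$1 >= min'
}

printf '%s\n' "$available" | filter_by_count 8
```

Header lines such as the source file name do not match the pattern and are skipped, so the whole available.txt file could be piped through the same filter.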
We care about the functions in MatrixTranspose.cpp, where the smallest count is 8. We can set the minimum number of instructions with the flag -i 8, so we can try the command:
srun -n 1 omnitrace -v -l --simulate --print-instrumented functions -i 8 -o matrix.inst -- ./MatrixTranspose
[omnitrace][exe]
[omnitrace][exe] command :: '/pfs/lustrep4/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose'...
[omnitrace][exe]
[omnitrace][exe] Resolved 'libomnitrace-rt.so' to '/pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-rt.so.11.0.1'...
[omnitrace][exe] DYNINST_API_RT: /pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-rt.so.11.0.1
[omnitrace][exe] instrumentation target: /pfs/lustrep4/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose
[omnitrace][exe] Opening '/pfs/lustrep4/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose' for binary rewrite... Done
[omnitrace][exe] Getting the address space image, modules, and procedures...
[omnitrace][exe]
[omnitrace][exe] Found 16 functions in 6 modules in instrumentation target
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/available.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/available.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/overlapping.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/overlapping.txt'... Done
[omnitrace][exe] function: 'main' ... found
[omnitrace][exe] function: 'omnitrace_user_start_trace' ... not found
[omnitrace][exe] function: 'omnitrace_user_stop_trace' ... not found
[omnitrace][exe] function: 'MPI_Init' ... not found
[omnitrace][exe] function: 'MPI_Init_thread' ... not found
[omnitrace][exe] function: 'MPI_Finalize' ... not found
[omnitrace][exe] function: 'MPI_Comm_rank' ... not found
[omnitrace][exe] function: 'MPI_Comm_size' ... not found
[omnitrace][exe] Resolved 'libomnitrace-dl.so' to '/pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-dl.so.1.7.0'...
[omnitrace][exe] loading library: '/pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-dl.so.1.7.0'...
[omnitrace][exe] Finding instrumentation functions...
[omnitrace][exe] function: 'omnitrace_init' ... found
[omnitrace][exe] function: 'omnitrace_finalize' ... found
[omnitrace][exe] function: 'omnitrace_set_env' ... found
[omnitrace][exe] function: 'omnitrace_set_mpi' ... found
[omnitrace][exe] function: 'omnitrace_push_trace' ... found
[omnitrace][exe] function: 'omnitrace_pop_trace' ... found
[omnitrace][exe] function: 'omnitrace_register_source' ... found
[omnitrace][exe] function: 'omnitrace_register_coverage' ... found
[omnitrace][exe] Resolved 'libomnitrace-dl.so' to '/pfs/lustrep4/scratch/project_462000075/markoman/omnitrace_install/lib/libomnitrace-dl.so.1.7.0'...
[omnitrace][exe] Adding main entry snippets...
[omnitrace][exe] Adding main exit snippets...
[omnitrace][exe] [function][Instrumenting] no-constraint :: 'matrixTransposeCPUReference'...
[omnitrace][exe] [function][Loop Instrumenting] no-constraint :: 'matrixTransposeCPUReference'...
[omnitrace][exe] 1 instrumented funcs in MatrixTranspose.cpp
[omnitrace][exe] 1 instrumented loops in MatrixTranspose.cpp
[omnitrace][exe]
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/available.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/available.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/instrumented.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/instrumented.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/excluded.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/excluded.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/overlapping.json'... Done
[omnitrace][exe] Outputting 'omnitrace-matrix.inst-output/instrumentation/overlapping.txt'... Done
[instrumented] MatrixTranspose.cpp:
[instrumented] [main][278]
[instrumented] [matrixTransposeCPUReference][51]
We used the option --print-instrumented functions to print the instrumented functions, and now only two functions are instrumented.
Let's create the matrix.inst file:
srun -n 1 --gpus 1 omnitrace -i 8 -o matrix.inst -- ./MatrixTranspose
If we now execute the new binary, called matrix.inst:
srun -n 1 --gpus 1 ./matrix.inst
[omnitrace][omnitrace_init_tooling] Instrumentation mode: Trace
______ .___ ___. .__ __. __ .___________..______ ___ ______ _______
/ __ \ | \/ | | \ | | | | | || _ \ / \ / || ____|
| | | | | \ / | | \| | | | `---| |----`| |_) | / ^ \ | ,----'| |__
| | | | | |\/| | | . ` | | | | | | / / /_\ \ | | | __|
| `--' | | | | | | |\ | | | | | | |\ \----./ _____ \ | `----.| |____
\______/ |__| |__| |__| \__| |__| |__| | _| `._____/__/ \__\ \______||_______|
[omnitrace] /proc/sys/kernel/perf_event_paranoid has a value of 3. Disabling PAPI (requires a value <= 1)...
[omnitrace] In order to enable PAPI support, run 'echo N | sudo tee /proc/sys/kernel/perf_event_paranoid' where N is < 2
[364.879] perfetto.cc:55910 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
Device name
PASSED!
[omnitrace][128457][0][omnitrace_finalize] finalizing...
[omnitrace][128457][0][omnitrace_finalize] omnitrace/process/128457 : 0.342078 sec wall_clock, 219.956 MB peak_rss, 212.865 MB page_rss, 0.620000 sec cpu_clock, 181.2 % cpu_util [laps: 1]
[omnitrace][128457][0][omnitrace_finalize] omnitrace/process/128457/thread/0 : 0.342018 sec wall_clock, 0.300716 sec thread_cpu_clock, 87.9 % thread_cpu_util, 219.956 MB peak_rss [laps: 1]
[omnitrace][128457][0][omnitrace_finalize] Finalizing perfetto...
[omnitrace][128457][perfetto]> Outputting '/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/omnitrace-matrix.inst-output/perfetto-trace.proto' (993.07 KB / 0.99 MB / 0.00 GB)... Done
[omnitrace][128457][roctracer]> Outputting 'omnitrace-matrix.inst-output/roctracer.json'
[omnitrace][128457][roctracer]> Outputting 'omnitrace-matrix.inst-output/roctracer.txt'
[omnitrace][128457][wall_clock]> Outputting 'omnitrace-matrix.inst-output/wall_clock.json'
[omnitrace][128457][wall_clock]> Outputting 'omnitrace-matrix.inst-output/wall_clock.txt'
[omnitrace][128457][manager::finalize][metadata]> Outputting 'omnitrace-matrix.inst-output/metadata.json' and 'omnitrace-matrix.inst-output/functions.json'
[omnitrace][128457][0][omnitrace_finalize] Finalized
[365.282] perfetto.cc:57383 Tracing session 1 ended, total sessions:0
Now the overhead is only around 50%, compared to more than 100x before.
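Overhead figures like these are the ratio of the instrumented wall-clock time to the uninstrumented one. A minimal sketch of that arithmetic follows; the function name overhead_ratio is made up, and the timings passed to it are hypothetical placeholders, not measurements from the runs shown here.

```shell
# Hypothetical helper: ratio of instrumented to baseline wall-clock time.
# $1: baseline seconds, $2: instrumented seconds
overhead_ratio() {
    awk -v base="$1" -v inst="$2" 'BEGIN { printf "%.1f\n", inst / base }'
}

# Placeholder timings: a ratio of 1.5 corresponds to 50% overhead.
overhead_ratio 2 3
overhead_ratio 0.5 60.0
```

In practice the two inputs would come from timing the plain binary and the instrumented one under the same srun settings.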
A new directory called omnitrace-matrix.inst-output is created with the performance analysis data. We can read some of the text files.
omnitrace-matrix.inst-output/wall_clock.txt
omnitrace-matrix.inst-output/perfetto-trace.proto
You can zoom in with the W key, zoom out with S, move left with A, and move right with D.
Note: We can observe a really long hipMemcpy. The reason is that when the GPU has not been used yet, its first use carries significant overhead.
We add an empty kernel to the code and execute it, such as:
__global__ void warmup() {}
...
warmup<<<1,64>>>();
Now the duration of the hipMemcpy is as expected.
Omnitrace uses a configuration file so the user can declare which options to use.
omnitrace-avail -G omnitrace.cfg
omnitrace-avail -G omnitrace.cfg --all
To select which configuration file to use, declare: export OMNITRACE_CONFIG_FILE=/path/omnitrace.cfg
(or whatever name you chose).
The default configuration file for v1.7 looks like:
OMNITRACE_CONFIG_FILE =
OMNITRACE_USE_PERFETTO = true
OMNITRACE_USE_TIMEMORY = true
OMNITRACE_USE_SAMPLING = false
OMNITRACE_USE_PROCESS_SAMPLING = true
OMNITRACE_USE_ROCTRACER = true
OMNITRACE_USE_ROCM_SMI = true
OMNITRACE_USE_KOKKOSP = false
OMNITRACE_USE_PID = false
OMNITRACE_USE_RCCLP = false
OMNITRACE_USE_ROCPROFILER = true
OMNITRACE_USE_ROCTX = false
OMNITRACE_OUTPUT_PATH = omnitrace-%tag%-output
OMNITRACE_OUTPUT_PREFIX =
OMNITRACE_CRITICAL_TRACE = false
OMNITRACE_PERFETTO_BACKEND = inprocess
OMNITRACE_PERFETTO_BUFFER_SIZE_KB = 1024000
OMNITRACE_PERFETTO_FILL_POLICY = discard
OMNITRACE_PROCESS_SAMPLING_DURATION = -1
OMNITRACE_PROCESS_SAMPLING_FREQ = 0
OMNITRACE_ROCM_EVENTS =
OMNITRACE_SAMPLING_CPUS = 0-3
OMNITRACE_SAMPLING_DELAY = 0.5
OMNITRACE_SAMPLING_DURATION = 0
OMNITRACE_SAMPLING_FREQ = 100
OMNITRACE_SAMPLING_GPUS = all
OMNITRACE_TIME_OUTPUT = false
OMNITRACE_TIMEMORY_COMPONENTS = wall_clock,cpu_clock,page_rss,cpu_util,papi_vector
OMNITRACE_VERBOSE = 0
OMNITRACE_ENABLED = true
OMNITRACE_PAPI_EVENTS =
OMNITRACE_SUPPRESS_CONFIG = false
OMNITRACE_SUPPRESS_PARSING = false
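A configuration file does not have to be generated by omnitrace-avail; it can also be written by hand. The sketch below writes a short file using keys copied from the listing above and points OMNITRACE_CONFIG_FILE at it. It assumes that settings left out of the file keep their defaults, and the temporary directory is only for illustration.

```shell
# Write a small configuration file in a scratch directory (illustrative location).
cfg_dir=$(mktemp -d)
cfg="$cfg_dir/omnitrace.cfg"

cat > "$cfg" <<'EOF'
OMNITRACE_USE_PERFETTO = true
OMNITRACE_USE_TIMEMORY = true
OMNITRACE_USE_SAMPLING = false
OMNITRACE_TIMEMORY_COMPONENTS = wall_clock
EOF

# Tell Omnitrace which file to read in later runs from this shell.
export OMNITRACE_CONFIG_FILE="$cfg"
echo "Using configuration file: $OMNITRACE_CONFIG_FILE"
```

Keeping one file per experiment and switching between them through the environment variable avoids editing a shared configuration.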
If you see the message
/proc/sys/kernel/perf_event_paranoid has a value of 3. Disabling PAPI (requires a value <= 1)
then you cannot access the HW counters until a system administrator changes the value.
srun -n 1 --gpus 1 omnitrace-avail --all
...
| GPU | | |
|---------------------------------------|--------------|---------------------------------------|
| SQ_INSTS_VMEM_WR:device=0 | true | Number of VMEM write instructions ... |
| SQ_INSTS_VMEM_RD:device=0 | true | Number of VMEM read instructions i... |
| SQ_INSTS_SALU:device=0 | true | Number of SALU instructions issued... |
| SQ_INSTS_SMEM:device=0 | true | Number of SMEM instructions issued... |
| SQ_INSTS_FLAT:device=0 | true | Number of FLAT instructions issued... |
| SQ_INSTS_FLAT_LDS_ONLY:device=0 | true | Number of FLAT instructions issued... |
| SQ_INSTS_LDS:device=0 | true | Number of LDS instructions issued ... |
| SQ_INSTS_GDS:device=0 | true | Number of GDS instructions issued.... |
| SQ_WAIT_INST_LDS:device=0 | true | Number of wave-cycles spent waitin... |
| SQ_ACTIVE_INST_VALU:device=0 | true | regspec 71? Number of cycles the S... |
| SQ_INST_CYCLES_SALU:device=0 | true | Number of cycles needed to execute... |
| SQ_THREAD_CYCLES_VALU:device=0 | true | Number of thread-cycles used to ex... |
| SQ_LDS_BANK_CONFLICT:device=0 | true | Number of cycles LDS is stalled by... |
| TCC_HIT[0]:device=0 | true | Number of cache hits. |
| TCC_HIT[1]:device=0 | true | Number of cache hits. |
| TCC_HIT[2]:device=0 | true | Number of cache hits. |
| TCC_HIT[3]:device=0 | true | Number of cache hits. |
| TCC_HIT[4]:device=0 | true | Number of cache hits. |
| TCC_HIT[5]:device=0 | true | Number of cache hits. |
| TCC_HIT[6]:device=0 | true | Number of cache hits. |
...
OMNITRACE_ROCM_EVENTS = GPUBusy:device=0,Wavefronts:device=0,VALUBusy:device=0,L2CacheHit:device=0,MemUnitBusy:device=0
The 0 refers to GPU 0; adjust the metrics and the device indices to match the GPUs you want to measure.
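When the same metrics are wanted on several GPUs, the value can be generated instead of typed. This is a small sketch; the function name build_rocm_events is made up, and it only assembles the metric:device=N syntax shown above without checking that the metrics exist on the system.

```shell
# Hypothetical helper: combine metric names with device indices.
# $1: space-separated metric names, $2: space-separated device indices
build_rocm_events() {
    metrics=$1
    devices=$2
    out=""
    for dev in $devices; do
        for m in $metrics; do
            out="${out:+$out,}$m:device=$dev"   # add a comma only between entries
        done
    done
    printf '%s\n' "$out"
}

echo "OMNITRACE_ROCM_EVENTS = $(build_rocm_events "GPUBusy Wavefronts" "0 1")"
```

The printed line can be pasted into the configuration file; the available metric names are listed by omnitrace-avail --all.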
omnitrace-avail -C
|---------------------------------|---------------|
| COMPONENT | AVAILABLE |
|---------------------------------|---------------|
| allinea_map | false |
| caliper_marker | false |
| caliper_config | false |
| caliper_loop_marker | false |
| cpu_clock | true |
| cpu_util | true |
| craypat_counters | false |
| craypat_flush_buffer | false |
| craypat_heap_stats | false |
| craypat_record | false |
| craypat_region | false |
| current_peak_rss | true |
| gperftools_cpu_profiler | false |
| gperftools_heap_profiler | false |
| hip_event | false |
| kernel_mode_time | true |
| likwid_marker | false |
| likwid_nvmarker | false |
| malloc_gotcha | true |
| memory_allocations | true |
| monotonic_clock | true |
| monotonic_raw_clock | true |
| network_stats | true |
| num_io_in | true |
| num_io_out | true |
| num_major_page_faults | true |
| num_minor_page_faults | true |
| page_rss | true |
| papi_array<12ul> | true |
| papi_vector | true |
| peak_rss | true |
| perfetto_trace | false |
| priority_context_switch | true |
| process_cpu_clock | true |
| process_cpu_util | true |
| read_bytes | true |
| read_char | true |
| roctx_marker | false |
| system_clock | true |
| tau_marker | false |
| thread_cpu_clock | true |
| thread_cpu_util | true |
| timestamp | true |
| trip_count | true |
| user_clock | true |
| user_mode_time | true |
| virtual_memory | true |
| voluntary_context_switch | true |
| vtune_event | false |
| vtune_frame | false |
| vtune_profiler | false |
| wall_clock | true |
| written_bytes | true |
| written_char | true |
| roctracer | true |
| rocprofiler | true |
| sampling_wall_clock | true |
| sampling_cpu_clock | true |
| sampling_percent | true |
| sampling_gpu_power | true |
| sampling_gpu_temp | true |
| sampling_gpu_busy | true |
| sampling_gpu_memory_usage | true |
|---------------------------------|---------------|
omnitrace-sample --help
...
Options:
-h, -?, --help Shows this page
[DEBUG OPTIONS]
--monochrome Disable colorized output
--debug Debug output
-v, --verbose Verbose output
[GENERAL OPTIONS]
-c, --config Configuration file
-o, --output Output path. Accepts 1-2 parameters corresponding to the output path and the output prefix
-T, --trace Generate a detailed trace (perfetto output)
-P, --profile Generate a call-stack-based profile (conflicts with --flat-profile)
-F, --flat-profile Generate a flat profile (conflicts with --profile)
-H, --host Enable sampling host-based metrics for the process. E.g. CPU frequency, memory usage, etc.
-D, --device Enable sampling device-based metrics for the process. E.g. GPU temperature, memory usage, etc.
[TRACING OPTIONS]
--trace-file Specify the trace output filename. Relative filepath will be with respect to output path and output prefix.
--trace-buffer-size Size limit for the trace output (in KB)
--trace-fill-policy [ discard | ring_buffer ]
Policy for new data when the buffer size limit is reached:
- discard : new data is ignored
- ring_buffer : new data overwrites oldest data
[PROFILE OPTIONS]
--profile-format [ console | json | text ]
Data formats for profiling results
--profile-diff Generate a diff output b/t the profile collected and an existing profile from another run Accepts 1-2 parameters
corresponding to the input path and the input prefix
[HOST/DEVICE (PROCESS SAMPLING) OPTIONS]
--process-freq Set the default host/device sampling frequency (number of interrupts per second)
--process-wait Set the default wait time (i.e. delay) before taking first host/device sample (in seconds of realtime)
--process-duration Set the duration of the host/device sampling (in seconds of realtime)
--cpus CPU IDs for frequency sampling. Supports integers and/or ranges
--gpus GPU IDs for SMI queries. Supports integers and/or ranges
[GENERAL SAMPLING OPTIONS]
-f, --freq Set the default sampling frequency (number of interrupts per second)
-w, --wait Set the default wait time (i.e. delay) before taking first sample (in seconds). This delay time is based on the clock
of the sampler, i.e., a delay of 1 second for CPU-clock sampler may not equal 1 second of realtime
-d, --duration Set the duration of the sampling (in seconds of realtime). I.e., it is possible (currently) to set a CPU-clock time
delay that exceeds the real-time duration... resulting in zero samples being taken
-t, --tids Specify the default thread IDs for sampling, where 0 (zero) is the main thread and each thread created by the target
application is assigned an atomically incrementing value.
[SAMPLING TIMER OPTIONS]
--cputime Sample based on a CPU-clock timer (default). Accepts zero or more arguments:
0. Enables sampling based on CPU-clock timer.
1. Interrupts per second. E.g., 100 == sample every 10 milliseconds of CPU-time.
2. Delay (in seconds of CPU-clock time). I.e., how long each thread should wait before taking first sample.
3+ Thread IDs to target for sampling, starting at 0 (the main thread).
May be specified as index or range, e.g., '0 2-4' will be interpreted as:
sample the main thread (0), do not sample the first child thread but sample the 2nd, 3rd, and 4th child threads
--realtime Sample based on a real-clock timer. Accepts zero or more arguments:
0. Enables sampling based on real-clock timer.
1. Interrupts per second. E.g., 100 == sample every 10 milliseconds of realtime.
2. Delay (in seconds of real-clock time). I.e., how long each thread should wait before taking first sample.
3+ Thread IDs to target for sampling, starting at 0 (the main thread).
May be specified as index or range, e.g., '0 2-4' will be interpreted as:
sample the main thread (0), do not sample the first child thread but sample the 2nd, 3rd, and 4th child threads
When sampling with a real-clock timer, please note that enabling this will cause threads which are typically "idle"
to consume more resources since, while idle, the real-clock time increases (and therefore triggers taking samples)
whereas the CPU-clock time does not.
[BACKEND OPTIONS] (These options control region information captured w/o sampling or instrumentation)
-I, --include [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ]
Include data from these backends
-E, --exclude [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ]
Exclude data from these backends
[HARDWARE COUNTER OPTIONS]
-C, --cpu-events Set the CPU hardware counter events to record (ref: `omnitrace-avail -H -c CPU`)
-G, --gpu-events Set the GPU hardware counter events to record (ref: `omnitrace-avail -H -c GPU`)
[MISCELLANEOUS OPTIONS]
-i, --inlines Include inline info in output when available
--hsa-interrupt [ 0 | 1 ] Set the value of the HSA_ENABLE_INTERRUPT environment variable.
ROCm version 5.2 and older have a bug which will cause a deadlock if a sample is taken while waiting for the signal
that a kernel completed -- which happens when sampling with a real-clock timer. We require this option to be set to
when --realtime is specified to make users aware that, while this may fix the bug, it can have a negative impact on
performance.
Values:
0 avoid triggering the bug, potentially at the cost of reduced performance
1 do not modify how ROCm is notified about kernel completion
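The help text above says that `--hsa-interrupt` must be set whenever `--realtime` is used, because ROCm 5.2 and older (this walkthrough loads rocm/5.0.2) can deadlock when a sample is taken while waiting for kernel completion. The following is a minimal sketch of how such a command could be assembled from the options listed above. The frequency and delay values are assumptions chosen for illustration, and the script only prints the command so it can be reviewed before launching.

```shell
#!/bin/sh
# Illustrative values (assumptions); tune them for your application.
FREQ=100                 # -f: sampling interrupts per second
DELAY=1                  # -w: seconds to wait before the first sample
EXE=./MatrixTranspose    # binary built earlier in this walkthrough

# --realtime selects the real-clock timer; --hsa-interrupt 0 avoids the
# deadlock described in the help text, possibly at some performance cost.
CMD="omnitrace-sample --realtime --hsa-interrupt 0 -f $FREQ -w $DELAY -- $EXE"

# Print the command for review. In the allocation used in this document
# it would be prefixed with: srun -n 1 --gpus 1
echo "$CMD"
```

Real-clock sampling also makes idle threads take samples, so the default CPU-clock timer is usually the cheaper choice unless wall-time behaviour is what you want to observe.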
srun -n 1 --gpus 1 omnitrace-sample -- ./MatrixTranspose
[omnitrace][omnitrace_init_tooling] Instrumentation mode: Sampling
______ .___ ___. .__ __. __ .___________..______ ___ ______ _______
/ __ \ | \/ | | \ | | | | | || _ \ / \ / || ____|
| | | | | \ / | | \| | | | `---| |----`| |_) | / ^ \ | ,----'| |__
| | | | | |\/| | | . ` | | | | | | / / /_\ \ | | | __|
| `--' | | | | | | |\ | | | | | | |\ \----./ _____ \ | `----.| |____
\______/ |__| |__| |__| \__| |__| |__| | _| `._____/__/ \__\ \______||_______|
[omnitrace] /proc/sys/kernel/perf_event_paranoid has a value of 3. Disabling PAPI (requires a value <= 1)...
[omnitrace] In order to enable PAPI support, run 'echo N | sudo tee /proc/sys/kernel/perf_event_paranoid' where N is < 2
[609.773] perfetto.cc:55910 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
Device name
Device name
[omnitrace][107567][2042][hip_activity_callback] 1 :: Discarding kernel roctracer activity record which ended before it started :: CopyHostToDevice :: CopyHostToDevice :: cid=7, time_ns=(322609988575074:322609987794326) delta=-780748, device=0, queue=0, pid=0, tid=0, op=COPY
PASSED!
[omnitrace][107567][0][omnitrace_finalize]
[omnitrace][107567][0][omnitrace_finalize] finalizing...
[omnitrace][107567][0][omnitrace_finalize]
[omnitrace][107567][0][omnitrace_finalize] omnitrace/process/107567 : 1.225706 sec wall_clock, 218.244 MB peak_rss, 211.304 MB page_rss, 0.210000 sec cpu_clock, 17.1 % cpu_util [laps: 1]
[omnitrace][107567][0][omnitrace_finalize] omnitrace/process/107567/thread/0 : 1.221656 sec wall_clock, 0.209007 sec thread_cpu_clock, 17.1 % thread_cpu_util, 217.820 MB peak_rss [laps: 2]
[omnitrace][107567][0][omnitrace_finalize]
[omnitrace][107567][0][omnitrace_finalize] Finalizing perfetto...
[omnitrace][107567][perfetto]> Outputting '/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/omnitrace-MatrixTranspose-output/2022-10-10_15.28/perfetto-trace.proto' (23.16 KB / 0.02 MB / 0.00 GB)... Done
[omnitrace][107567][cpu_util]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/cpu_util.json'
[omnitrace][107567][cpu_util]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/cpu_util.txt'
[omnitrace][107567][cpu_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/cpu_clock.json'
[omnitrace][107567][cpu_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/cpu_clock.txt'
[omnitrace][107567][page_rss]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/page_rss.json'
[omnitrace][107567][page_rss]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/page_rss.txt'
[omnitrace][107567][roctracer]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/roctracer.json'
[omnitrace][107567][roctracer]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/roctracer.txt'
[omnitrace][107567][wall_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/wall_clock.json'
[omnitrace][107567][wall_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/wall_clock.txt'
[omnitrace][107567][metadata]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/metadata-107567.json' and 'omnitrace-MatrixTranspose-output/2022-10-10_15.28/functions-107567.json'
[omnitrace][107567][0][omnitrace_finalize] Finalized
[611.040] perfetto.cc:57383 Tracing session 1 ended, total sessions:0
srun -n 1 --gpus 1 omnitrace-sample -H -D -- ./MatrixTranspose
[omnitrace][omnitrace_init_tooling] Instrumentation mode: Sampling
______ .___ ___. .__ __. __ .___________..______ ___ ______ _______
/ __ \ | \/ | | \ | | | | | || _ \ / \ / || ____|
| | | | | \ / | | \| | | | `---| |----`| |_) | / ^ \ | ,----'| |__
| | | | | |\/| | | . ` | | | | | | / / /_\ \ | | | __|
| `--' | | | | | | |\ | | | | | | |\ \----./ _____ \ | `----.| |____
\______/ |__| |__| |__| \__| |__| |__| | _| `._____/__/ \__\ \______||_______|
[omnitrace] /proc/sys/kernel/perf_event_paranoid has a value of 3. Disabling PAPI (requires a value <= 1)...
[omnitrace] In order to enable PAPI support, run 'echo N | sudo tee /proc/sys/kernel/perf_event_paranoid' where N is < 2
[619.733] perfetto.cc:55910 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
Device name
Device name
[omnitrace][9441][2041][hip_activity_callback] 1 :: Discarding kernel roctracer activity record which ended before it started :: CopyHostToDevice :: CopyHostToDevice :: cid=7, time_ns=(642619960562938:642619959822068) delta=-740870, device=0, queue=0, pid=0, tid=0, op=COPY
PASSED!
[omnitrace][9441][0][omnitrace_finalize]
[omnitrace][9441][0][omnitrace_finalize] finalizing...
[omnitrace][9441][0][omnitrace_finalize]
[omnitrace][9441][0][omnitrace_finalize] omnitrace/process/9441 : 1.234547 sec wall_clock, 218.056 MB peak_rss, 211.100 MB page_rss, 0.280000 sec cpu_clock, 22.7 % cpu_util [laps: 1]
[omnitrace][9441][0][omnitrace_finalize] omnitrace/process/9441/thread/0 : 1.229435 sec wall_clock, 0.211211 sec thread_cpu_clock, 17.2 % thread_cpu_util, 217.644 MB peak_rss [laps: 2]
[omnitrace][9441][0][omnitrace_finalize]
[omnitrace][9441][0][omnitrace_finalize] Finalizing perfetto...
[omnitrace][9441][perfetto]> Outputting '/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/omnitrace-MatrixTranspose-output/2022-10-18_22.52/perfetto-trace.proto' (112.13 KB / 0.11 MB / 0.00 GB)... Done
[omnitrace][9441][cpu_util]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/cpu_util.json'
[omnitrace][9441][cpu_util]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/cpu_util.txt'
[omnitrace][9441][cpu_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/cpu_clock.json'
[omnitrace][9441][cpu_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/cpu_clock.txt'
[omnitrace][9441][page_rss]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/page_rss.json'
[omnitrace][9441][page_rss]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/page_rss.txt'
[omnitrace][9441][roctracer]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/roctracer.json'
[omnitrace][9441][roctracer]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/roctracer.txt'
[omnitrace][9441][wall_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/wall_clock.json'
[omnitrace][9441][wall_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/wall_clock.txt'
[omnitrace][9441][metadata]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/metadata-9441.json' and 'omnitrace-MatrixTranspose-output/2022-10-18_22.52/functions-9441.json'
[omnitrace][9441][0][omnitrace_finalize] Finalized
[621.025] perfetto.cc:57383 Tracing session 1 ended, total sessions:0
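Each run above writes its results to a new timestamped subdirectory of omnitrace-MatrixTranspose-output (2022-10-10_15.28, then 2022-10-18_22.52). A small helper along these lines can pick out the most recent one; `latest_run_dir` is a hypothetical name, not part of Omnitrace, and the sketch assumes the subdirectory names keep the timestamp format shown in the logs.

```shell
#!/bin/sh
# Print the most recent timestamped run directory under a base directory.
# Names such as 2022-10-18_22.52 sort chronologically as plain strings.
latest_run_dir() {
    ls -1d "$1"/*/ 2>/dev/null | sort | tail -n 1
}

run_dir=$(latest_run_dir omnitrace-MatrixTranspose-output)
if [ -n "$run_dir" ]; then
    # List the generated files (perfetto-trace.proto, wall_clock.txt, ...)
    ls "$run_dir"
else
    echo "no omnitrace output found in the current directory" >&2
fi
```

The perfetto-trace.proto file in that directory is the trace reported in the log, while the .txt files hold the text-format profiles.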
Declare the GPU hardware counters (metrics) to collect directly on the command line with the -G option:
srun -n 1 --gpus 1 omnitrace-sample -TPHD -G "GPUBusy:device=0,Wavefronts:device=0,VALUBusy:device=0,L2CacheHit:device=0,MemUnitBusy:device=0" -- ./MatrixTranspose
[omnitrace][omnitrace_init_tooling] Instrumentation mode: Sampling
______ .___ ___. .__ __. __ .___________..______ ___ ______ _______
/ __ \ | \/ | | \ | | | | | || _ \ / \ / || ____|
| | | | | \ / | | \| | | | `---| |----`| |_) | / ^ \ | ,----'| |__
| | | | | |\/| | | . ` | | | | | | / / /_\ \ | | | __|
| `--' | | | | | | |\ | | | | | | |\ \----./ _____ \ | `----.| |____
\______/ |__| |__| |__| \__| |__| |__| | _| `._____/__/ \__\ \______||_______|
[omnitrace] /proc/sys/kernel/perf_event_paranoid has a value of 3. Disabling PAPI (requires a value <= 1)...
[omnitrace] In order to enable PAPI support, run 'echo N | sudo tee /proc/sys/kernel/perf_event_paranoid' where N is < 2
[681.799] perfetto.cc:55910 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
Device name
Device name
[omnitrace][31487][2041][hip_activity_callback] 1 :: Discarding kernel roctracer activity record which ended before it started :: CopyHostToDevice :: CopyHostToDevice :: cid=7, time_ns=(646682055581903:646682054882615) delta=-699288, device=0, queue=0, pid=0, tid=0, op=COPY
PASSED!
[omnitrace][31487][0][omnitrace_finalize]
[omnitrace][31487][0][omnitrace_finalize] finalizing...
[omnitrace][31487][0][omnitrace_finalize]
[omnitrace][31487][0][omnitrace_finalize] omnitrace/process/31487 : 1.266008 sec wall_clock, 224.308 MB peak_rss, 217.502 MB page_rss, 0.320000 sec cpu_clock, 25.3 % cpu_util [laps: 1]
[omnitrace][31487][0][omnitrace_finalize] omnitrace/process/31487/thread/0 : 1.259163 sec wall_clock, 0.236662 sec thread_cpu_clock, 18.8 % thread_cpu_util, 223.840 MB peak_rss [laps: 2]
[omnitrace][31487][0][omnitrace_finalize]
[omnitrace][31487][0][omnitrace_finalize] Finalizing perfetto...
[omnitrace][31487][perfetto]> Outputting '/scratch/project_462000075/markoman/HIP/samples/2_Cookbook/0_MatrixTranspose/omnitrace-MatrixTranspose-output/2022-10-19_00.00/perfetto-trace.proto' (218.72 KB / 0.22 MB / 0.00 GB)... Done
[omnitrace][31487][rocprof-device-0-GPUBusy]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-GPUBusy.json'
[omnitrace][31487][rocprof-device-0-GPUBusy]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-GPUBusy.txt'
[omnitrace][31487][rocprof-device-0-Wavefronts]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-Wavefronts.json'
[omnitrace][31487][rocprof-device-0-Wavefronts]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-Wavefronts.txt'
[omnitrace][31487][rocprof-device-0-VALUBusy]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-VALUBusy.json'
[omnitrace][31487][rocprof-device-0-VALUBusy]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-VALUBusy.txt'
[omnitrace][31487][rocprof-device-0-L2CacheHit]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-L2CacheHit.json'
[omnitrace][31487][rocprof-device-0-L2CacheHit]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-L2CacheHit.txt'
[omnitrace][31487][rocprof-device-0-MemUnitBusy]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-MemUnitBusy.json'
[omnitrace][31487][rocprof-device-0-MemUnitBusy]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/rocprof-device-0-MemUnitBusy.txt'
[omnitrace][31487][cpu_util]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/cpu_util.json'
[omnitrace][31487][cpu_util]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/cpu_util.txt'
[omnitrace][31487][cpu_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/cpu_clock.json'
[omnitrace][31487][cpu_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/cpu_clock.txt'
[omnitrace][31487][page_rss]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/page_rss.json'
[omnitrace][31487][page_rss]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/page_rss.txt'
[omnitrace][31487][roctracer]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/roctracer.json'
[omnitrace][31487][roctracer]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/roctracer.txt'
[omnitrace][31487][sampling_gpu_memory_usage]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/sampling_gpu_memory_usage.json'
[omnitrace][31487][sampling_gpu_memory_usage]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/sampling_gpu_memory_usage.txt'
[omnitrace][31487][sampling_gpu_power]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/sampling_gpu_power.json'
[omnitrace][31487][sampling_gpu_power]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/sampling_gpu_power.txt'
[omnitrace][31487][sampling_gpu_temperature]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/sampling_gpu_temperature.json'
[omnitrace][31487][sampling_gpu_temperature]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/sampling_gpu_temperature.txt'
[omnitrace][31487][sampling_gpu_busy_percent]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/sampling_gpu_busy_percent.json'
[omnitrace][31487][sampling_gpu_busy_percent]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/sampling_gpu_busy_percent.txt'
[omnitrace][31487][wall_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/wall_clock.json'
[omnitrace][31487][wall_clock]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/wall_clock.txt'
[omnitrace][31487][metadata]> Outputting 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/metadata-31487.json' and 'omnitrace-MatrixTranspose-output/2022-10-19_00.00/functions-31487.json'
[omnitrace][31487][0][omnitrace_finalize] Finalized
[683.153] perfetto.cc:57383 Tracing session 1 ended, total sessions:0
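The value passed to -G in the command above is a comma-separated list in which every counter name carries a `:device=N` suffix. Typing that by hand gets error-prone as the list grows, so the string can be generated instead. This is a sketch only: `gpu_events` is a hypothetical helper, and the counter names are the ones used above (the help text points to `omnitrace-avail -H -c GPU` for listing what is available on a given system).

```shell
#!/bin/sh
# Join counter names into the "Name:device=N,Name:device=N,..." form.
# Usage: gpu_events <device-id> <counter> [<counter> ...]
gpu_events() {
    device=$1
    shift
    out=""
    for name in "$@"; do
        # Prepend a comma only when out already holds an entry.
        out="${out:+$out,}${name}:device=${device}"
    done
    printf '%s\n' "$out"
}

EVENTS=$(gpu_events 0 GPUBusy Wavefronts VALUBusy L2CacheHit MemUnitBusy)

# Print the resulting command for review instead of running it.
echo "omnitrace-sample -TPHD -G \"$EVENTS\" -- ./MatrixTranspose"
```

Each requested counter then shows up as its own rocprof-device-0-<counter> .json/.txt pair in the output directory, as the log above shows.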