Allocate an interactive session with salloc:
salloc -N 1 --ntasks=1 --partition=gpu-dev --gpus=1 -A XXX --time=00:35:00
srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace
srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace --brief --description
srun -n 1 omnitrace-avail -G omnitrace.cfg --all
export OMNITRACE_CONFIG_FILE=/path/omnitrace.cfg
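The generated omnitrace.cfg lists every setting with its default value. A few of the settings that this exercise toggles later look roughly like this (values are illustrative, check the defaults in your generated file):
# excerpt from omnitrace.cfg (illustrative values only)
OMNITRACE_PROFILE      = false
OMNITRACE_ROCM_EVENTS  =
OMNITRACE_USE_TIMEMORY = false
OMNITRACE_FLAT_PROFILE = false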
cp -r /your_path/exercises/HPCTrainingExamples/ .
cd HPCTrainingExamples/HIP/jacobi
make clean;make -f Makefile.cray
The build produces the binary Jacobi_hip.
Now execute the binary:
time srun -n 1 --gpus 1 Jacobi_hip -g 1 1
Check the duration
time srun -n 1 --gpus 1 omnitrace-instrument -- Jacobi_hip -g 1 1
and check the duration.
Check which functions are present in the binary (these are the candidates for instrumentation):
nm --demangle Jacobi_hip | egrep -i ' (t|u) '
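If you want to check whether a specific routine is visible to the instrumenter, you can filter the symbol list; the routine name below is only a guess at what the Jacobi source contains:
# keep only symbols whose demangled name mentions "jacobi" (hypothetical filter)
nm --demangle Jacobi_hip | egrep -i ' (t|u) ' | grep -i jacobi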
srun -n 1 --gpus 1 omnitrace-instrument -v 1 --simulate --print-available functions -- ./Jacobi_hip -g 1 1
Allocation: salloc -N 1 --ntasks=2 --partition=small-g --gpus=2 -A project_465000532 --time=00:35:00
Binary rewriting: srun -n 1 --gpus 1 omnitrace-instrument -v -1 --print-available functions -o jacobi.inst -- ./Jacobi_hip
or
srun -n 1 --gpus 1 omnitrace-instrument -o jacobi.inst -- ./Jacobi_hip
We created a new instrumented binary called jacobi.inst
Activate profiling by declaring in your omnitrace.cfg:
OMNITRACE_PROFILE = true
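Alternatively, a setting from the cfg file can be exported as an environment variable of the same name, which is handy for quick experiments:
# equivalent to setting OMNITRACE_PROFILE = true in omnitrace.cfg
export OMNITRACE_PROFILE=true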
Executing the new instrumented binary: time srun -n 2 --gpus 2 omnitrace-run -- ./jacobi.inst -g 2 1
and check the duration
See the list of the instrumented GPU calls: cat omnitrace-jacobi.inst-output/TIMESTAMP/roctracer-0.txt
See the list of the instrumented CPU calls: cat omnitrace-jacobi.inst-output/TIMESTAMP/wallclock-0.txt
or wallclock-1.txt
Check the MPI calls in these files.
Copy the Perfetto output file perfetto-trace-0.proto (or perfetto-trace-1.proto) to your laptop, open the web page https://ui.perfetto.dev/, click to open the trace and select the file.
Edit your omnitrace.cfg:
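The exact settings to enable here are not reproduced in this text; a sketch, assuming the standard Omnitrace sampling options (verify the names with omnitrace-avail --all):
# enable call-stack sampling (setting names assumed, values illustrative)
OMNITRACE_USE_SAMPLING  = true
OMNITRACE_SAMPLING_FREQ = 100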
Execute again the instrumented binary and now you can see the call-stack when you visualize with perfetto.
srun -n 1 --gpus 1 omnitrace-avail --all
OMNITRACE_ROCM_EVENTS = GPUBusy,Wavefronts,VALUBusy,L2CacheHit,MemUnitBusy
srun -n 1 --gpus 1 omnitrace-run -- ./jacobi.inst -g 1 1
and copy the perfetto file and visualize it.
Check the file omnitrace-binary-output/timestamp/wall_clock.txt (replace binary and timestamp with your information).
Set OMNITRACE_USE_TIMEMORY = true and OMNITRACE_FLAT_PROFILE = true, execute the code and open again the file omnitrace-binary-output/timestamp/wall_clock.txt.
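In omnitrace.cfg these are plain key = value lines, for example:
# flat, timemory-style text profile
OMNITRACE_USE_TIMEMORY = true
OMNITRACE_FLAT_PROFILE = true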
cp -r /project/project_465000532/exercises/causal .
We use an example; you can see the source code in ./causal/source
Basically this tool, ./causal/causal-cpu-omni, creates two threads and runs a fast function and a slow one. We can define the fraction of time the fast routine takes compared to the slow one; by default the fast routine takes 70% of the time of the slow one.
Read the script ./causal/run-causal-demo.sh before executing it.
We create a new cfg file called causal.cfg to activate the causal profiling, and we declare it in OMNITRACE_CONFIG_FILE.
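For example, something along these lines (the actual contents of causal.cfg are not shown in this text, so the setting below is an assumption; list the causal options with omnitrace-avail --all):
# point Omnitrace at the causal config (path illustrative)
export OMNITRACE_CONFIG_FILE=$PWD/causal.cfg
# causal.cfg would then enable causal profiling, e.g. (assumed setting name):
# OMNITRACE_MODE = causal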
We define virtual speedups like SPEEDUPS="0,0,10,20-40:5,50,60-90:15"; this means the profiling will investigate no speedup (0%, listed twice), then 10%, 20%, 25%, 30%, 35%, 40%, 50%, 60%, 75%, 90%. The syntax 20-40:5 means the range 20-40% in increments of 5%.
First call:
We do 5 executions, testing the various speedups (given with -s) based on the defined functions (-m). As we want to focus on specific source files and not on others, we use the -S option (source scope), and the data are written to the output file experiment.func.
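The corresponding command lives in run-causal-demo.sh; a rough sketch of such an omnitrace-causal invocation is shown below (flag spellings, the source-file name and the iteration count are assumptions, the script is authoritative):
# 5 runs, function mode, speedups from SPEEDUPS, restricted to one source file (names assumed)
omnitrace-causal -n 5 -m function -s ${SPEEDUPS} -S causal.cpp -o experiment.func -- ./causal-cpu-omni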
In this case we have 10 executions, again with speedups, but in line mode, in the same source file, using either cpu_slow_func or cpu_fast_func, and the experiment data are written to the experiment.line file.
The major difference here is the option -e, which means end-to-end: there is one virtual speedup across the whole execution, and thus it can take a lot of executions to evaluate all the virtual speedups.
Run run-causal-demo.sh (it can take some time); a directory omnitrace-output/ is then created with subdirectories.
You need to install the viewer somewhere: pip install omnitrace-causal-viewer
You can download this to your laptop (or ssh forward to LUMI) and execute the command:
omnitrace-causal-plot -w omnitrace-output/causal-cpu-omni/causal/
The results are already located in ./causal/results
You can also visualize the results on the web page https://plasma-umass.org/coz/ by loading the JSON files.
Run srun -n 1 --gpus 1 omniperf profile -h
to see all the options
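The profiling step that creates the workload is not repeated here; it presumably mirrors the roofline command shown further below, without --roof-only:
# profile the dgemm example (binary and arguments taken from the roofline example below)
srun -n 1 --gpus 1 omniperf profile -n dgemm -- ./dgemm -m 8192 -n 8192 -k 8192 -i 1 -r 10 -d 0 -o dgemm.csv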
Now a workload is created in the directory workloads with the name dgemm (the argument of -n). So we can analyze it.
If you want only the roofline analysis, then execute: srun -n 1 omniperf profile -n dgemm --roof-only -- ./dgemm -m 8192 -n 8192 -k 8192 -i 1 -r 10 -d 0 -o dgemm.csv
When you use the --roof-only option, two PDF files will be created for the roofline, one for FP32/FP64 and one for FP16/INT8.
If you want another PDF file with the kernel names, then add the --kernel-names option to the above command; this will create an additional PDF file with the markers and their kernel names.
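A sketch of the analyze step that produces dgemm_analyze.txt (the workload sub-directory depends on your GPU architecture, so the path is an assumption):
# analyze the profiled workload and save the report to a text file
srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/MI200/ &> dgemm_analyze.txt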
There is no need for srun to analyze, but we use it to avoid everybody running on the login node. Explore the file dgemm_analyze.txt.
But you need to know the code of the IP block. You can add -k 0, where 0 is the id of one of the top 10 kernels, or you can give the kernel name instead; this way you analyze only a specific kernel. This has to be done 10 times to analyze the 10 kernels and visualize each of them separately.
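For example, to restrict the analysis to one IP block or to a single kernel (the block code, kernel id and workload path below are placeholders):
# analyze only one IP block (7.1.2 is a placeholder block code)
srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/MI200/ -b 7.1.2
# analyze only the kernel with id 0 from the top-10 list
srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/MI200/ -k 0 &> dgemm_analyze_k0.txt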