# HIP Exercises

You will need to replace the project accounts with the ones for your system.

Get the examples:

`git clone https://github.com/amd/HPCTrainingExamples.git`

or copy them:

`cp -r /project/project_462000125/exercises/AMD/HPCTrainingExamples/ .`

We assume that you have already allocated resources with `salloc`, for example:

`salloc -N 1 -p small-g --gpus=1 -t 10:00 -A project_462000125`

```
module load craype-accel-amd-gfx90a
module load PrgEnv-amd
module load rocm
```

## Hipify

We'll use the same HPCTrainingExamples that were downloaded for the first exercise.

Get a node allocation:

```
salloc -N 1 --ntasks=1 --gpus=1 -p small-g -A project_462000125 -t 00:10:00
```

A batch version of the example is also shown.

### Hipify Examples

#### Exercise 1: Manual code conversion from CUDA to HIP (10 min)

Choose one or more of the CUDA samples in the `HPCTrainingExamples/HIPIFY/mini-nbody/cuda` directory and manually convert it to HIP. Tip: for example, `cudaMalloc` becomes `hipMalloc` (a small sketch of this kind of renaming is shown at the end of this section). Some suggested files are `nbody-block.cu`, `nbody-orig.cu`, `nbody-soa.cu`.

You'll want to compile on the node you've been allocated so that `hipcc` will choose the correct GPU architecture.

`hipify-perl nbody-block.cu > nbody-block.cpp`

`hipify-perl -inplace nbody-block.cu`

Now two files are created/modified: `nbody-block.cu.prehip` (the saved original) and the converted `nbody-block.cu`.

#### Exercise 2: Code conversion from CUDA to HIP using HIPify tools (10 min)

Use the `hipify-perl` script to "hipify" the CUDA samples you manually converted to HIP in Exercise 1. `hipify-perl` is in the `$ROCM_PATH/hip/bin` directory and should be in your path.

First test the conversion to see what will be converted:

```
hipify-perl -no-output -print-stats nbody-orig.cu
```

You'll see the statistics of HIP APIs that will be generated:

```
[HIPIFY] info: file 'nbody-orig.cu' statistics:
  CONVERTED refs count: 10
  TOTAL lines of code: 91
  WARNINGS: 0
[HIPIFY] info: CONVERTED refs by names:
  cudaFree => hipFree: 1
  cudaMalloc => hipMalloc: 1
  cudaMemcpy => hipMemcpy: 2
  cudaMemcpyDeviceToHost => hipMemcpyDeviceToHost: 1
  cudaMemcpyHostToDevice => hipMemcpyHostToDevice: 1
```

Now let's actually do the conversion:

```
hipify-perl nbody-orig.cu > nbody-orig.cpp
```

Compile the HIP program:

```
hipcc --offload-arch=gfx90a -DSHMOO -I ../ nbody-orig.cpp -o nbody-orig
```

The `#define SHMOO` fixes some timer printouts. Add `--offload-arch=<gpu_type>` to specify the GPU type and avoid autodetection issues when running on a single GPU on a node.

* Fix any compiler issues, for example, if there was something that didn't hipify correctly.
* Be on the lookout for hard-coded Nvidia-specific things like warp sizes and PTX.

Run the program:

```
srun ./nbody-orig
```

A batch version of Exercise 2 is:

```
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks=1
#SBATCH --gpus=1
#SBATCH -p small-g
#SBATCH -A project_462000125
#SBATCH -t 00:10:00

module load craype-accel-amd-gfx90a
module load rocm

cd HPCTrainingExamples/mini-nbody/cuda
hipify-perl -print-stats nbody-orig.cu > nbody-orig.cpp
hipcc --offload-arch=gfx90a -DSHMOO -I ../ nbody-orig.cpp -o nbody-orig
srun ./nbody-orig
cd ../../..
```
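For Exercise 1, the manual port is mostly a one-to-one renaming of the runtime API calls. The snippet below is a minimal, hypothetical sketch (not taken from the nbody samples) of what that renaming looks like; the kernel body itself is usually unchanged.

```cpp
// CUDA original (for comparison):
//   cudaMalloc(&d_x, n * sizeof(float));
//   cudaMemcpy(d_x, x, n * sizeof(float), cudaMemcpyHostToDevice);
//   scale<<<blocks, threads>>>(d_x, n);
//   cudaMemcpy(x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);
//   cudaFree(d_x);

// Hand-converted HIP version (hypothetical example, not one of the samples):
#include <hip/hip_runtime.h>

__global__ void scale(float *x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // kernel code is unchanged
  if (i < n) x[i] *= 2.0f;
}

void run(float *x, int n) {
  float *d_x;
  hipMalloc(&d_x, n * sizeof(float));
  hipMemcpy(d_x, x, n * sizeof(float), hipMemcpyHostToDevice);
  scale<<<(n + 255) / 256, 256>>>(d_x, n);  // triple-chevron launches also compile with hipcc
  hipMemcpy(x, d_x, n * sizeof(float), hipMemcpyDeviceToHost);
  hipFree(d_x);
}
```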
Notes:

* The hipify tools do not check correctness.
* `hipconvertinplace-perl` is a convenience script that runs `hipify-perl -inplace -print-stats` on a whole directory.

## Omnitools: Performance Analysis Tools for AMD GPUs

### Omnitrace

Note: Omnitrace was just installed on /scratch/omnitrace but has not been tested yet: `source /scratch/omnitrace/share/omnitrace/setup-env.sh`

* Load Omnitrace:

```
module load LUMI/22.08 partition/G rocm/5.3.3
module use /project/project_465000532/software/omnitrace_rocm533/share/modulefiles/
module load omnitrace/1.10.0
```

* Allocate resources with `salloc`: `salloc -N 1 --ntasks=1 --partition=small-g --gpus=1 -A project_462000125 --time=00:35:00`
* Check the various options and their values; the second command also prints a brief description of each option:
  * `srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace`
  * `srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace --brief --description`
* Create an Omnitrace configuration file with a description per option (a minimal hand-written sketch is shown after this list): `srun -n 1 omnitrace-avail -G omnitrace.cfg --all`
* Declare that this configuration file should be used: `export OMNITRACE_CONFIG_FILE=/path/omnitrace.cfg`
* Get the training examples: `cp -r /project/project_462000125/exercises/HPCTrainingExamples/ .`
<!--
* Compile and execute saxpy
  * `cd HPCTrainingExamples/HIP/saxpy`
  * `hipcc --offload-arch=gfx90a -O3 -o saxpy saxpy.cpp`
  * `time srun -n 1 ./saxpy`
  * Check the duration
-->
* Compile and execute Jacobi
  * `cd HPCTrainingExamples/HIP/jacobi`
  * `make clean; make`
  * The binary is `Jacobi_hip`
<!--
* Need to make some changes to the makefile
  * ``MPICC=$(PREP) `which CC` ``
  * `MPICFLAGS+=$(CFLAGS) -I${CRAY_MPICH_PREFIX}/include`
  * `MPILDFLAGS+=$(LDFLAGS) -L${CRAY_MPICH_PREFIX}/lib -lmpich`
  * comment out ``# $(error Unknown MPI version! Currently can detect mpich or open-mpi)``
-->
* Now execute the binary
  * `time srun -n 1 --gpus 1 ./Jacobi_hip -g 1 1`
  * Check the duration
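For reference, an Omnitrace configuration file is just a list of `OMNITRACE_*` settings. The sketch below writes a minimal file by hand using only options that appear later in these exercises; the `omnitrace-avail -G` command above generates a much more complete, documented one.

```bash
# Minimal hand-written omnitrace.cfg (a sketch; omnitrace-avail -G produces the full option set)
cat > omnitrace.cfg << 'EOF'
OMNITRACE_USE_SAMPLING  = false
OMNITRACE_USE_TIMEMORY  = true
OMNITRACE_FLAT_PROFILE  = false
EOF

# Tell Omnitrace to use this file
export OMNITRACE_CONFIG_FILE=$PWD/omnitrace.cfg
```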
### Dynamic instrumentation (it will take a long time or may fail)

* Execute dynamic instrumentation: `time srun -n 1 --gpus 1 omnitrace-instrument -- Jacobi_hip -g 1 1` and check the duration
<!--
* Execute dynamic instrumentation: `time srun -n 1 --gpus 1 omnitrace-instrument -- ./Jacobi_hip -g 1 1` and check the duration (may fail?)
-->
* For the Jacobi example, since dynamic instrumentation would take a long time, check which symbols the binary calls and which would get instrumented: `nm --demangle Jacobi_hip | egrep -i ' (t|u) '`
* List the available functions to instrument (**it can take a long time**): `srun -n 1 --gpus 1 omnitrace-instrument -v 1 --simulate --print-available functions -- ./Jacobi_hip -g 1 1`
* The `--simulate` option means that the binary is not actually executed.

### Binary rewriting (to be used with MPI codes; it decreases the overhead)

* Allocation: `salloc -N 1 --ntasks=2 --partition=small-g --gpus=2 -A project_462000125 --time=00:35:00`
* Binary rewriting: `srun -n 1 --gpus 1 omnitrace-instrument -v -1 --print-available functions -o jacobi.inst -- ./Jacobi_hip` or `srun -n 1 --gpus 1 omnitrace-instrument -o jacobi.inst -- ./Jacobi_hip`
* This creates a new instrumented binary called `jacobi.inst`.
* Execute the new instrumented binary: `time srun -n 2 --gpus 2 omnitrace-run -- ./jacobi.inst -g 2 1` and check the duration
* See the list of the instrumented GPU calls: `cat omnitrace-jacobi.inst-output/TIMESTAMP/roctracer-0.txt`
* See the list of the instrumented CPU calls: `cat omnitrace-jacobi.inst-output/TIMESTAMP/wallclock-0.txt` or `wallclock-1.txt`
* Check the MPI calls

### Visualization

* Copy the `perfetto-trace.proto` file to your laptop, open the web page https://ui.perfetto.dev/, click to open a trace and select the file `perfetto-trace-0.proto` or `perfetto-trace-1.proto`.
* Where are all the MPI calls?
* If an MPI call is not made from the main call-stack, then you need to profile the call-stack (default depth 64 levels).

### Call-stack

Edit your `omnitrace.cfg`:

```
OMNITRACE_USE_SAMPLING = true
OMNITRACE_SAMPLING_FREQ = 100
```

Execute the instrumented binary again; now you can see the call-stack when you visualize with Perfetto.

### Hardware counters

* See a list of all the counters: `srun -n 1 --gpus 1 omnitrace-avail --all`
* Declare in your configuration file: `OMNITRACE_ROCM_EVENTS = GPUBusy,Wavefronts,VALUBusy,L2CacheHit,MemUnitBusy`
* Execute: `srun -n 1 --gpus 1 omnitrace-run -- ./jacobi.inst -g 1 1`, then copy the Perfetto file and visualize it

### Kernel timings

* Open the file `omnitrace-binary-output/timestamp/wall_clock.txt` (replace binary and timestamp with your information).
* To see all the kernels gathered, make sure that your configuration file sets `OMNITRACE_USE_TIMEMORY = true` and `OMNITRACE_FLAT_PROFILE = true`, execute the code, and open the file `omnitrace-binary-output/timestamp/wall_clock.txt` again (a sketch follows below).
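The following is a sketch of that kernel-timing step, assuming the `omnitrace.cfg` sketched earlier and the `jacobi.inst` binary created in the binary-rewriting section above; the actual output directory name contains your binary name and a timestamp.

```bash
# Enable the timemory / flat-profile output in the existing config (option names from the text above)
cat >> omnitrace.cfg << 'EOF'
OMNITRACE_USE_TIMEMORY = true
OMNITRACE_FLAT_PROFILE = true
EOF

# Re-run the instrumented binary and inspect the wall-clock summary
srun -n 1 --gpus 1 omnitrace-run -- ./jacobi.inst -g 1 1
cat omnitrace-jacobi.inst-output/*/wall_clock.txt
```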
### Causal Profiling

`cp -r /project/project_462000125/exercises/causal .`

We use an example; you can see the source code in `causal/source`. The tool `causal/causal-cpu-omni` creates two threads that run a fast function and a slow one. We can define the duration of the fast routine as a fraction of the slow one; by default the fast routine takes 70% of the slow one. Execute:

```
./causal-cpu-omni
./causal-cpu-omni 50
```

Read the script `causal/run-causal-demo.sh`:

* We create a new cfg file called `causal.cfg` to activate the causal profiling and declare it in `OMNITRACE_CONFIG_FILE`.
* We define virtual speedups like `SPEEDUPS="0,0,10,20-40:5,50,60-90:15"`. This means that the profiling will investigate speedups of 0%, 0%, 10%, 20%, 25%, 30%, 35%, 40%, 50%, 60%, 75% and 90%; a range such as `20-40:5` means 20% to 40% in steps of 5%.
* First call:

```
omnitrace-causal \
    ${RESET} \
    -n 5 \
    -s ${SPEEDUPS} \
    -m func \
    -- \
    ./causal-cpu-omni "${@}"
```

We do 5 executions, testing the virtual speedups given with `-s` at function granularity (`-m func`).

* Second call:

```
omnitrace-causal \
    ${RESET} \
    -n 10 \
    -s ${SPEEDUPS} \
    -m func \
    -S "causal.cpp" \
    -o experiment.func \
    -- \
    ./causal-cpu-omni "${@}"
```

As we want to focus on specific files and not others, we use the `-S` option (source scope), and the data are written to the output file `experiment.func`.

* Third call:

```
omnitrace-causal \
    ${RESET} \
    -n 10 \
    -s ${SPEEDUPS} \
    -m line \
    -S "causal.cpp" \
    -F "cpu_(slow|fast)_func" \
    -o experiment.line \
    -- \
    ./causal-cpu-omni "${@}"
```

In this case we have 10 executions, again with speedups, but in line mode, for the same file, restricted to the functions `cpu_slow_func` and `cpu_fast_func`; the experiment data are written to the `experiment.line` file.

* Fourth call:

```
omnitrace-causal \
    ${RESET} \
    -n 2 \
    -s ${SPEEDUPS} \
    -m line \
    -S "causal.cpp" \
    -F "cpu_slow_func" "cpu_fast_func" \
    -o experiment.line.e2e \
    -e \
    -- \
    ./causal-cpu-omni "${@}"
```

The major difference here is the option `-e`, which means end-to-end: there is a single virtual speedup across the whole execution, so it can take many executions to evaluate all the virtual speedups.

Run `run-causal-demo.sh` (it can take some time); a directory `omnitrace-output/` is then created with subdirectories.

The viewer needs to be installed somewhere: `pip install omnitrace-causal-viewer`

You can copy the results to your laptop (or use SSH forwarding to LUMI) and execute:

`omnitrace-causal-plot -w omnitrace-output/causal-cpu-omni/causal/`

The results are already located in `causal/results`.

You can also visualize the results from the web page https://plasma-umass.org/coz/ by loading the JSON files.

## Omniperf

* On uan1:

```
module load cray-python
export PATH=/scratch/omniperf/1.0.8/bin/:$PATH
export PYTHONPATH=/scratch/omniperf/python:$PYTHONPATH
```

* Load Omniperf:

```
module load cray-python
module load LUMI/22.08 partition/G rocm/5.3.3
module use /project/project_465000532/software/omniperf_rocm533/modulefiles/
module load omniperf/1.0.8
```

* Reserve a GPU, compile the exercise and execute Omniperf; observe how many times the code is executed (Omniperf replays the run to collect the different counter sets):

```
salloc -N 1 --ntasks=1 --partition=small-g --gpus=1 -A project_462000125 --time=00:30:00

cp -r /project/project_465000532/exercises/HPCTrainingExamples/ .
cd HPCTrainingExamples/HIP/dgemm/
mkdir build
cd build
cmake ..
make
cd bin
srun -n 1 --gpus 1 omniperf profile -n dgemm -- ./dgemm -m 8192 -n 8192 -k 8192 -i 1 -r 10 -d 0 -o dgemm.csv
```

### Execution with all the analysis

* Run `srun -n 1 --gpus 1 omniperf profile -h` to see all the options.
* A workload is now created in the `workloads` directory under the name `dgemm` (the argument of `-n`), so we can analyze it:

```
srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/mi200/ &> dgemm_analyze.txt
```

### Roofline only execution

* If you want only the roofline analysis, then execute: `srun -n 1 omniperf profile -n dgemm --roof-only -- ./dgemm -m 8192 -n 8192 -k 8192 -i 1 -r 10 -d 0 -o dgemm.csv`
* When you use the `--roof-only` option, two roofline PDF files are created, one for FP32/FP64 and one for FP16/INT8.
* If you want another PDF file with the kernel names, add `--kernel-names` to the above command; this creates an additional PDF with the markers and their names (see the example below).
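For example, combining the two roofline options mentioned above into one command (the `dgemm` arguments are the same as in the profiling step):

```bash
# Roofline-only profiling with kernel-name markers in the generated PDFs
srun -n 1 omniperf profile -n dgemm --roof-only --kernel-names -- \
    ./dgemm -m 8192 -n 8192 -k 8192 -i 1 -r 10 -d 0 -o dgemm.csv
```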
```
srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/mi200/ &> dgemm_analyze.txt
```

There is no need for `srun` in the analysis step, but we use it to avoid everyone working on the login node. Explore the file `dgemm_analyze.txt`.

* We can select specific IP blocks, like:

```
srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/mi200/ -b 7.1.2
```

But you need to know the code of the IP block.

### Visualize

* If you have installed Omniperf on your laptop (no ROCm is required for analysis), then you can download the data and execute:

```
omniperf analyze -p workloads/dgemm/mi200/ --gui
```

* Open the web page http://IP:8050/ (the IP will be displayed in the output).

### Too large profiling data

* Use a tool like rocprof to get the top 10 kernels.
* Then add the option `-k 0` to the above commands, where 0 is the `id` of one of the top kernels, or give the kernel name instead. This way you analyze only a specific kernel. This has to be done once per kernel, so 10 times to analyze the top 10 kernels and visualize each of them separately. A sketch is shown below.
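A sketch of that per-kernel workflow, assuming the dgemm workload from above; the rocprof stats file name (`results.stats.csv` here) and its ordering may differ between ROCm versions.

```bash
# 1. Get a kernel ranking with rocprof (--stats writes results.csv / results.stats.csv by default)
srun -n 1 --gpus 1 rocprof --stats ./dgemm -m 8192 -n 8192 -k 8192 -i 1 -r 10 -d 0 -o dgemm.csv
head results.stats.csv          # top kernels by total duration

# 2. Analyze only one kernel from the Omniperf workload (id 0 here; repeat per kernel of interest)
srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/mi200/ -k 0 &> dgemm_kernel0_analyze.txt
```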