We assume that you have already allocated resources with salloc
cp -r /projappl/project_465000388/exercises/AMD/HIP-Examples/ .
salloc -N 1 -p small-g --gpus=1 -t 10:00 -A project_465000388
cd HIP-Examples/vectorAdd
Examine the files here: README, Makefile, and vectoradd_hip.cpp. Notice that the Makefile requires HIP_PATH to be set; check it with module show rocm or echo $HIP_PATH. Also note that the Makefile both builds and runs the code; we'll do those steps separately. Check the HIPFLAGS in the Makefile as well.
We can use a SLURM submission script; let's call it hip_batch.sh:
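A minimal sketch of what hip_batch.sh could look like (the module name rocm and the Makefile target vectoradd_hip.exe are assumptions to verify against the Makefile):
#!/bin/bash
#SBATCH -p small-g
#SBATCH -N 1
#SBATCH --gpus=1
#SBATCH -t 10:00
#SBATCH -A project_465000388

module load rocm
make vectoradd_hip.exe
srun ./vectoradd_hip.exe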
Submit the script
sbatch hip_batch.sh
Check for output in slurm-<job-id>.out or errors in slurm-<job-id>.err
Compile and run with the Cray compiler:
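A sketch of one way to do this on LUMI (module names and extra flags are assumptions; the Cray CC wrapper is clang-based, so -x hip selects HIP compilation and the craype-accel module supplies the GPU target):
module load PrgEnv-cray craype-accel-amd-gfx90a rocm
CC -x hip vectoradd_hip.cpp -o vectoradd_cray
srun ./vectoradd_cray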
Now let’s try the cuda-stream example. This example is from the original McCalpin code as ported to CUDA by Nvidia. This version has been ported to use HIP. See add4 for another similar stream example.
Note that it builds with the hipcc compiler. You should get a report of the Copy, Scale, Add, and Triad cases.
Note that we need to declare the target GPU for the MI250X, which is --offload-arch=gfx90a.
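A sketch of building and running it from the vectorAdd directory (assuming the Makefile uses hipcc and produces a binary named stream):
cd ../cuda-stream
make
srun ./stream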
We’ll use the same HIP-Examples that were downloaded for the first exercise. Get a node allocation.
Choose one or more of the CUDA samples in the HIP-Examples/mini-nbody/cuda directory and manually convert them to HIP. Tip: cudaMalloc, for example, becomes hipMalloc (see the mapping sketch after this list).
Some code suggestions include mini-nbody/cuda/<nbody-block.cu,nbody-orig.cu,nbody-soa.cu>
The CUDA samples are located in HIP-Examples/mini-nbody/cuda
Manually convert the source code of your choice to HIP
You’ll want to compile on the node you’ve been allocated so that hipcc will choose the correct GPU architecture.
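A minimal mapping sketch for the CUDA calls these samples typically use (check each sample for the exact set; the renames are one-to-one):
cudaMalloc -> hipMalloc
cudaMemcpy -> hipMemcpy (and cudaMemcpyHostToDevice -> hipMemcpyHostToDevice, etc.)
cudaFree -> hipFree
cudaDeviceSynchronize -> hipDeviceSynchronize
The kernel launch syntax kernel<<<blocks, threads>>>(args) is accepted as-is by hipcc.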
Use hipify-perl -inplace -print-stats to “hipify” the CUDA samples you manually converted to HIP in Exercise 1. The hipify-perl script is in the $ROCM_PATH/hip/bin directory and should be in your path.
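For example, starting from a fresh copy of the CUDA source (an assumption: -inplace overwrites the file and keeps the original with a .prehip suffix):
hipify-perl -inplace -print-stats nbody-orig.cu
The -print-stats option reports how many CUDA calls were translated and whether any were left unconverted.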
Compile the HIP programs
Fix any compiler issues, for example, if there was something that didn’t hipify correctly.
Be on the lookout for hard-coded Nvidia specific things like warp sizes and PTX.
For the nbody-orig.cu code, compile with hipcc -DSHMOO -I ../ nbody-orig.cu -o nbody-orig. The #define SHMOO fixes some timer printouts. Add --offload-arch=<gpu_type> to specify the GPU type and avoid autodetection issues when running on a single GPU on a node.
Run the programs.
HIPFort is not installed by default. If you want to install it, follow these instructions:
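A sketch of one way to build it, based on the upstream hipfort README (the clone directory name hipfort-source matches the path used below; the install location is up to you):
git clone https://github.com/ROCmSoftwarePlatform/hipfort hipfort-source
mkdir hipfort-build; cd hipfort-build
cmake -DHIPFORT_INSTALL_DIR=$PWD/../hipfort-install ../hipfort-source
make install
export PATH=$PWD/../hipfort-install/bin:$PATH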
Compile and execute the HIPFort example:
cd hipfort-source/test/f2003/vecadd
hipfc -v --offload-arch=gfx90a hip_implementation.cpp main.f03
srun -n 1 --gpus 1 ./a.out
The first exercise will be the same as the one covered in the presentation so that we
can focus on the mechanics. Then there will be additional exercises to explore further
or you can start debugging your own applications.
Get the exercise: git clone https://github.com/AMD/HPCTrainingExamples.git
Go to HPCTrainingExamples/HIP/saxpy
Edit the saxpy.cpp file and comment out the two hipMalloc lines.
Add a synchronization after the kernel call, as sketched below.
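A minimal sketch of the change (the kernel and variable names depend on the current saxpy.cpp): add
hipDeviceSynchronize();
immediately after the saxpy kernel launch, so the host waits for the kernel to finish (and its error becomes visible) before the results are used.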
Now let's try using rocgdb to find the error.
Compile the code with
hipcc --offload-arch=gfx90a -o saxpy saxpy.cpp
srun ./saxpy
Output
How do we find the error? Let's start up the debugger. First, we’ll recompile the code to help the debugging process. We also set the number of CPU OpenMP threads to reduce the number of threads seen by the debugger.
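A sketch of the recompile and environment setup (the debug flags are standard clang/gdb ones; adjust as needed):
hipcc -ggdb -O0 --offload-arch=gfx90a -o saxpy saxpy.cpp
export OMP_NUM_THREADS=1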
We have two options for running the debugger. We can use an interactive session, or we can just simply use a regular srun command.
srun rocgdb saxpy
The interactive approach uses:
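One possible form, shown as a sketch (the exact interactive options depend on the site setup; --jobid and --pty are standard Slurm options):
srun --jobid=<jobid> --pty rocgdb saxpy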
We need to supply the jobid if we have more than one job so that it knows which to use.
We can also choose to use a Text User Interface (TUI) or a Graphical User Interface (GUI). Let's look to see what is available.
We have the TUI interface for rocgdb. We need an interactive session on the compute node to run with this interface. We do this by using the following command.
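A sketch of such a command (assuming rocgdb keeps gdb's --tui option):
srun --jobid=<jobid> --pty rocgdb --tui saxpy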
The following is based on using the standard gdb interface. Using the TUI or GUI interfaces should be similar.
You should see some output like the following once the debugger starts.
Now it is waiting for us to tell it what to do. We'll go for broke and just type run
The line number 31 is a clue. Now let’s dive a little deeper by getting the GPU thread trace
Note that the GPU threads are also shown! Switch to thread 1 (CPU)
where
…
From here we can investigate the input to the kernel and see that the memory has not been allocated.
Restart the program in the debugger.
The compiler must have optimized out some lines. We want to stop at the start of the routine, before the allocations.
Better!
We should have initialized the pointer to NULL!
Prints out the next 5 values pointed to by x
Random values printed out – not initialized!
Continue until we reach line 31.
We can see that there are multiple problems with this kernel. X and Y are not initialized. Each value of X is multiplied by 1.0 and then added to the existing value of Y.
Additional exercises:
git clone https://github.com/AMD/HPCTrainingExamples.git
We assume you have reserved resources with salloc
You can compile and run all the cases with the provided scripts, or just compile and run one case:
HIP-nbody-orig.sh
or by watching the output from running the script. The binary is called nbody-orig.
Use rocprof with --stats
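For example, on the nbody-orig binary built above:
srun -n 1 --gpus 1 rocprof --stats ./nbody-orig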
Files with the prefix results are created
You can see information for each kernel call with their duration
Now add --hip-trace to the rocprof command (a sketch follows below). New files with hip in their name are created; check the file results.hip_stats.csv.
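A sketch of the command (again using the nbody-orig binary):
srun -n 1 --gpus 1 rocprof --stats --hip-trace ./nbody-orig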
Similarly, running with --hsa-trace produces results.hsa_stats.csv.
From your laptop:
scp username@lumi.csc.fi:/path/results.json .
Click “Open trace file” in the top-left menu.
Select the file results.json
Zoom in/out: W/S
Move left/right: A/D
Read about the counters: vim /opt/rocm/rocprofiler/lib/gfx_metrics.xml
We have made special builds of the Omni tools, omnitrace and omniperf, for use in the exercises.
Reserve a GPU.
Load Omnitrace by declaring PATH and LD_LIBRARY_PATH. It is temporarily installed here: /project/project_465000388/software/omnitrace/1.7.3/
Execute:
export PATH=/project/project_465000388/software/omnitrace/1.7.3/bin:$PATH
export LD_LIBRARY_PATH=/project/project_465000388/software/omnitrace/1.7.3/lib:$LD_LIBRARY_PATH
Allocate resources with salloc
Check the various options and their values; the second command below also prints a brief description of each:
srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace
srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace --brief --description
Generate a configuration file containing all the settings:
srun -n 1 omnitrace-avail -G omnitrace_all.cfg --all
Declare to use this configuration file: export OMNITRACE_CONFIG_FILE=/path/omnitrace_all.cfg
Get the file https://github.com/ROCm-Developer-Tools/HIP/tree/develop/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp
or cp /project/project_465000388/exercises/AMD/MatrixTranspose.cpp .
Compile: hipcc --offload-arch=gfx90a -o MatrixTranspose MatrixTranspose.cpp
Execute the binary: time srun -n 1 --gpus 1 ./MatrixTranspose
and check the duration
time srun -n 1 --gpus 1 omnitrace -- ./MatrixTranspose
and check the duration.
Check the functions in the binary: nm --demangle MatrixTranspose | egrep -i ' (t|u) '
srun -n 1 --gpus 1 omnitrace -v -1 --simulate --print-available functions -- ./MatrixTranspose
Binary rewriting: srun -n 1 --gpus 1 omnitrace -v -1 --print-available functions -o matrix.inst -- ./MatrixTranspose
Executing the new instrumented binary: time srun -n 1 --gpus 1 ./matrix.inst
and check the duration
See the list of the instrumented GPU calls: cat omnitrace-matrix.inst-output/TIMESTAMP/roctracer.txt
Copy the file perfetto-trace.proto (in the same omnitrace output directory) to your laptop, open the web page https://ui.perfetto.dev/, click to open the trace, and select the file.
Check the available hardware counters: srun -n 1 --gpus 1 omnitrace-avail --all
OMNITRACE_ROCM_EVENTS = GPUBusy,Wavefronts,VALUBusy,L2CacheHit,MemUnitBusy
srun -n 1 --gpus 1 ./matrix.inst
and copy the perfetto file and visualize it.
Activate in your configuration file OMNITRACE_USE_SAMPLING = true and OMNITRACE_SAMPLING_FREQ = 100, then execute and visualize again.
Check the file omnitrace-binary-output/timestamp/wall_clock.txt (replace binary and timestamp with your information).
Activate in your configuration file OMNITRACE_USE_TIMEMORY = true and OMNITRACE_FLAT_PROFILE = true, execute the code, and open again the file omnitrace-binary-output/timestamp/wall_clock.txt.
We have built Omniperf without GUI support for use in the exercises.
Load Omniperf:
export PATH=/project/project_465000388/software/omniperf/bin/:$PATH
export PYTHONPATH=/project/project_465000388/software/omniperf/python-libs:$PYTHONPATH
module load cray-python
module load rocm
Reserve a GPU, compile the exercise, and execute Omniperf; observe how many times the code is executed.
Run srun -n 1 --gpus 1 omniperf profile -h to see all the options.
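A sketch of a basic profiling run (assuming the dgemm binary and the arguments used in the roofline command below):
srun -n 1 --gpus 1 omniperf profile -n dgemm -- ./dgemm -m 8192 -n 8192 -k 8192 -i 1 -r 10 -d 0 -o dgemm.csv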
A workload is now created in the workloads directory under the name dgemm (the argument of -n), so we can analyze it; see the analyze sketch below. To collect only the roofline data:
srun -n 1 --gpus 1 omniperf profile -n dgemm --roof-only -- ./dgemm -m 8192 -n 8192 -k 8192 -i 1 -r 10 -d 0 -o dgemm.csv
srun is not strictly needed for the analyze step, but we use it to avoid everyone running on the login node. Explore the file dgemm_analyze.txt.
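A hedged sketch of the analyze command that produces that file (the workload subdirectory name depends on the GPU model, e.g. mi200):
srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/mi200 &> dgemm_analyze.txt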
You can also restrict the analysis to specific hardware (IP) blocks, but you need to know the code of the IP block.
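A sketch of the block filter and of the standalone GUI mode (the block code and the mi200 path are placeholders; check omniperf analyze -h for the exact options in this build):
srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/mi200 -b 7.1.2
srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/mi200 --gui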
Open the web page http://IP:8050/; the IP will be displayed in the output.
Use another code, for example: https://github.com/amd/HPCTrainingExamples/blob/main/HIP/saxpy/saxpy.cpp