George Markomanolis
    • Create new note
    • Create a note from template
      • Sharing URL Link copied
      • /edit
      • View mode
        • Edit mode
        • View mode
        • Book mode
        • Slide mode
        Edit mode View mode Book mode Slide mode
      • Customize slides
      • Note Permission
      • Read
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Write
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Engagement control Commenting, Suggest edit, Emoji Reply
    • Invite by email
      Invitee

      This note has no invitees

    • Publish Note

      Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

      Your note will be visible on your profile and discoverable by anyone.
      Your note is now live.
      This note is visible on your profile and discoverable online.
      Everyone on the web can find and read all notes of this public team.
      See published notes
      Unpublish note
      Please check the box to agree to the Community Guidelines.
      View profile
    • Commenting
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
      • Everyone
    • Suggest edit
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
    • Emoji Reply
    • Enable
    • Versions and GitHub Sync
    • Note settings
    • Note Insights
    • Engagement control
    • Transfer ownership
    • Delete this note
    • Save as template
    • Insert from template
    • Import from
      • Dropbox
      • Google Drive
      • Gist
      • Clipboard
    • Export to
      • Dropbox
      • Google Drive
      • Gist
    • Download
      • Markdown
      • HTML
      • Raw HTML
Menu Note settings Versions and GitHub Sync Note Insights Sharing URL Create Help
Create Create new note Create a note from template
Menu
Options
Engagement control Transfer ownership Delete this note
Import from
Dropbox Google Drive Gist Clipboard
Export to
Dropbox Google Drive Gist
Download
Markdown HTML Raw HTML
Back
Sharing URL Link copied
/edit
View mode
  • Edit mode
  • View mode
  • Book mode
  • Slide mode
Edit mode View mode Book mode Slide mode
Customize slides
Note Permission
Read
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Write
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Engagement control Commenting, Suggest edit, Emoji Reply
  • Invite by email
    Invitee

    This note has no invitees

  • Publish Note

    Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

    Your note will be visible on your profile and discoverable by anyone.
    Your note is now live.
    This note is visible on your profile and discoverable online.
    Everyone on the web can find and read all notes of this public team.
    See published notes
    Unpublish note
    Please check the box to agree to the Community Guidelines.
    View profile
    Engagement control
    Commenting
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    • Everyone
    Suggest edit
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    Emoji Reply
    Enable
    Import from Dropbox Google Drive Gist Clipboard
       owned this note    owned this note      
    Published Linked with GitHub
    Subscribed
    • Any changes
      Be notified of any changes
    • Mention me
      Be notified of mention me
    • Unsubscribe
    Subscribe
    # LUMI-G training (16/02/2023) We assume that you have already allocated resources with `salloc` `cp -r /projappl/project_465000388/exercises/AMD/HIP-Examples/ .` `salloc -N 1 -p small-g --gpus=1 -t 10:00 -A project_465000388` ``` module rm rocm module load craype-accel-amd-gfx90a module load PrgEnv-amd module load rocm ``` ## Basic examples `cd HIP-Examples/vectorAdd` Examine files here – README, Makefile and vectoradd_hip.cpp Notice that Makefile requires HIP_PATH to be set. Check with module show rocm or echo $HIP_PATH Also, the Makefile builds and runs the code. We’ll do the steps separately. Check also the HIPFLAGS in the Makefile. ``` make vectoradd_hip.exe srun -n 1 ./vectoradd_hip.exe ``` We can use SLURM submission script, let's call it `hip_batch.sh`: ``` #!/bin/bash #SBATCH -p small-g #SBATCH -N 1 #SBATCH --gpus=1 #SBATCH -t 10:00 #SBATCH -A project_465000388 module load craype-accel-amd-gfx90a module load rocm cd $HOME/HPCTrainingExamples/HIP/vectorAdd make vectoradd_hip.exe srun -n 1 --gpus 1 ./vectoradd_hip.exe ``` Submit the script `sbatch hip_batch.sh` Check for output in `slurm-<job-id>.out` or error in `slurm-<job-id>.err` Compile and run with Cray compiler ``` CC -x hip vectoradd_hip.cpp -o vectoradd_hip.exe srun -n 1 --gpus 1 ./vectoradd_hip.exe ``` Now let’s try the cuda-stream example. This example is from the original McCalpin code as ported to CUDA by Nvidia. This version has been ported to use HIP. See add4 for another similar stream example. ``` cd HIP-Examples/cuda-stream make srun -n 1 ./stream ``` Note that it builds with the hipcc compiler. You should get a report of the Copy, Scale, Add, and Triad cases. Check that we need to declare target GPU for MI250x, is the `--offload-arch=gfx90a` ## Hipify example We’ll use the same HIP-Examples that were downloaded for the first exercise Get a node allocation. ### Exercise 1: Manual code conversion from CUDA to HIP (10 min) Choose one or more of the CUDA samples in HIP-Examples/mini-nbody/cuda repository and manually convert them to HIP. Tip: for example, the cudaMalloc will be called hipMalloc. Some code suggestions include mini-nbody/cuda/<nbody-block.cu,nbody-orig.cu,nbody-soa.cu> The CUDA samples are located in HIP-Examples/mini-nbody/cuda Manually convert the source code of your choice to HIP You’ll want to compile on the node you’ve been allocated so that hipcc will choose the correct GPU architecture. ### Exercise 2: Code conversion from CUDA to HIP using HIPify tools (10 min) Use the `hipify-perl -inplace -print-stats` to “hipify” the CUDA samples you used to manually convert to HIP in Exercise 1. hipify-perl.sh is in `$ROCM_PATH/hip/bin` directory and should be in your path. * For example, if helloworld.cu is a CUDA program, run hipify-perl.sh -inplace –print-stats helloworld.cu. You’ll see a helloworld.cu.prehip file that is the original and the helloworld.cu file now has HIP calls. b. You’ll also see statistics of HIP APIs that were converted. For example, for hipify-perl -inplace -print-stats nbody-orig.cu: ``` [HIPIFY] info: file 'nbody-orig.cu' statistics: CONVERTED refs count: 7 TOTAL lines of code: 91 WARNINGS: 0 [HIPIFY] info: CONVERTED refs by names: cudaFree => hipFree: 1 cudaMalloc => hipMalloc: 1 cudaMemcpyDeviceToHost => hipMemcpyDeviceToHost: 1 cudaMemcpyHostToDevice => hipMemcpyHostToDevice: 1 ``` * Compile the HIP programs * Fix any compiler issues, for example, if there was something that didn’t hipify correctly. * Be on the lookout for hard-coded Nvidia specific things like warp sizes and PTX. * For the nbody-orig.cu code, compile with `hipcc -DSHMOO -I ../ nbody-orig.cu -o nbody-orig`. The #define SHMOO fixes some timer printouts. Add --offload-arch=<gpu_type> to specify the GPU type and avoid the autodetection issues when running on a single GPU on a node. * Run the programs. ## HIPFort HIPFort is not installed by default, if you want to install it, follow the instructions: ``` git clone https://github.com/ROCmSoftwarePlatform/hipfort hipfort-source mkdir hipfort-build; cd hipfort-build export HIPFORT_INSTALL_DIR=$PWD/hipfort cmake -DHIPFORT_INSTALL_DIR=${HIPFORT_INSTALL_DIR} ../hipfort-source make install export PATH=${HIPFORT_INSTALL_DIR}/bin:$PATH ``` Compile and execute HIPFort example: * `cd hipfort-source/test/f2003/vecadd` * `hipfc -v --offload-arch=gfx90a hip_implementation.cpp main.f03` * `srun -n 1 --gpus 1 ./a.out` ## Debugging The first exercise will be the same as the one covered in the presentation so that we can focus on the mechanics. Then there will be additional exercises to explore further or you can start debugging your own applications. Get the exercise: `git clone https://github.com/AMD/HPCTrainingExamples.git` Go to `HPCTrainingExamples/HIP/saxpy` Edit the `saxpy.cpp` file and comment out the two hipMalloc lines. ``` 41 //hipMalloc(&d_x, size); 42 //hipMalloc(&d_y, size); ``` Add a synchronization after the kernel call ``` 47 hipDeviceSynchronize(); ``` Now let's try using rocgdb to find the error. Compile the code with `hipcc --offload-arch=gfx90a -o saxpy saxpy.cpp` * Allocate a compute node. * Run the code `srun ./saxpy` Output ``` :0:rocdevice.cpp :2614: 283728925446 us: 24406: [tid:0x148c691ca700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_MEMORY_FAULT: Agent attempted to access an inaccessible address. code: 0x2b srun: error: nid007303: task 0: Aborted (core dumped) srun: launch/slurm: _step_signal: Terminating StepId=2831785.0 ``` How do we find the error? Let's start up the debugger. First, we’ll recompile the code to help the debugging process. We also set the number of CPU OpenMP threads to reduce the number of threads seen by the debugger. ``` hipcc -ggdb -O0 --offload-arch=gfx90a -o saxpy saxpy.cpp export OMP_NUM_THREADS=1 ``` We have two options for running the debugger. We can use an interactive session, or we can just simply use a regular srun command. `srun rocgdb saxpy` The interactive approach uses: ``` srun --interactive --pty [--jobid=<jobid>] bash rocgdb ./saxpy ``` We need to supply the jobid if we have more than one job so that it knows which to use. We can also choose to use one of the Text User Interfaces (TUI) or Graphics User Interfaces (GUI). We look to see what is available. ``` which cgdb -- not found -- run with cgdb -d rocgdb <executable> which ddd -- not found -- run with ddd --debugger rocgdb which gdbgui -- not found -- run with gdbgui --gdb-cmd /opt/rocm/bin/rocgdb rocgdb –tui -- found ``` We have the TUI interface for rocgdb. We need an interactive session on the compute node to run with this interface. We do this by using the following command. ``` srun --interactive --pty [-jobid=<jobid>] bash rocgdb -tui ./saxpy ``` The following is based on using the standard gdb interface. Using the TUI or GUI interfaces should be similar. You should see some output like the following once the debugger starts. ``` [output] GNU gdb (rocm-rel-5.1-36) 11.2 Copyright (C) 2022 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-pc-linux-gnu". Type "show configuration" for configuration details. For bug reporting instructions, please see: <https://github.com/ROCm-Developer-Tools/ROCgdb/issues>. Find the GDB manual and other documentation resources online at: <http://www.gnu.org/software/gdb/documentation/>. For help, type "help". Type "apropos word" to search for commands related to "word"... Reading symbols from ./saxpy... ``` Now it is waiting for us to tell it what to do. We'll go for broke and just type `run` ``` (gdb) run [output] Thread 3 "saxpy" received signal SIGSEGV, Segmentation fault.[Switching to thread 3, lane 0 (AMDGPU Lane 1:2:1:1/0 (0,0,0)[0,0,0])] 0x000015554a001094 in saxpy (n=<optimized out>, x=<optimized out>, incx=<optimized out>, y=<optimized out>, incy=<optimized out>) at saxpy.cpp:31 31 y[i] += a*x[i]; ``` The line number 31 is a clue. Now let’s dive a little deeper by getting the GPU thread trace ``` (gdb) info threads [ shorthand - i th ] [output] Id Target Id Frame 1 Thread 0x15555552d300 (LWP 40477) "saxpy" 0x000015554b67ebc9 in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1 2 Thread 0x15554a9ac700 (LWP 40485) "saxpy" 0x00001555533e1c47 in ioctl () from /lib64/libc.so.6 * 3 AMDGPU Wave 1:2:1:1 (0,0,0)/0 "saxpy" 0x000015554a001094 in saxpy ( n=<optimized out>, x=<optimized out>, incx=<optimized out>, y=<optimized out>, incy=<optimized out>) at saxpy.cpp:31 4 AMDGPU Wave 1:2:1:2 (0,0,0)/1 "saxpy" 0x000015554a001094 in saxpy ( n=<optimized out>, x=<optimized out>, incx=<optimized out>, y=<optimized out>, incy=<optimized out>) at saxpy.cpp:31 5 AMDGPU Wave 1:2:1:3 (1,0,0)/0 "saxpy" 0x000015554a001094 in saxpy ( n=<optimized out>, x=<optimized out>, incx=<optimized out>, y=<optimized out>, incy=<optimized out>) at saxpy.cpp:31 6 AMDGPU Wave 1:2:1:4 (1,0,0)/1 "saxpy" 0x000015554a001094 in saxpy ( n=<optimized out>, x=<optimized out>, incx=<optimized out>, y=<optimized out>, incy=<optimized out>) at saxpy.cpp:31 ``` Note that the GPU threads are also shown! Switch to thread 1 (CPU) ``` (gdb) thread 1 [ shorthand - t 1] [output] [Switching to thread 1 (Thread 0x15555552d300 (LWP 47136))] #0 0x000015554b67ebc9 in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1 ``` where ... ``` #12 0x0000155553b5b419 in hipDeviceSynchronize () from /opt/rocm/lib/libamdhip64.so.5 #13 0x000000000020d6fd in main () at saxpy.cpp:47 (gdb) break saxpy.cpp:44 [ shorthand – b saxpy.cpp:46] [output] Breakpoint 2 at 0x21a830: file saxpy.cpp, line 46 (gdb) run [ shorthand – r ] Breakpoint 1, main () at saxpy.cpp:46 46 saxpy<<<num_groups, group_size>>>(n, d_x, 1, d_y, 1); ``` From here we can investigate the input to the kernel and see that the memory has not been allocated. Restart the program in the debugger. ``` srun --interactive --pty [-jobid=<jobid>] rocgdb ./saxpy (gdb) list 22,42 (gdb) b 39 [output] Breakpoint 1 at 0x219ea2: file saxpy.cpp, line 44. ``` Must have optimized out some lines. We want to stop at the start of the routine before the allocations. ``` (gdb) b main Breakpoint 2 at 0x219ea2: file saxpy.cpp, line 36. ``` Better! ``` (gdb) run [output] Starting program ... ... Breakpoint 2, main() at saxpy.cpp:36 36 int n=256; (gdb) p d_x [output] $1 = (float *) 0x219cd0 <_start> ``` Should have intialized the pointer to NULL! ``` (gdb) n [output] 37 std::size_t size = sizeof(float)*n; (gdb) n [output] Breakpoint 1, main () at saxpy.cpp:44 44 int num_groups = 2; (gdb) p d_x [output] $2 = (float *) 0x219cd0 <_start> (gdb) p *x@5 ``` Prints out the next 5 values pointed to by x ``` [output] $3 = {-2.43e-33, 2.4e-33, -1.93e22, 556, 2.163e-36} ``` Random values printed out – not initialized! ``` (gdb) b 30 (gdb) c [output] Thread 5 “saxpy” hit Breakpoint 3 …. 30 if (i < n) (gdb) info threads Shows both CPU and GPU threads (gdb) p x [output] $4 = (const float *) 0x219cd0 <_start> (gdb) p *x@5 [output] $5 = {-2.43e-33, 2.4e-33, -1.93e22, 556, 2.163e-36} (gdb) n (gdb) n (gdb) n ``` Until reach line 31 ``` (gdb) p I [output] $6 = 0 (gdb) p y[0] [output] $7 = -2.12e14 (gdb) p x[0] [output] $8 = -2.43e-33 (gdb) p a [output] $9 = 1 ``` We can see that there are multiple problems with this kernel. X and Y are not initialized. Each value of X is multiplied by 1.0 and then added to the existing value of Y. Additional exercises: 1. Fix this code and step through it with rocgdb to verify that it is working correctly. 2. Pick one of the OpenMP (or OpenACC) examples in HPCTrainingExamples/Pragma_Examples/ [OpenMP|OpenACC]/Make/C/[reduction|saxpy|vecadd] and run through it to verify correctness. Change the code to cause an error and find it with the debugger. (See Stubbs video for an introduction to debugging OpenMP or OpenACC code. Hint: it is not that much different). ## Profiling Tools ### Rocprof * Get the exercise: `git clone https://github.com/AMD/HPCTrainingExamples.git` We assume you have reserved resources with `salloc` * Compile and run the code ``` cd HPCTrainingExamples/HIPIFY/mini-nbody/hip ``` Can compile and run all with ``` ./HIP-nbody-orig.sh ``` Or just run compile and run one case ``` hipcc --offload-arch=gfx90a -I../ -DSHMOO nbody-orig.cpp -o nbody-orig srun -n 1 ./nbody-orig 65536 65536, 161.871 ``` * Note that we can find the compile line and the executable name by checking the file `HIP-nbody-orig.sh` or by watching the output from running the script. ``` cat HIP-nbody-orig.sh ... EXE=nbody-orig ... ./$EXE 65536 ... ``` * The binary is called `nbody-orig` * Use rocprof with `--stats` ``` srun -n 1 rocprof --stats nbody-orig 65536 RPL: on '221130_200946' from '/global/software/rocm/rocm-5.3.0' in '/global/home/gmarko/HIP-Examples/mini-nbody/hip' RPL: profiling '"nbody-orig" "65536"' RPL: input file '' RPL: output dir '/tmp/rpl_data_221130_200946_3670592' RPL: result dir '/tmp/rpl_data_221130_200946_3670592/input_results_221130_200946' ROCProfiler: input from "/tmp/rpl_data_221130_200946_3670592/input.xml" 0 metrics 65536, 159.960 ROCPRofiler: 10 contexts collected, output directory /tmp/rpl_data_221130_200946_3670592/input_results_221130_200946 File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.csv' is generating File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.stats.csv' is generating ``` Files with the prefix results are created * Check the files results.csv You can see information for each kernel call with their duration ``` cat results.csv "Index","KernelName","gpu-id","queue-id","queue-index","pid","tid","grd","wgr","lds","scr","arch_vgpr","accum_vgpr","sgpr","wave_size","sig","obj","DispatchNs","BeginNs","EndNs","CompleteNs","DurationNs" 0,"bodyForce(Body*, float, int) [clone .kd]",0,0,0,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372809346673,1591372809872935,1591372836189584,1591372836215944,26316649 1,"bodyForce(Body*, float, int) [clone .kd]",0,0,2,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372836774261,1591372837000949,1591372863116796,1591372863132315,26115847 2,"bodyForce(Body*, float, int) [clone .kd]",0,0,4,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372863652552,1591372863877281,1591372889980009,1591372889994436,26102728 3,"bodyForce(Body*, float, int) [clone .kd]",0,0,6,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372890512133,1591372890735562,1591372916796147,1591372916817087,26060585 4,"bodyForce(Body*, float, int) [clone .kd]",0,0,8,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372917332974,1591372917556629,1591372943652575,1591372943667909,26095946 5,"bodyForce(Body*, float, int) [clone .kd]",0,0,10,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372944181896,1591372944405378,1591372970475883,1591372970491020,26070505 6,"bodyForce(Body*, float, int) [clone .kd]",0,0,12,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372971009527,1591372971233309,1591372997318181,1591372997339821,26084872 7,"bodyForce(Body*, float, int) [clone .kd]",0,0,14,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591372997856209,1591372998080743,1591373024164495,1591373024180993,26083752 8,"bodyForce(Body*, float, int) [clone .kd]",0,0,16,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591373024701060,1591373024924818,1591373051023611,1591373051040364,26098793 9,"bodyForce(Body*, float, int) [clone .kd]",0,0,18,3670615,3670615,65536,256,0,0,20,4,16,64,"0x0","0x7f7b27c04500",1591373051559851,1591373051782878,1591373077878145,1591373077902255,26095267 ``` * Check the statistics result file, one line per kernel ``` cat results.stats.csv "Name","Calls","TotalDurationNs","AverageNs","Percentage" "bodyForce(Body*, float, int) [clone .kd]",10,261124944,26112494,100.0 ``` * Profile the HIP calls with `--hip-trace` ``` srun -n 1 rocprof --stats --hip-trace nbody-orig 65536 RPL: on '221130_201416' from '/global/software/rocm/rocm-5.3.0' in '/global/home/gmarko/HIP-Examples/mini-nbody/hip' RPL: profiling '"nbody-orig" "65536"' RPL: input file '' RPL: output dir '/tmp/rpl_data_221130_201416_3670892' RPL: result dir '/tmp/rpl_data_221130_201416_3670892/input_results_221130_201416' ROCTracer (pid=3670915): HIP-trace() 65536, 161.051 hsa_copy_deps: 0 scan ops data 29:30 File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.copy_stats.csv' is generating dump json 19:20 File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.json' is generating File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.hip_stats.csv' is generating dump json 51:52 File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.json' is generating File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.stats.csv' is generating dump json 9:10 File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.json' is generating ``` Now we have new files with the `hip` in their name like below, check the file `results.hip_stats.csv` ``` cat results.hip_stats.csv "Name","Calls","TotalDurationNs","AverageNs","Percentage" "hipMemcpy",20,486845521,24342276,99.89375113830629 "hipLaunchKernel",10,467008,46700,0.09582337501179998 "hipMalloc",1,30570,30570,0.006272527610042495 "hipFree",1,14210,14210,0.0029156891507590398 "__hipPushCallConfiguration",10,3510,351,0.0007202018943817191 "__hipPopCallConfiguration",10,2520,252,0.0005170680267355932 ``` * Profile also the HSA API with the `--hsa-trace` ``` srun -n 1 rocprof --stats --hip-trace --hsa-trace nbody-orig 65536 RPL: on '221130_201737' from '/global/software/rocm/rocm-5.3.0' in '/global/home/gmarko/HIP-Examples/mini-nbody/hip' RPL: profiling '"nbody-orig" "65536"' RPL: input file '' RPL: output dir '/tmp/rpl_data_221130_201737_3671219' RPL: result dir '/tmp/rpl_data_221130_201737_3671219/input_results_221130_201737' ROCProfiler: input from "/tmp/rpl_data_221130_201737_3671219/input.xml" 0 metrics ROCTracer (pid=3671242): HSA-trace() HSA-activity-trace() HIP-trace() 65536, 155.978 ROCPRofiler: 10 contexts collected, output directory /tmp/rpl_data_221130_201737_3671219/input_results_221130_201737 hsa_copy_deps: 1 scan hsa API data 5953:5954 hsa_copy_deps: 0 scan hip API data 51:52 File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.csv' is generating File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.stats.csv' is generating dump json 9:10 File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.json' is generating File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.hsa_stats.csv' is generating dump json 5963:5964 File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.json' is generating File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.copy_stats.csv' is generating dump json 19:20 File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.json' is generating File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.hip_stats.csv' is generating dump json 51:52 File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/results.json' is generating ``` * See the content of the file `results.hsa_stats.csv` ``` cat results.hsa_stats.csv "Name","Calls","TotalDurationNs","AverageNs","Percentage" "hsa_signal_wait_scacquire",50,264955082,5299101,82.69977799679005 "hsa_queue_create",1,39868279,39868279,12.443987854568068 "hsa_amd_memory_async_copy",20,4917141,245857,1.5347751249357586 "hsa_amd_signal_async_handler",20,4262555,213127,1.3304608069344652 "hsa_signal_store_screlease",40,1945998,48649,0.60739956889069 "hsa_amd_memory_lock_to_pool",20,1418202,70910,0.44265990170591873 "hsa_amd_memory_unlock",20,723957,36197,0.22596691758953363 "hsa_agent_get_info",80,671007,8387,0.20943976433821374 "hsa_amd_memory_pool_allocate",5,597926,119585,0.18662917157599068 "hsa_system_get_info",5005,367139,73,0.11459419296574767 "hsa_executable_load_agent_code_object",2,216629,108314,0.0676158768966984 "hsa_executable_freeze",2,156380,78190,0.048810504729771616 "hsa_amd_agents_allow_access",4,89470,22367,0.02792605101785821 "hsa_signal_create",57,51500,903,0.016074568318092074 "hsa_code_object_reader_create_from_memory",2,20940,10470,0.006535950690890253 "hsa_isa_get_info_alt",2,18180,9090,0.005674478680056581 "hsa_signal_load_relaxed",236,17880,75,0.005580840418009442 "hsa_system_get_major_extension_table",3,14920,4973,0.004656942899144344 "hsa_amd_profiling_get_async_copy_time",40,13430,335,0.004191872864310224 "hsa_executable_create_alt",2,9030,4515,0.0028185116876188626 "hsa_amd_memory_pool_get_info",106,7570,71,0.002362805478989456 "hsa_amd_agent_iterate_memory_pools",27,7440,275,0.0023222288987690297 "hsa_amd_memory_pool_free",1,4710,4710,0.0014701207141400712 "hsa_executable_get_symbol_by_name",15,3599,239,0.0011233470170255023 "hsa_queue_add_write_index_screlease",20,3500,175,0.0010924463905499467 "hsa_amd_profiling_get_dispatch_time",20,3110,155,0.0009707166498886669 "hsa_signal_silent_store_relaxed",40,2800,70,0.0008739571124399574 "hsa_amd_agent_memory_pool_get_info",19,2490,131,0.0007771975749912477 "hsa_iterate_agents",2,2250,1125,0.0007022869653535371 "hsa_queue_load_read_index_relaxed",20,2240,112,0.0006991656899519659 "hsa_signal_destroy",20,2190,109,0.0006835593129441095 "hsa_queue_load_read_index_scacquire",20,1790,89,0.0005587082968812585 "hsa_executable_symbol_get_info",30,1540,51,0.00048067641184197657 "hsa_amd_profiling_async_copy_enable",1,440,440,0.00013733611766913615 "hsa_agent_iterate_isas",1,400,400,0.00012485101606285104 "hsa_amd_profiling_set_profiler_enabled",1,140,140,4.3697855621997866e-05 "hsa_dispatch",10,0,0,0.0 ``` * Download the results.json file on your laptop From your laptop: `scp username@lumi.csc.fi:/path/results.json .` * Visit the web page: https://ui.perfetto.dev/ * Click on the top left menu, "Open Trace File on the left top" * Select the file results.json Zoom in/out: W/S Move left/right: A/D ![](https://i.imgur.com/ZKgVBKI.png) Read about the counters: ` vim /opt/rocm/rocprofiler/lib/gfx_metrics.xml` * Create a file with the contents: ``` cat rocprof_counters.txt pmc : Wavefronts VALUInsts VFetchInsts VWriteInsts VALUUtilization VALUBusy WriteSize pmc : SALUInsts SFetchInsts LDSInsts FlatLDSInsts GDSInsts SALUBusy FetchSize pmc : L2CacheHit MemUnitBusy MemUnitStalled WriteUnitStalled ALUStalledByLDS LDSBankConflict ``` * Execute with using the counters ``` srun -n 1 rocprof --timestamp on -i rocprof_counters.txt nbody-orig 65536 RPL: on '221130_205737' from '/global/software/rocm/rocm-5.3.0' in '/global/home/gmarko/HIP-Examples/mini-nbody/hip' RPL: profiling '"nbody-orig" "65536"' RPL: input file 'rocprof_counters.txt' RPL: output dir '/tmp/rpl_data_221130_205737_3673574' RPL: result dir '/tmp/rpl_data_221130_205737_3673574/input0_results_221130_205737' ROCProfiler: input from "/tmp/rpl_data_221130_205737_3673574/input0.xml" gpu_index = kernel = range = 7 metrics Wavefronts, VALUInsts, VFetchInsts, VWriteInsts, VALUUtilization, VALUBusy, WriteSize 65536, 155.389 ROCPRofiler: 10 contexts collected, output directory /tmp/rpl_data_221130_205737_3673574/input0_results_221130_205737 RPL: result dir '/tmp/rpl_data_221130_205737_3673574/input1_results_221130_205737' ROCProfiler: input from "/tmp/rpl_data_221130_205737_3673574/input1.xml" gpu_index = kernel = range = 7 metrics SALUInsts, SFetchInsts, LDSInsts, FlatLDSInsts, GDSInsts, SALUBusy, FetchSize 65536, 156.996 ROCPRofiler: 10 contexts collected, output directory /tmp/rpl_data_221130_205737_3673574/input1_results_221130_205737 RPL: result dir '/tmp/rpl_data_221130_205737_3673574/input2_results_221130_205737' ROCProfiler: input from "/tmp/rpl_data_221130_205737_3673574/input2.xml" gpu_index = kernel = range = 6 metrics L2CacheHit, MemUnitBusy, MemUnitStalled, WriteUnitStalled, ALUStalledByLDS, LDSBankConflict 65536, 155.264 ROCPRofiler: 10 contexts collected, output directory /tmp/rpl_data_221130_205737_3673574/input2_results_221130_205737 File '/global/home/gmarko/HIP-Examples/mini-nbody/hip/rocprof_counters.csv' is generating ``` * Contents of the rocprof_counters.csv file ``` cat rocprof_counters.csv Index,KernelName,gpu-id,queue-id,queue-index,pid,tid,grd,wgr,lds,scr,arch_vgpr,accum_vgpr,sgpr,wave_size,sig,obj,Wavefronts,VALUInsts,VFetchInsts,VWriteInsts,VALUUtilization,VALUBusy,WriteSize,SALUInsts,SFetchInsts,LDSInsts,FlatLDSInsts,GDSInsts,SALUBusy,FetchSize,L2CacheHit,MemUnitBusy,MemUnitStalled,WriteUnitStalled,ALUStalledByLDS,LDSBankConflict,DispatchNs,BeginNs,EndNs,CompleteNs 0,"bodyForce(Body*, float, int) [clone .kd]",0,0,0,3673711,3673711,65536,256,0,0,20,4,16,64,0x0,0x7f2b4d282500,2048.0000000000,1212443.0000000000,12.0000000000,12.0000000000,100.0000000000,68.1476813493,7872.0000000000,131228.5000000000,65553.0000000000,0.0000000000,0.0000000000,0.0000000000,6.3483148000,9429.1875000000,96.5684331443,0.0250344612,0.0044357832,0.0102024550,0.0000000000,0.0000000000,1594244102859719,1594244111978746,1594244138305243,1594244138330792 ... ``` ### Omnitrace We have made special builds of the Omnitools, omnitrace and omniperf for use in the exercises * Reserve a GPU * Load Omnitrace Declare `PATH` and `LD_LIBRARY_PATH` It is temporarily installed here: `/project/project_465000388/software/omnitrace/1.7.3/` Execute: * `export PATH=/project/project_465000388/software/omnitrace/1.7.3/bin:$PATH` * `export LD_LIBRARY_PATH=/project/project_465000388/software/omnitrace/1.7.3/lib:$LD_LIBRARY_PATH` * Allocate resources with `salloc` * Check the various options and their values and also a second command for description `srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace` `srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace --brief --description` * Create an Omnitrace configuration file with description per option `srun -n 1 omnitrace-avail -G omnitrace_all.cfg --all` * Declare to use this configuration file: `export OMNITRACE_CONFIG_FILE=/path/omnitrace_all.cfg` * Get the file https://github.com/ROCm-Developer-Tools/HIP/tree/develop/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp or `cp /project/project_465000388/exercises/AMD/MatrixTranspose.cpp .` * Compile `hipcc --offload-arch=gfx90a -o MatrixTranspose MatrixTranspose.cpp` * Execute the binary: `time srun -n 1 --gpus 1 ./MatrixTranspose` and check the duration #### Dynamic instrumentation * Execute dynamic instrumentation: `time srun –n 1 –-gpus 1 omnitrace -- ./MatrixTranspose` and check the duration * Check what the binary calls and gets instrumented: `nm --demangle MatrixTranspose | egrep -i ' (t|u) '` * Available functions to instrument: `srun -n 1 --gpus 1 omnitrace -v -1 --simulate --print-available functions -- ./MatrixTranspose` * the simulate option means that it will not execute the binary #### Binary rewriting * Binary rewriting: `srun -n 1 --gpus 1 omnitrace -v -1 --print-available functions -o matrix.inst -- ./MatrixTranspose` * We created a new instrumented binary called matrix.inst * Executing the new instrumented binary: `time srun -n 1 --gpus 1 ./matrix.inst` and check the duration * See the list of the instrumented GPU calls: `cat omnitrace-matrix.inst-output/TIMESTAMP/roctracer.txt` #### Visualization * Copy the `perfetto-trace.proto` to your laptop, open the web page https://ui.perfetto.dev/ click to open the trace and select the file #### Hardware counters * See a list of all the counters: `srun -n 1 --gpus 1 omnitrace-avail --all` * Declare in your configuration file: `OMNITRACE_ROCM_EVENTS = GPUBusy,Wavefronts,VALUBusy,L2CacheHit,MemUnitBusy` * Execute: `srun -n 1 --gpus 1 ./matrix.inst` and copy the perfetto file and visualize #### Sampling Activate in your configuration file `OMNITRACE_USE_SAMPLING = true` and `OMNITRACE_SAMPLING_FREQ = 100`, execute and visualize #### Kernel timings * Open the file `omnitrace-binary-output/timestamp/wall_clock.txt` (replace binary and timestamp with your information) * In order to see the kernels gathered in your configuration file, make sure that `OMNITRACE_USE_TIMEMORY = true` and `OMNITRACE_FLAT_PROFILE = true`, execute the code and open again the file `omnitrace-binary-output/timestamp/wall_clock.txt` ### Omniperf We have made built the Omniperf without GUI support for use in the exercises * Load Omniperf: `export PATH=/project/project_465000388/software/omniperf/bin/:$PATH` `export PYTHONPATH=/project/project_465000388/software/omniperf/python-libs:$PYTHONPATH` `module load cray-python` `module load rocm` * Reserve a GPU, compile the exercise and execute Omniperf, observe how many times the code is executed ``` salloc -p standard-g --gpus 1 -t 00:40:00 -A project_465000388 git clone https://github.com/AMD/HPCTrainingExamples.git cd HPCTrainingExamples/HIP/dgemm/ mkdir build cd build cmake .. make cd bin srun -n 1 omniperf profile -n dgemm -- ./dgemm -m 8192 -n 8192 -k 8192 -i 1 -r 10 -d 0 -o dgemm.csv ``` * Run `srun -n 1 --gpus 1 omniperf profile -h` to see all the options * Now is created a workload in the directory workloads with the name dgemmoh I mean for (the argument of the -n). So, we can analyze it ``` srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/mi200/ &> dgemm_analyze.txt ``` * If you want to only roofline analysis, then execute: `srun -n 1 --gpus 1 omniperf profile -n dgemm --roof-only -- ./dgemm -m 8192 -n 8192 -k 8192 -i 1 -r 10 -d 0 -o dgemm.csv` There is no need for srun to analyze but we want to avoid everybody to use the login node. Explore the file `dgemm_analyze.txt` * We can select specific IP Blocks, like: ``` srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/mi200/ -b 7.1.2 ``` But you need to know the code of the IP Block * If you have installed Omniperf on your laptop (no ROCm required for analysis) then you can download the data and execute: ``` omniperf analyze -p workloads/dgemm/mi200/ --gui ``` * Open the web page: http://IP:8050/ The IP will be displayed in the output * Use another cod, for example: https://github.com/amd/HPCTrainingExamples/blob/main/HIP/saxpy/saxpy.cpp

    Import from clipboard

    Paste your markdown or webpage here...

    Advanced permission required

    Your current role can only read. Ask the system administrator to acquire write and comment permission.

    This team is disabled

    Sorry, this team is disabled. You can't edit this note.

    This note is locked

    Sorry, only owner can edit this note.

    Reach the limit

    Sorry, you've reached the max length this note can be.
    Please reduce the content or divide it to more notes, thank you!

    Import from Gist

    Import from Snippet

    or

    Export to Snippet

    Are you sure?

    Do you really want to delete this note?
    All users will lose their connection.

    Create a note from template

    Create a note from template

    Oops...
    This template has been removed or transferred.
    Upgrade
    All
    • All
    • Team
    No template.

    Create a template

    Upgrade

    Delete template

    Do you really want to delete this template?
    Turn this template into a regular note and keep its content, versions, and comments.

    This page need refresh

    You have an incompatible client version.
    Refresh to update.
    New version available!
    See releases notes here
    Refresh to enjoy new features.
    Your user state has changed.
    Refresh to load new user state.

    Sign in

    Forgot password

    or

    By clicking below, you agree to our terms of service.

    Sign in via Facebook Sign in via Twitter Sign in via GitHub Sign in via Dropbox Sign in with Wallet
    Wallet ( )
    Connect another wallet

    New to HackMD? Sign up

    Help

    • English
    • 中文
    • Français
    • Deutsch
    • 日本語
    • Español
    • Català
    • Ελληνικά
    • Português
    • italiano
    • Türkçe
    • Русский
    • Nederlands
    • hrvatski jezik
    • język polski
    • Українська
    • हिन्दी
    • svenska
    • Esperanto
    • dansk

    Documents

    Help & Tutorial

    How to use Book mode

    Slide Example

    API Docs

    Edit in VSCode

    Install browser extension

    Contacts

    Feedback

    Discord

    Send us email

    Resources

    Releases

    Pricing

    Blog

    Policy

    Terms

    Privacy

    Cheatsheet

    Syntax Example Reference
    # Header Header 基本排版
    - Unordered List
    • Unordered List
    1. Ordered List
    1. Ordered List
    - [ ] Todo List
    • Todo List
    > Blockquote
    Blockquote
    **Bold font** Bold font
    *Italics font* Italics font
    ~~Strikethrough~~ Strikethrough
    19^th^ 19th
    H~2~O H2O
    ++Inserted text++ Inserted text
    ==Marked text== Marked text
    [link text](https:// "title") Link
    ![image alt](https:// "title") Image
    `Code` Code 在筆記中貼入程式碼
    ```javascript
    var i = 0;
    ```
    var i = 0;
    :smile: :smile: Emoji list
    {%youtube youtube_id %} Externals
    $L^aT_eX$ LaTeX
    :::info
    This is a alert area.
    :::

    This is a alert area.

    Versions and GitHub Sync
    Get Full History Access

    • Edit version name
    • Delete

    revision author avatar     named on  

    More Less

    Note content is identical to the latest version.
    Compare
      Choose a version
      No search result
      Version not found
    Sign in to link this note to GitHub
    Learn more
    This note is not linked with GitHub
     

    Feedback

    Submission failed, please try again

    Thanks for your support.

    On a scale of 0-10, how likely is it that you would recommend HackMD to your friends, family or business associates?

    Please give us some advice and help us improve HackMD.

     

    Thanks for your feedback

    Remove version name

    Do you want to remove this version name and description?

    Transfer ownership

    Transfer to
      Warning: is a public team. If you transfer note to this team, everyone on the web can find and read this note.

        Link with GitHub

        Please authorize HackMD on GitHub
        • Please sign in to GitHub and install the HackMD app on your GitHub repo.
        • HackMD links with GitHub through a GitHub App. You can choose which repo to install our App.
        Learn more  Sign in to GitHub

        Push the note to GitHub Push to GitHub Pull a file from GitHub

          Authorize again
         

        Choose which file to push to

        Select repo
        Refresh Authorize more repos
        Select branch
        Select file
        Select branch
        Choose version(s) to push
        • Save a new version and push
        • Choose from existing versions
        Include title and tags
        Available push count

        Pull from GitHub

         
        File from GitHub
        File from HackMD

        GitHub Link Settings

        File linked

        Linked by
        File path
        Last synced branch
        Available push count

        Danger Zone

        Unlink
        You will no longer receive notification when GitHub file changes after unlink.

        Syncing

        Push failed

        Push successfully