### November 30 Reservation: enccs_2

#### Point-to-point and collective

Get a node with at least two GPUs and two tasks:

`salloc -N 1 -p MI250 --gpus=2 --ntasks 2`

Load the OpenMPI module:

`module load openmpi/4.1.4-gcc`

Change the mpicxx wrapper compiler to use hipcc:

`export OMPI_CXX=hipcc`

Compile and run the code:

```
mpicxx -o ./pt2pt ./pt2pt.cpp
mpirun -n 2 ./pt2pt
```

You can get around the message "WARNING: There was an error initializing an OpenFabrics device" by telling OpenMPI to exclude openib:

`mpirun -n 2 --mca btl '^openib' ./pt2pt`

#### OSU Bandwidth benchmark code

Get a node allocation. Check what is available with `sinfo`. Then:

`salloc -N 1 --ntasks 8 --gpus=8 -p MI250`

##### Set up environment

`module load rocm openmpi/4.1.4-gcc`

##### Check if OpenMPI is built with UCX

`ompi_info`

##### Check if UCX is built with ROCm

`/global/software/openmpi/gcc/ucx/bin/ucx_info -v`

##### Get Ohio State University micro-benchmark code

`wget https://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-7.0.tar.gz`

##### Build micro-benchmark code

```
./configure --prefix=${HOME}/osu-mb \
    CC=/global/software/openmpi/gcc/ompi/bin/mpicc \
    CXX=/global/software/openmpi/gcc/ompi/bin/mpicxx \
    --enable-rocm --with-rocm=${ROCM_PATH}
make -j12
make install
```

##### Run benchmark

```
export HIP_VISIBLE_DEVICES=0,1
mpirun -N 1 -n 2 ./osu-mb/mpi/pt2pt/osu_bw $((16*1024*1024)):$((16*1024*1024)) D D
```

Notes:

* Try different pairs of GPUs. How does the bandwidth vary?
* Try different communication options (blit kernel and SDMA) using the environment variable HSA_ENABLE_SDMA. How does the bandwidth vary?

#### AMD Node Memory Model

##### Managed Memory

We'll use the examples in the AMDTrainingExamples archive from before:

```
tar -xzvf AMDTrainingExamples_ver0.2.tgz
cd AMDTrainingExamples/ManagedMemory
```

The example from HIP-Examples/vectorAdd has been duplicated here as vectoradd_hip.cpp. A slightly cleaned up version is in vectoradd_hip1.cpp. Check that you have the original version running first.

```
module load rocm
make vectoradd_hip1.exe
./vectoradd_hip1.exe
```

The original tests should run. If you encounter difficulties, you can check what is happening by setting the following environment variables:

```
export LIBOMPTARGET_KERNEL_TRACE=1
export LIBOMPTARGET_INFO=$((0x20 | 0x02 | 0x01 | 0x10))
```

Take that code and turn it into a managed memory version with the following steps:

1. Globally change all "host" strings to "vector"
2. Globally change all "device" strings to "vector"
3. Remove duplicate float declarations
4. Move both allocations above the initialization loop
5. Comment out all HIP data copies from host to device and device to host
6. Add `hipDeviceSynchronize();` after the kernel launch

* First experiment:
    * Comment out the hipMalloc/hipFree calls.
    * The test should fail with a memory access fault. Now set HSA_XNACK with `export HSA_XNACK=1`. Rerun, and the test should pass.
* Second experiment:
    * Comment out the malloc/free calls instead, and unset the HSA_XNACK variable or set it to 0.
    * The test should pass.

##### OpenMP Atomics

The examples for this exercise are in `AMDTrainingExamples/atomics_openmp`. Set up your environment:

```
module load aomp rocm
export CC=${AOMP}/bin/clang
```

Now let's look at the first example. The key lines are:

```
#pragma omp target teams distribute parallel for map(tofrom: a[:n]) map(to: b[:n])
for(int i = 0; i < n; i++) {
   a[i] += b[i];
}
```

Build the first example and run it. The map clauses copy the data from the host to the device and back, and the memory allocated on the device is coarse-grained.
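For context, a complete program built around those key lines might look roughly like the sketch below. This is a hedged reconstruction, not the actual `arraysum1.c`; the array size, variable names, and the pass/fail check are assumptions.

```
// Hypothetical arraysum-style example: host-allocated arrays, explicit map clauses.
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
   int n = 1000000;                                   // assumed problem size
   double *a = (double *)malloc(n * sizeof(double));  // plain host memory
   double *b = (double *)malloc(n * sizeof(double));

   for (int i = 0; i < n; i++) {                      // initialize on the host
      a[i] = 1.0;
      b[i] = 2.0;
   }

   // The map clauses copy a and b to the device and a back afterwards;
   // the device allocations are coarse-grained memory.
   #pragma omp target teams distribute parallel for map(tofrom: a[:n]) map(to: b[:n])
   for (int i = 0; i < n; i++) {
      a[i] += b[i];
   }

   int errors = 0;                                    // verify on the host
   for (int i = 0; i < n; i++)
      if (a[i] != 3.0) errors++;
   printf("%s\n", errors == 0 ? "PASSED" : "FAILED");

   free(a);
   free(b);
   return 0;
}
```

The course Makefile presumably adds the OpenMP offload flags for the AOMP clang selected above.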
```
make arraysum1
./arraysum1
```

It should run and pass. Now make the following changes:

1. Remove the map clauses
2. Add `#pragma omp requires unified_shared_memory` before main

There are now no explicit data transfers, so the system must move the memory for us. Let's try it out.

```
make arraysum2
./arraysum2
```

It should fail because the memory has not been moved. Setting HSA_XNACK should fix this problem.

```
export HSA_XNACK=1
./arraysum2
```

It should run now. The memory that is created is now fine-grained. We can change this back to coarse-grained by adding back the map clause. This is done in `arraysum3.c`.

Now let's move the initialization to a separate routine and use the map clause so that the memory is created as coarse-grained memory:

```
void init(int n, double *a, double *b) {
   #pragma omp target map(from:a[:n],b[:n])
   for(int i = 0; i < n; i++){
      a[i] = b[i] = 1.0;
   }
}
```

The main loop is:

```
#pragma omp target teams distribute parallel for
for(int i = 0; i < n; i++) {
   a[i] += b[i];
}
```

The memory in the main loop remains coarse-grained even though a map clause is not used. Once the memory is allocated, it keeps that type. Note that the initialization loop is not run in parallel on the GPU. How would you fix that?

We'll skip ahead to `arraysum8.c` with an atomic reduction. The key loop is:

```
#pragma omp target teams distribute parallel for reduction(+:ret)
for(int i = 0; i < n; i++) {
   #pragma omp atomic hint(AMD_fast_fp_atomics)
   ret += b[i];
}
```

This test should fail because the memory is fine-grained. Add the map clauses to the pragma as implemented in `arraysum9.c`. Note that the ret variable also needs to be mapped:

`map(to: b[:n]) map(tofrom: ret)`

Now compile and run. It should pass. (A sketch of a full version along these lines is given at the end of this section.)

```
make arraysum9
./arraysum9
```

Another solution is to change the atomic pragma to a safe version. This is shown in `arraysum10.c`:

`#pragma omp atomic hint(AMD_safe_fp_atomics)`

The safe atomic uses a compare-and-swap (CAS) loop instead of an atomic add. This works, but it will likely be slower.

The examples in `arraysum4.c` to `arraysum7.c` show the same atomic behavior, adding the elements of array b to array a and storing the result back in array a.
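As a wrap-up, here is a minimal sketch of a fixed version along the lines described for `arraysum9.c`: the reduction loop keeps the fast atomics hint, and the map clauses make the data it touches coarse-grained. This is an assumed reconstruction, not the actual course file; the size, names, and verification are illustrative.

```
// Hypothetical arraysum9-style example: map clauses restore coarse-grained
// device memory so the fast floating-point atomics behave correctly.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#pragma omp requires unified_shared_memory

int main(void) {
   int n = 1000000;                                   // assumed problem size
   double *b = (double *)malloc(n * sizeof(double));
   double ret = 0.0;

   for (int i = 0; i < n; i++) b[i] = 1.0;

   // Mapping b and ret gives the device coarse-grained copies of both,
   // so hint(AMD_fast_fp_atomics) works as intended.
   #pragma omp target teams distribute parallel for reduction(+:ret) \
           map(to: b[:n]) map(tofrom: ret)
   for (int i = 0; i < n; i++) {
      #pragma omp atomic hint(AMD_fast_fp_atomics)
      ret += b[i];
   }

   printf("%s\n", ret == (double)n ? "PASSED" : "FAILED");
   free(b);
   return 0;
}
```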
### Affinity

#### Understanding your system

Note: a new hwloc is installed here: `/global/training/enccs/hwloc/bin/`

Try the following commands to learn more about how the system is configured and which cores and GCDs belong to each NUMA domain.

Allocate a node:

`salloc -N1 -p MI250 --gpus=8 --cpus-per-gpu=16`

Then run:

```
lstopo -p out.svg
lscpu
numactl -H
rocm-smi --showtopo
```

#### Verifying Process and Thread Binding

##### Case 1: MPI + OpenMP

* Allocate a node: `salloc -N1 --gpus=8 --cpus-per-gpu=16 -p MI250`
* Load the necessary modules into the environment: `module load openmpi/4.1.4-gcc`
* Download hello_mpi_omp.c and the Makefile from https://code.ornl.gov/olcf/hello_mpi_omp
* Modify the Makefile to change the compiler: `CC=/global/software/openmpi/gcc/ompi/bin/mpicc`
* Compile: `make`
* Run the commands below.

`OMP_NUM_THREADS=2 OMP_PLACES=threads OMP_PROC_BIND=close mpirun -np 4 ./hello_mpi_omp`

```
MPI 003 - OMP 000 - HWT 048 - Node mun-node-5
MPI 003 - OMP 001 - HWT 049 - Node mun-node-5
MPI 000 - OMP 000 - HWT 000 - Node mun-node-5
MPI 000 - OMP 001 - HWT 001 - Node mun-node-5
MPI 002 - OMP 000 - HWT 032 - Node mun-node-5
MPI 002 - OMP 001 - HWT 033 - Node mun-node-5
MPI 001 - OMP 000 - HWT 016 - Node mun-node-5
MPI 001 - OMP 001 - HWT 017 - Node mun-node-5
```

`OMP_NUM_THREADS=2 OMP_PLACES=threads OMP_PROC_BIND=close mpirun -np 4 --report-bindings ./hello_mpi_omp`

Look for the binding report printed by mpirun.

`OMP_NUM_THREADS=2 OMP_PLACES=threads OMP_PROC_BIND=close mpirun -np 2 ./hello_mpi_omp`

```
MPI 000 - OMP 000 - HWT 000 - Node mun-node-5
MPI 000 - OMP 001 - HWT 000 - Node mun-node-5
MPI 001 - OMP 000 - HWT 001 - Node mun-node-5
MPI 001 - OMP 001 - HWT 001 - Node mun-node-5
```

==Notice that this is not giving us the intended binding.== Both OpenMP threads of each rank are bound to the same hardware thread, and cores 0 and 1 from the same socket are picked. From the mpirun man page, we see:

```
Please note that mpirun automatically binds processes as of the start of
the v1.8 series. Three binding patterns are used in the absence of any
further directives:

   Bind to core:    when the number of processes is <= 2
   Bind to socket:  when the number of processes is > 2
   Bind to none:    when oversubscribed
```

For this case, we don't know yet how to set up affinity correctly for each thread.
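For reference, the report lines above can be produced with code along the lines of the sketch below. This is a hedged illustration, not the actual ORNL `hello_mpi_omp.c`; it only shows how each OpenMP thread of each MPI rank can discover the hardware thread it is running on.

```
// Hypothetical MPI + OpenMP placement report, one line per OpenMP thread.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // for sched_getcpu() with glibc
#endif
#include <stdio.h>
#include <sched.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[]) {
   MPI_Init(&argc, &argv);

   int rank, namelen;
   char name[MPI_MAX_PROCESSOR_NAME];
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Get_processor_name(name, &namelen);

   #pragma omp parallel
   {
      int thread = omp_get_thread_num();
      int hwt = sched_getcpu();   // hardware thread this OpenMP thread is on
      #pragma omp critical
      printf("MPI %03d - OMP %03d - HWT %03d - Node %s\n", rank, thread, hwt, name);
   }

   MPI_Finalize();
   return 0;
}
```

Running such a program under different OMP_PLACES, OMP_PROC_BIND, and mpirun settings is what exposes the binding differences seen above.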
##### Case 2: MPI + OpenMP + HIP

* Allocate a node: `salloc -N1 --gpus=8 --cpus-per-gpu=16 -p MI250`
* Download hello_jobstep.cpp from here: https://code.ornl.gov/olcf/hello_jobstep
* Load the necessary modules into the environment: `module load rocm openmpi/4.1.4-gcc`
* Use this simpler Makefile to compile:

```
SOURCES = hello_jobstep.cpp
OBJECTS = $(SOURCES:.cpp=.o)
EXECUTABLE = hello_jobstep

CXX = /global/software/openmpi/gcc/ompi/bin/mpic++
CXXFLAGS = -fopenmp -I${ROCM_PATH}/include -D__HIP_PLATFORM_AMD__
LDFLAGS = -L${ROCM_PATH}/lib -lhsa-runtime64 -lamdhip64

all: ${EXECUTABLE}

%.o: %.cpp
	$(CXX) $(CXXFLAGS) -o $@ -c $<

$(EXECUTABLE): $(OBJECTS)
	$(CXX) $(CXXFLAGS) $(OBJECTS) -o $@ $(LDFLAGS)

clean:
	rm -f $(EXECUTABLE)
	rm -f $(OBJECTS)
```

* Compile: `make`
* Set up a helper script (`helper.sh`) that sets ROCR_VISIBLE_DEVICES and GOMP_CPU_AFFINITY to control GPU and CPU core affinity, respectively:

```
#!/bin/bash
export global_rank=${OMPI_COMM_WORLD_RANK}
export local_rank=${OMPI_COMM_WORLD_LOCAL_RANK}
export ranks_per_node=${OMPI_COMM_WORLD_LOCAL_SIZE}

if [ -z "${NUM_CPUS}" ]; then
  let NUM_CPUS=128
fi
if [ -z "${RANK_STRIDE}" ]; then
  let RANK_STRIDE=${NUM_CPUS}/${ranks_per_node}
fi
if [ -z "${OMP_STRIDE}" ]; then
  let OMP_STRIDE=1
fi
if [ -z "${NUM_GPUS}" ]; then
  let NUM_GPUS=8
fi
if [ -z "${GPU_START}" ]; then
  let GPU_START=0
fi
if [ -z "${GPU_STRIDE}" ]; then
  let GPU_STRIDE=1
fi

cpu_list=($(seq 0 127))
let cpus_per_gpu=${NUM_CPUS}/${NUM_GPUS}
let cpu_start_index=$(( ($RANK_STRIDE*${local_rank})+${GPU_START}*$cpus_per_gpu ))
let cpu_start=${cpu_list[$cpu_start_index]}
let cpu_stop=$(($cpu_start+$OMP_NUM_THREADS*$OMP_STRIDE-1))

gpu_list=(4 5 2 3 6 7 0 1)
let ranks_per_gpu=$(((${ranks_per_node}+${NUM_GPUS}-1)/${NUM_GPUS}))
let my_gpu_index=$(($local_rank*$GPU_STRIDE/$ranks_per_gpu))+${GPU_START}
let my_gpu=${gpu_list[${my_gpu_index}]}

export GOMP_CPU_AFFINITY=$cpu_start-$cpu_stop:$OMP_STRIDE
export ROCR_VISIBLE_DEVICES=$my_gpu

"$@"
```

Run the application using the helper script:

`OMP_NUM_THREADS=2 mpirun -np 8 ./helper.sh ./hello_jobstep`

* This runs 8 ranks with 2 threads per rank, assigning the GPUs to the ranks in the order given in gpu_list and binding 2 CPU cores from each set of 16 cores to each rank.

```
MPI 002 - OMP 000 - HWT 032 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 2 - Bus_ID 2f
MPI 002 - OMP 001 - HWT 033 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 2 - Bus_ID 2f
MPI 001 - OMP 000 - HWT 016 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 5 - Bus_ID b0
MPI 001 - OMP 001 - HWT 017 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 5 - Bus_ID b0
MPI 004 - OMP 000 - HWT 064 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 6 - Bus_ID b3
MPI 004 - OMP 001 - HWT 065 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 6 - Bus_ID b3
MPI 006 - OMP 000 - HWT 096 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 0 - Bus_ID 29
MPI 006 - OMP 001 - HWT 097 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 0 - Bus_ID 29
MPI 000 - OMP 000 - HWT 000 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 4 - Bus_ID ad
MPI 000 - OMP 001 - HWT 001 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 4 - Bus_ID ad
MPI 005 - OMP 000 - HWT 080 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 7 - Bus_ID b6
MPI 005 - OMP 001 - HWT 081 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 7 - Bus_ID b6
MPI 003 - OMP 000 - HWT 048 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 3 - Bus_ID 32
MPI 003 - OMP 001 - HWT 049 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 3 - Bus_ID 32
MPI 007 - OMP 000 - HWT 112 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 1 - Bus_ID 2c
MPI 007 - OMP 001 - HWT 113 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 1 - Bus_ID 2c
```
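For reference, the GPU-related fields in this report (RT_GPU_ID, GPU_ID, Bus_ID) can be obtained roughly as in the sketch below. This is a hedged illustration, not the actual ORNL `hello_jobstep.cpp`; the real program also prints the MPI, OpenMP, and core information and formats the PCI bus ID as just the bus byte.

```
// Hypothetical sketch of the GPU queries behind the report above.
#include <cstdio>
#include <cstdlib>
#include <hip/hip_runtime.h>

int main() {
   // With ROCR_VISIBLE_DEVICES set to a single GCD by the helper script, the
   // HIP runtime enumerates only one device, so the runtime device id is 0.
   int rt_gpu_id = 0;
   hipGetDevice(&rt_gpu_id);

   // The environment variable still records which physical GCD was selected.
   const char *gpu_id = getenv("ROCR_VISIBLE_DEVICES");

   // The PCI bus ID identifies the GCD independently of any renumbering;
   // the report above shows just the bus portion (e.g. "2f").
   char bus_id[64];
   hipDeviceGetPCIBusId(bus_id, sizeof(bus_id), rt_gpu_id);

   printf("RT_GPU_ID %d - GPU_ID %s - Bus_ID %s\n",
          rt_gpu_id, gpu_id ? gpu_id : "unset", bus_id);
   return 0;
}
```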
`OMP_NUM_THREADS=2 mpirun -np 16 ./helper.sh ./hello_jobstep`

* This runs 2 ranks per GCD, packed closely (ranks 0 and 1 run on GCD 4), and binds 2 cores from each set of 8 cores to each rank.

```
MPI 012 - OMP 000 - HWT 096 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 0 - Bus_ID 29
MPI 012 - OMP 001 - HWT 097 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 0 - Bus_ID 29
MPI 008 - OMP 000 - HWT 011 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 6 - Bus_ID b3
MPI 008 - OMP 001 - HWT 001 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 6 - Bus_ID b3
MPI 011 - OMP 000 - HWT 052 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 7 - Bus_ID b6
MPI 011 - OMP 001 - HWT 063 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 7 - Bus_ID b6
MPI 010 - OMP 000 - HWT 034 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 7 - Bus_ID b6
MPI 010 - OMP 001 - HWT 043 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 7 - Bus_ID b6
MPI 009 - OMP 000 - HWT 018 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 6 - Bus_ID b3
MPI 009 - OMP 001 - HWT 025 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 6 - Bus_ID b3
MPI 014 - OMP 000 - HWT 112 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 1 - Bus_ID 2c
MPI 014 - OMP 001 - HWT 113 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 1 - Bus_ID 2c
MPI 006 - OMP 000 - HWT 048 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 3 - Bus_ID 32
MPI 006 - OMP 001 - HWT 049 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 3 - Bus_ID 32
MPI 015 - OMP 000 - HWT 120 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 1 - Bus_ID 2c
MPI 015 - OMP 001 - HWT 121 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 1 - Bus_ID 2c
MPI 001 - OMP 000 - HWT 008 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 4 - Bus_ID ad
MPI 001 - OMP 001 - HWT 009 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 4 - Bus_ID ad
MPI 007 - OMP 000 - HWT 056 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 3 - Bus_ID 32
MPI 007 - OMP 001 - HWT 057 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 3 - Bus_ID 32
MPI 004 - OMP 000 - HWT 032 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 2 - Bus_ID 2f
MPI 004 - OMP 001 - HWT 033 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 2 - Bus_ID 2f
MPI 005 - OMP 000 - HWT 040 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 2 - Bus_ID 2f
MPI 005 - OMP 001 - HWT 041 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 2 - Bus_ID 2f
MPI 002 - OMP 000 - HWT 016 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 5 - Bus_ID b0
MPI 002 - OMP 001 - HWT 017 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 5 - Bus_ID b0
MPI 000 - OMP 000 - HWT 000 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 4 - Bus_ID ad
MPI 000 - OMP 001 - HWT 001 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 4 - Bus_ID ad
MPI 003 - OMP 000 - HWT 024 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 5 - Bus_ID b0
MPI 003 - OMP 001 - HWT 025 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 5 - Bus_ID b0
MPI 013 - OMP 000 - HWT 104 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 0 - Bus_ID 29
MPI 013 - OMP 001 - HWT 105 - Node mun-node-5 - RT_GPU_ID 0 - GPU_ID 0 - Bus_ID 29
```

---