# HIP 101 Porting CUDA codes to HIP

### 26 February 2021

## Schedule

| Time (in CET) | Topic |
| ---- | --- |
| 09:00 - 10:00 | Introduction to AMD architecture and HIP |
| 10:00 - 10:15 | Break |
| 10:15 - 10:45 | Deep dive into the Hipify tools and examples |
| 10:45 - 11:30 | Lunch |
| 11:30 - 16:00 | Hands-on sessions |

## Important Information

Github repository: https://github.com/csc-training/hip

Submitting jobs to Puhti: https://docs.csc.fi/computing/running/submitting-jobs/

SLURM reservation for this training: _gpu_training_

AMD porting guide: https://rocmdocs.amd.com/en/latest/Programming_Guides/HIP-porting-guide.html

AMD ROCm documentation: https://rocmdocs.amd.com/en/latest/

## Team

Instructor: George Markomanolis

Team: Cristian-Vasile Achim, Jussi Enkovaara, Fredrik Robertsen, Nicolino lo Gullo

## Structure of the repository

```bash
docs
porting
└── codes
    ├── saxpy/cuda
    ├── saxpy/cublas
    ├── Discrete_Hankel_Transform
    ├── Heat-Equation
    ├── 2D Wave Propagation
    ├── KMeans clustering
    └── Vector_addition
```

## Puhti

### Connect to Puhti

```bash
ssh trainingXXX@puhti.csc.fi
```

* Give your password and you should be placed in the directory:

```bash
/users/trainingXXX
```

### Explore the environment

```bash=
module list

Currently Loaded Modules:
  1) intel/19.0.4   2) hpcx-mpi/2.4.0   3) intel-mkl/2019.0.4   4) StdEnv
```

### SLURM

* Submit the script `sub.sh`

```bash
sbatch sub.sh
```

* Check the status of a job

```bash
squeue -u $USER
```

* Cancel a job

```bash
scancel JOBID
```

* Sample batch job script

```bash=
#!/bin/bash
#SBATCH --job-name=hip_test
#SBATCH --account=project_2000745
#SBATCH --partition=gpu
#SBATCH --time=00:05:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=8000
#SBATCH --gres=gpu:v100:1
#SBATCH --reservation=gpu_training

module load hip

srun my_hip_program
```

### HIP

* Load the HIP module

```bash
module load hip/4.0.0c
module list

Currently Loaded Modules:
  1) StdEnv   2) gcc/9.1.0   3) cuda/11.1.0   4) hip/4.0.0c   5) intel-mkl/2019.0.4   6) hpcx-mpi/2.4.0
```

There is also a module _hip/4.0.0_, but we also created _hip/4.0.0c_, which is built from the source code. In the future the module name will match the released version.
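As a quick sanity check that the `hip/4.0.0c` module and the CUDA backend work together, you can build a small device-query program with `hipcc`. This snippet is my own addition and is not part of the course repository; on a Puhti GPU node it should report the V100:

```cpp
#include <cstdio>
#include <hip/hip_runtime.h>

int main() {
    int count = 0;
    // On Puhti the HIP runtime calls map to the CUDA runtime underneath.
    if (hipGetDeviceCount(&count) != hipSuccess || count == 0) {
        std::printf("No GPU visible to HIP\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        hipDeviceProp_t prop;
        hipGetDeviceProperties(&prop, i);
        std::printf("Device %d: %s (%d multiprocessors)\n",
                    i, prop.name, prop.multiProcessorCount);
    }
    return 0;
}
```

Since the file has a `.cpp` extension, it would be compiled with, for example, `hipcc --x cu device_query.cpp -o device_query` (the `--x cu` flag is explained later in this material) and run on a GPU node with a batch script like the one shown above.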
* hipconfig ```bash= hipconfig HIP version : 4.0.20496-4f163c6 == hipconfig HIP_PATH : /appl/opt/rocm/rocm-4.0.0c/hip ROCM_PATH : /appl/opt/rocm/rocm-4.0.0c/ HIP_COMPILER : clang HIP_PLATFORM : nvcc HIP_RUNTIME : ROCclr CPP_CONFIG : -D__HIP_PLATFORM_NVCC__= -I/appl/opt/rocm/rocm-4.0.0c/hip/include -I/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2//include == nvcc nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2020 NVIDIA Corporation Built on Tue_Sep_15_19:10:02_PDT_2020 Cuda compilation tools, release 11.1, V11.1.74 Build cuda_11.1.TC455_06.29069683_0 === Environment Variables PATH=/appl/opt/rocm/rocm-4.0.0c/hip/bin:/appl/spack/install-tree/gcc-9.1.0/hwloc-2.0.2-wqrgpf/bin:/appl/opt/ucx/1.9.0-cuda/bin:/appl/spack/install-tree/gcc-9.1.0/openmpi-4.0.5-ym53tz/bin:/appl/spack/install-tree/gcc-9.1.0/hdf5-1.12.0-wtlera/bin:/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/bin:/appl/spack/install-tree/gcc-4.8.5/gcc-9.1.0-vpjht2/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/appl/bin:/users/markoman/.local/bin:/users/markoman/bin CUDA_PATH=/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/ HIPFORT_ARCH=nvptx HIP_PLATFORM=nvcc LD_LIBRARY_PATH=/appl/opt/rocm/rocm-4.0.0c/hip/lib:/appl/spack/install-tree/gcc-9.1.0/hwloc-2.0.2-wqrgpf/lib:/appl/opt/ucx/1.9.0-cuda/lib:/appl/spack/install-tree/gcc-9.1.0/openmpi-4.0.5-ym53tz/lib:/appl/spack/install-tree/gcc-9.1.0/hdf5-1.12.0-wtlera/lib:/appl/opt/cluster_studio_xe2019/compilers_and_libraries_2019.4.243/linux/tbb/lib/intel64_lin/gcc4.7:/appl/opt/cluster_studio_xe2019/compilers_and_libraries_2019.4.243/linux/compiler/lib/intel64_lin:/appl/opt/cluster_studio_xe2019/compilers_and_libraries_2019.4.243/linux/mkl/lib/intel64_lin:/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/lib64:/appl/spack/install-tree/gcc-4.8.5/gcc-9.1.0-vpjht2/lib64:/appl/spack/install-tree/gcc-4.8.5/gcc-9.1.0-vpjht2/lib:/appl/opt/rocm/rocm-4.0.0/hiprand/lib:/appl/opt/rocm/rocm-4.0.0c/hipblas/hipblas/lib HIP_RUNTIME=ROCclr HIPFORT_GPU=sm_70 CUDA_INSTALL_ROOT=/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2 HIPFORT_HOME=/appl/opt/rocm/rocm-4.0.0c//hipfort/ HIPFORT_ARCHGPU=nvptx-sm_70 HIPCC_OPTS=--x cu HIP_COMPILER=clang HIP_PATH=/appl/opt/rocm/rocm-4.0.0c/hip == Linux Kernel Hostname : puhti-login1.bullx Linux puhti-login1.bullx 3.10.0-1062.33.1.el7.x86_64 #1 SMP Thu Aug 13 10:55:03 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux LSB Version: :core-4.1-amd64:core-4.1-noarch Distributor ID: RedHatEnterpriseServer Description: Red Hat Enterprise Linux Server release 7.7 (Maipo) Release: 7.7 Codename: Maipo ``` * The wrapper to compile on NVIDIA system is called _hipcc_ ```bash= which hipcc /appl/opt/rocm/rocm-4.0.0c/hip/bin/hipcc ``` * You can read the file _/appl/opt/rocm/rocm-4.0.0c/hip/bin/hipcc_ for more information ```bash= hipcc -h Usage : nvcc [options] <inputfile> Options for specifying the compilation phase ============================================ More exactly, this option specifies up to which stage the input files must be compiled, according to the following compilation trajectories for different input file types: .c/.cc/.cpp/.cxx : preprocess, compile, link .o : link .i/.ii : compile, link .cu : preprocess, cuda frontend, PTX assemble, merge with host C code, compile, link .gpu : cicc compile into cubin .ptx : PTX assemble into cubin. 
```

## Porting CUDA codes to HIP

### General Guidelines

* Start porting the CUDA codes on an NVIDIA system
* When it is finished, compile the code with HIP on an AMD system (no access to AMD hardware yet)
* HIP can be used on both AMD and NVIDIA GPUs
* The script __hipconvertinplace-perl.sh__ can hipify all the files in a directory
* Some HIP libraries seem not to work on NVIDIA systems

### VERBOSE Mode

* If you want to see the command that hipcc actually executes, set the following:

```bash
export HIPCC_VERBOSE=1
```

* For example, on Puhti, the command:

```bash
hipcc "--gpu-architecture=sm_70" -g -O3 -I../common -c core_cuda.cu -o core_cuda.o
```

would also print the command that was actually executed:

```bash
hipcc-cmd: /appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2//bin/nvcc -D__HIP_ROCclr__ -Wno-deprecated-gpu-targets -isystem /appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2//include -isystem /appl/opt/rocm/rocm-4.0.0c/hip/include --gpu-architecture=sm_70 -g -O3 -I../common -c core_cuda.cu -o core_cuda.o
```

### Debug hipcc

Add the following to your submission script before the srun command:

```
export AMD_LOG_LEVEL=4
```

For example, without debug mode:

```
srun: error: r01g01: task 0: Segmentation fault
srun: Terminating job step 4339273.0
```

with debug mode:

```
:3:rocdevice.cpp :458 : 2193024923864 us: Initializing HSA stack.
:1:rocdevice.cpp :466 : 2193024923948 us: hsa_init failed.
:4:runtime.cpp :82 : 2193024923950 us: init
srun: error: r01g01: task 0: Segmentation fault
srun: Terminating job step 4339273.0
```

The output shows that the library in use requires AMD hardware, so it crashes immediately. In a real execution you will observe a lot of output data.

## Exercises - Demonstration

At this point we assume that you have cloned the GitHub repository of the training; if not, clone it now:

```bash
$ git clone https://github.com/csc-training/hip.git
$ cd hip
$ export rootdir=$PWD
```

Acknowledgment: Some exercises were provided by Cristian-Vasile Achim, Jussi Enkovaara, and AMD, or were found online.

### Exercise: SAXPY CUDA

#### Steps

SAXPY stands for Single-Precision A*X Plus Y; it combines a scalar multiplication and a vector addition.

```bash
cd ${rootdir}/porting/codes/saxpy/cuda
```

##### Check the file saxpy.cu with an editor

```c=
#include <stdio.h>

__global__
void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
  int N = 1<<30;
  float *x, *y, *d_x, *d_y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  cudaMalloc(&d_x, N*sizeof(float));
  cudaMalloc(&d_y, N*sizeof(float));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

  // Perform SAXPY on the N elements
  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = max(maxError, abs(y[i]-4.0f));
  printf("Max error: %f\n", maxError);

  cudaFree(d_x);
  cudaFree(d_y);
  free(x);
  free(y);
}
```

##### Compile the CUDA code

```bash
make clean
make
```

##### Submit

```bash
sbatch sub.sh
```

Check the files out_* and error_*. The error file reports the execution time, which is close to 7.1 seconds, and the out_* file contains the max error, which should be 0.
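Note that this figure is the wall time of the whole run, including allocating and transferring the large arrays, not just the kernel. If you want to compare the CUDA and HIP builds more precisely, one option (my own addition, not part of the exercise code) is to time only the kernel with events; the snippet below would go into `main()` after the host-to-device copies, and hipify translates the `cudaEvent*` calls to their `hipEvent*` equivalents:

```c
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
// The kernel launch from the exercise
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
printf("saxpy kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```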
#### Hipify ```bash cp Makefile saxpy.cu sub.sh ../hip/ cd ../hip ``` * Examine the hipify procedure ```bash= module load hip/4.0.0c hipexamine-perl.sh saxpy.cu info: converted 14 CUDA->HIP refs ( error:0 init:0 version:0 device:0 context:0 module:0 memory:7 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:3 include:0 include_cuda_main_header:0 type:0 literal:0 numeric_literal:3 define:0 extern_shared:0 kernel_launch:1 ) warn:0 LOC:42 in 'saxpy.cu' hipMemcpy 3 hipFree 2 hipMemcpyHostToDevice 2 hipMalloc 2 hipLaunchKernelGGL 1 hipMemcpyDeviceToHost 1 ``` There is no warning, thus all the code can be hipified. ```bash hipify-perl --inplace saxpy.cu ``` The file saxpy.cu is hipified: ```c= #include "hip/hip_runtime.h" #include <stdio.h> __global__ void saxpy(int n, float a, float *x, float *y) { int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < n) y[i] = a*x[i] + y[i]; } int main(void) { int N = 1<<30; float *x, *y, *d_x, *d_y; x = (float*)malloc(N*sizeof(float)); y = (float*)malloc(N*sizeof(float)); hipMalloc(&d_x, N*sizeof(float)); hipMalloc(&d_y, N*sizeof(float)); for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; } hipMemcpy(d_x, x, N*sizeof(float), hipMemcpyHostToDevice); hipMemcpy(d_y, y, N*sizeof(float), hipMemcpyHostToDevice); // Perform SAXPY on 1M elements hipLaunchKernelGGL(saxpy, dim3((N+255)/256), dim3(256), 0, 0, N, 2.0f, d_x, d_y); hipMemcpy(y, d_y, N*sizeof(float), hipMemcpyDeviceToHost); float maxError = 0.0f; for (int i = 0; i < N; i++) maxError = max(maxError, abs(y[i]-4.0f)); printf("Max error: %f\n", maxError); hipFree(d_x); hipFree(d_y); free(x); free(y); } ``` 14 CUDA calls were converted with no errors. ##### Compile * Edit the Makefile and change the _nvcc_ to _hipcc_ ```cmake CC = nvcc ``` Modify to ```cmake CC = hipcc ``` * Clean and compile ```bash make clean make ``` * Submit ```bash sbatch sub.sh ``` Check the files out_* and error_* The error output includes the duration for the execution which is close to 7.32 seconds and the out_* file includes the max error which should be 0. The overhead seems to be close to 3%. 
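Neither the CUDA original nor the hipified version checks the return values of the runtime calls, so failures (for example an out-of-memory `hipMalloc` with `N = 1<<30`) pass silently. When a port misbehaves, it helps to wrap every `hip*` call in a small helper along the following lines; this is a sketch of my own, not part of the exercise code:

```cpp
#include <cstdio>
#include <cstdlib>
#include <hip/hip_runtime.h>

// Abort with file/line information if a HIP runtime call fails.
#define HIP_CHECK(cmd)                                                \
  do {                                                                \
    hipError_t err_ = (cmd);                                          \
    if (err_ != hipSuccess) {                                         \
      std::fprintf(stderr, "HIP error '%s' at %s:%d\n",               \
                   hipGetErrorString(err_), __FILE__, __LINE__);      \
      std::exit(EXIT_FAILURE);                                        \
    }                                                                 \
  } while (0)

// Usage in the exercise code, for example:
//   HIP_CHECK(hipMalloc(&d_x, N * sizeof(float)));
//   HIP_CHECK(hipMemcpy(d_x, x, N * sizeof(float), hipMemcpyHostToDevice));
```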
The solution is here: https://github.com/csc-training/hip/tree/main/porting/codes/saxpy/hip_solution ##### If you want to chance the .cu file to .cpp ```bash mv saxpy.cu saxpy.cpp ``` * Edit Makefile * HIP Makefile with .cu ```cmake= # Compiler can be set below, or via environment variable CC = hipcc OPTIMIZE = yes # #=============================================================================== # Program name & source code list #=============================================================================== program = saxpy source = saxpy.cu obj = $(source:.cu=.o) #=============================================================================== # Sets Flags #=============================================================================== # Standard Flags CFLAGS := -Xcompiler -Wall # Linker Flags LDFLAGS = # Optimization Flags ifeq ($(OPTIMIZE),yes) CFLAGS += -O3 endif #=============================================================================== # Targets to Build #=============================================================================== # $(program): $(obj) Makefile $(CC) $(CFLAGS) $(obj) -o $@ $(LDFLAGS) %.o: %.cu Makefile $(CC) $(CFLAGS) -c $< -o $@ ``` * HIP Makefile with .cpp ```cmake= # Compiler can be set below, or via environment variable CC = hipcc HIP_CU = hipcc --x cu OPTIMIZE = yes # #=============================================================================== # Program name & source code list #=============================================================================== program = saxpy source = saxpy.cpp obj = $(source:.cpp=.o) #=============================================================================== # Sets Flags #=============================================================================== # Standard Flags CFLAGS := -Xcompiler -Wall # Linker Flags LDFLAGS = # Optimization Flags ifeq ($(OPTIMIZE),yes) CFLAGS += -O3 endif #=============================================================================== # Targets to Build #=============================================================================== # $(program): $(obj) Makefile $(CC) $(CFLAGS) $(obj) -o $@ $(LDFLAGS) %.o: %.cpp Makefile $(HIP_CU) $(CFLAGS) -c $< -o $@ ``` ### Lessons learned: * Replace __nvcc__ with __hipcc__ in the Makefile * Hipify in-place with `hipify-perl --inplace filename` * For NVIDIA system, if the HIP code is in a file with extension __.cpp__ use __hipcc --x cu__ instead of __hipcc__ ### Exercise: SAXPY CUBLAS #### Steps SAXPY but using cuBLAS ```bash cd ${rootdir}/porting/codes/saxpy/cublas ``` ##### Check the file with an editor ```cpp= #include <iostream> #include "cublas_v2.h" using namespace std; int N = 1 << 30; int main(){ float *a_h, *b_h; a_h = new float[N]; b_h = new float[N]; float *a_d, *b_d; for(int i = 0; i < N; i++){ a_h[i] = 1.0f; b_h[i] = 2.0f ; } cublasHandle_t handle; cublasCreate(&handle); cudaMalloc((void**) &a_d, sizeof(float) * N); cudaMalloc((void**) &b_d, sizeof(float) * N); cublasSetVector( N, sizeof(float), a_h, 1, a_d, 1); cublasSetVector( N, sizeof(float), b_h, 1, b_d, 1); const float s = 2.0f; cublasSaxpy( handle, N, &s, a_d, 1, b_d, 1); cublasGetVector( N, sizeof(float), b_d, 1, b_h, 1); cudaFree(a_d); cudaFree(b_d); cublasDestroy(handle); float maxError = 0.0f; for(int i = 0; i < N; i++) maxError = max(maxError, abs(b_h[i]-4.0f)); cout << "Max error: " << maxError << endl; delete[] a_h; delete[] b_h; return 0; } ``` ##### Compile CUDA code ```bash make clean make ``` ##### Submit ```bash sbatch sub.sh ``` Check the files out_* and error_* The error 
output includes the duration for the execution which is close to 7.1 seconds and the out_* file includes the max error which should be 0. #### Hipify * Examine the hipify procedure ```bash= make clean cp * ../hipblas cd ../hipblas module load hip/4.0.0 hipexamine-perl.sh saxpy_cublas.cu info: converted 12 CUDA->HIP refs ( error:0 init:0 version:0 device:0 context:0 module:0 memory:4 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:6 device_library:0 device_function:0 include:0 include_cuda_main_header:1 type:1 literal:0 numeric_literal:0 define:0 extern_shared:0 kernel_launch:0 ) warn:0 LOC:39 in 'saxpy_cublas.cu' hipFree 2 hipMalloc 2 ``` We observe the there are 6 library calls that will be converted. Hipify the code: ```bash hipify-perl --inplace saxpy_cublas.cu ``` Now the code is: ```cpp= #include <iostream> #include "hipblas.h" using namespace std; const int N = 1 << 30; int main(){ float *a_h, *b_h; a_h = new float[N]; b_h = new float[N]; float *a_d, *b_d; for(int i = 0; i < N; i++){ a_h[i] = 1.0f; b_h[i] = 2.0f ; } hipblasHandle_t handle; hipblasCreate(&handle); hipMalloc((void**) &a_d, sizeof(float) * N); hipMalloc((void**) &b_d, sizeof(float) * N); hipblasSetVector( N, sizeof(float), a_h, 1, a_d, 1); hipblasSetVector( N, sizeof(float), b_h, 1, b_d, 1); const float s = 2.0f; hipblasSaxpy( handle, N, &s, a_d, 1, b_d, 1); hipblasGetVector( N, sizeof(float), b_d, 1, b_h, 1); hipFree(a_d); hipFree(b_d); hipblasDestroy(handle); float maxError = 0.0f; for(int i = 0; i < N; i++) maxError = max(maxError, abs(b_h[i]-4.0f)); cout << "Max error: " << maxError << endl; delete[] a_h; delete[] b_h; return 0; } ``` #### Compile * Modify the Makefile to: ```cmake= ... CC = hipcc ... CFLAGS := -Xcompiler -Wall -I/appl/opt/rocm/rocm-4.0.0c/hipblas/hipblas/include ... LDFLAGS = -L/appl/opt/rocm/rocm-4.0.0c/hipblas/hipblas/lib/ -lhipblas ... ``` * Define variables to find the hipBLAS header and library and compile ```bash= export LD_LIBRARY_PATH=/appl/opt/rocm/rocm-4.0.0c/hipblas/hipblas/lib/:$LD_LIBRARY_PATH ``` Load the custom installation of ROCm ```bash= module load hip/4.0.0c make clean make ``` * Submit your job script ```bash sbatch sub.sh ``` * Check the out* and error* files. 
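The hipBLAS calls also return status codes that the exercise code ignores. Checking them narrows problems down quickly when a library port misbehaves; below is a sketch in the same spirit as the plain HIP error check shown earlier (the macro name is my own):

```cpp
#include <cstdio>
#include <cstdlib>
#include "hipblas.h"

// Abort if a hipBLAS call does not return HIPBLAS_STATUS_SUCCESS.
#define HIPBLAS_CHECK(cmd)                                            \
  do {                                                                \
    hipblasStatus_t stat_ = (cmd);                                    \
    if (stat_ != HIPBLAS_STATUS_SUCCESS) {                            \
      std::fprintf(stderr, "hipBLAS error %d at %s:%d\n",             \
                   static_cast<int>(stat_), __FILE__, __LINE__);      \
      std::exit(EXIT_FAILURE);                                        \
    }                                                                 \
  } while (0)

// For example: HIPBLAS_CHECK(hipblasSaxpy(handle, N, &s, a_d, 1, b_d, 1));
```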
The solution is here: https://github.com/csc-training/hip/tree/main/porting/codes/saxpy/hipblas_solution ### Lessons learned: * Always link with the appropriate library when it is available * Do not forget to declare the _LD_LIBRARY_PATH_ environment variable * Adjust the Makefile ### Exercise: Discrete_Hankel_Transform Description: https://github.com/csc-training/hip/tree/main/porting/codes/Discrete_Hankel_Transform #### Steps ```bash cd ${rootdir}/porting/codes/Discrete_Hankel_Transform/cuda ``` ##### Compile and Execute ```bash nvcc -arch=sm_70 -o code Code.cu sbatch sub.sh ``` #### Hipify, Compile, and Execute ```bash= cp * ../hip/ cd ../hip/ $ hipexamine-perl.sh Code.cu info: converted 46 CUDA->HIP refs ( error:0 init:0 version:0 device:0 context:0 module:0 memory:24 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:11 include:0 include_cuda_main_header:0 type:0 literal:0 numeric_literal:9 define:0 extern_shared:0 kernel_launch:2 ) warn:0 LOC:220 in 'Code.cu' hipMalloc 9 hipMemcpy 9 hipMemcpyHostToDevice 7 hipFree 3 hipLaunchKernelGGL 2 hipMemGetInfo 2 hipMemcpyDeviceToHost 2 hipMemset 1 $ hipify-perl --inplace Code.cu $ ls bessel_zeros.in Code.cu Code.cu.prehip README.md sub.sh hipcc -arch=sm_70 -o code Code.cu sbatch sub.sh ``` The solution is here: https://github.com/csc-training/hip/tree/main/porting/codes/Discrete_Hankel_Transform/hip_solution ### Exercise: Heat Equation Example implementations of two dimensional heat equation. #### CUDA ```bash cd ${rootdir}/porting/codes/heat-equation/cuda ``` * Load CUDA aware MPI ```bash module load openmpi ``` * Compile and execute ```bash= make sbatch sub.sh ``` * Check the out* file, for example: ```bash= cat out_5003895 Average temperature at start: 59.763305 Iteration took 0.179 seconds. 
Average temperature: 59.281239 Reference value with default arguments: 59.281239 ``` #### Hipify ```bash= make clean mkdir ../hip cp *.cpp *.h *.cu ../hip/ cd ../hip hipexamine-perl.sh info: converted 13 CUDA->HIP refs ( error:0 init:0 version:0 device:1 context:0 module:0 memory:6 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:1 include_cuda_main_header:0 type:0 literal:0 numeric_literal:4 define:0 extern_shared:0 kernel_launch:1 ) warn:0 LOC:90 in './core_cuda.cu' info: converted 3 CUDA->HIP refs ( error:0 init:0 version:0 device:2 context:0 module:0 memory:0 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:1 include_cuda_main_header:0 type:0 literal:0 numeric_literal:0 define:0 extern_shared:0 kernel_launch:0 ) warn:0 LOC:200 in './setup.cpp' info: TOTAL-converted 16 CUDA->HIP refs ( error:0 init:0 version:0 device:3 context:0 module:0 memory:6 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:2 include_cuda_main_header:0 type:0 literal:0 numeric_literal:4 define:0 extern_shared:0 kernel_launch:1 ) warn:0 LOC:702 kernels (1 total) : evolve_kernel(1) hipMemcpy 4 hipMemcpyHostToDevice 3 hipMalloc 2 hip_runtime_api 2 hipGetDeviceCount 1 hipDeviceSynchronize 1 hipLaunchKernelGGL 1 hipSetDevice 1 hipMemcpyDeviceToHost 1 hipconvertinplace-perl.sh . 
info: converted 13 CUDA->HIP refs ( error:0 init:0 version:0 device:1 context:0 module:0 memory:6 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:1 include_cuda_main_header:0 type:0 literal:0 numeric_literal:4 define:0 extern_shared:0 kernel_launch:1 ) warn:0 LOC:90 in './core_cuda.cu' info: converted 3 CUDA->HIP refs ( error:0 init:0 version:0 device:2 context:0 module:0 memory:0 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:1 include_cuda_main_header:0 type:0 literal:0 numeric_literal:0 define:0 extern_shared:0 kernel_launch:0 ) warn:0 LOC:200 in './setup.cpp' info: TOTAL-converted 16 CUDA->HIP refs ( error:0 init:0 version:0 device:3 context:0 module:0 memory:6 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:2 include_cuda_main_header:0 type:0 literal:0 numeric_literal:4 define:0 extern_shared:0 kernel_launch:1 ) warn:0 LOC:702 kernels (1 total) : evolve_kernel(1) hipMemcpy 4 hipMemcpyHostToDevice 3 hipMalloc 2 hip_runtime_api 2 hipGetDeviceCount 1 hipDeviceSynchronize 1 hipLaunchKernelGGL 1 hipSetDevice 1 hipMemcpyDeviceToHost 1 ls core.cpp core_cuda.cu heat.h io.cpp main.cpp Makefile setup.cpp sub2.sh utilities.cpp core.cpp.prehip core_cuda.cu.prehip heat.h.prehip io.cpp.prehip main.cpp.prehip Makefile_orig setup.cpp.prehip sub.sh utilities.cpp.prehip ``` #### Update the Makefile * Original Makefile ```cmake= ifeq ($(COMP),) COMP=gnu endif ... ifeq ($(COMP),gnu) CXX=mpicxx CC=gcc NVCC=nvcc NVCCFLAGS=-g -O3 -I$(COMMONDIR) CCFLAGS=-g -O3 -Wall -I$(COMMONDIR) LDFLAGS= LIBS=-lpng -lcudart endif EXE=heat_cuda OBJS=main.o core.o core_cuda.o setup.o utilities.o io.o OBJS_PNG=$(COMMONDIR)/pngwriter.o ... $(EXE): $(OBJS) $(OBJS_PNG) $(CXX) $(CCFLAGS) $(OBJS) $(OBJS_PNG) -o $@ $(LDFLAGS) $(LIBS) %.o: %.cpp $(CXX) $(CCFLAGS) -c $< -o $@ %.o: %.c $(CC) $(CCFLAGS) -c $< -o $@ %.o: %.cu $(NVCC) $(NVCCFLAGS) -c $< -o $@ ``` * Tips * Use __hipcc__ to compile the code with HIP calls * __hipcc__ can add the necessary options to link with the default libraries that are required (not the MPI etc.) * New Makefile with regards that we use __nvcc__ under the __hipcc__ ```cmake= ifeq ($(COMP),) COMP=hipcc endif ... ifeq ($(COMP),hipcc) CXX=hipcc CC=gcc NVCC=hipcc --x cu NVCCFLAGS=-g -O3 -I$(COMMONDIR) CXXFLAGS=-g -O3 -Xcompiler -Wall -I$(COMMONDIR) CCFLAGS=-g -O3 -Wall -I$(COMMONDIR) LDFLAGS= LIBS=-lpng -lmpi endif ... 
$(EXE): $(OBJS) $(OBJS_PNG) $(CXX) $(CXXFLAGS) $(OBJS) $(OBJS_PNG) -o $@ $(LDFLAGS) $(LIBS) %.o: %.cpp $(CXX) $(CXXFLAGS) -c $< -o $@ %.o: %.c $(CC) $(CCFLAGS) -c $< -o $@ %.o: %.cu $(NVCC) $(NVCCFLAGS) -c $< -o $@ ``` * Compile and Execute ```bash= make sbatch sub.sh ``` ### Exercise: 2D WAVE Propagation 2D Wave Propagation #### CUDA and CPU The 2D Wave Propagation case was provided by Ludovic Rass ```bash= cd ${rootdir}/porting/codes/wave_2d/cuda_cpu ls compile.sh sub.sh vizme2D.m Wave_2D.c Wave_2D.cu ``` The file _Wave_2D.c_ is for CPU and the _Wave_2D.cu_ is for GPU. * Compile and Submit ```bash= cat compile.sh #!/bin/bash g++ -O3 Wave_2D.c -o wcpu nvcc -arch=sm_70 -O3 Wave_2D.cu -o wgpu ./compile.sh sbatch sub.sh ``` * Check the out* file ```bash= cat out_5015029 Perf: 220 iterations took 7.392e-03 seconds @ 32.9915 GB/s. Process uses GPU with id 0 . Perf: 220 iterations took 3.312e-03 seconds @ 73.6352 GB/s. ``` The CUDA code has 2.28 times better bandwidth. Of course, it depends on the problem size which in this case seems small. ##### HIP There is a script to compile the code in the _hip_ folder already. Copy the CUDA file to the _../hip_ directory ```bash cp *.cu sub.sh ../hip cd ../hip ``` ##### Hipify ```bash= hipify-perl --print-stats --inplace Wave_2D.cu info: converted 28 CUDA->HIP refs ( error:2 init:0 version:0 device:9 context:0 module:0 memory:4 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:4 include:0 include_cuda_main_header:1 type:1 literal:0 numeric_literal:4 define:0 extern_shared:0 kernel_launch:3 ) warn:0 LOC:126 in 'Wave_2D.cu' hipDeviceReset 3 hipDeviceSynchronize 3 hipLaunchKernelGGL 3 hipMemcpy 2 hipGetDevice 1 hipError_t 1 hipFree 1 hipDeviceSetCacheConfig 1 hipMalloc 1 hip_runtime 1 hipSetDevice 1 hipMemcpyDeviceToHost 1 hipMemcpyHostToDevice 1 hipSuccess 1 hipGetErrorString 1 hipGetLastError 1 hipFuncCachePreferL1 1 ``` * Compile and submit Before you proceed with the submission, edit the `sub.sh` and comment the srun command to execute the CPU executable ```bash= ./compile.sh sbatch sub.sh ``` * From the output file ```bash= cat out_* Process uses GPU with id 0 . Perf: 220 iterations took 3.385e-03 seconds @ 72.0481 GB/s. ``` The HIP version provides similar results to the CUDA version with a small overhead. ### Exercise: KMeans Parallel k-means clustering code #### CUDA ```bash= cd ${rootdir}porting/codes/kmeans/cuda ls cuda_io.cu cuda_kmeans.cu cuda_main.cu cuda_wtime.cu Image_data kmeans.h LICENSE Makefile README sample.output sub.sh ``` * Compile and Execute ```bash= make cuda sbatch sub.sh ``` * We can check the out* and error* files ```bash= Writing coordinates of K=128 cluster centers to file "Image_data/color17695.bin.cluster_centres" Writing membership of N=17695 data objects to file "Image_data/color17695.bin.membership" Performing **** Regular Kmeans (CUDA version) **** Input file: Image_data/color17695.bin numObjs = 17695 numCoords = 9 numClusters = 128 threshold = 0.0010 Loop iterations = 131 I/O time = 0.0529 sec Computation timing = 0.2059 sec ``` #### HIP * Copy the data to the _../hip_ directory ```bash= cp -r *.cu *.h Image_data ../hip cd ../hip ``` * Hipify ```bash= hipconvertinplace-perl.sh . 
info: converted 28 CUDA->HIP refs ( error:0 init:0 version:0 device:4 context:0 module:0 memory:13 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:1 include:0 include_cuda_main_header:0 type:1 literal:0 numeric_literal:5 define:0 extern_shared:2 kernel_launch:2 ) warn:0 LOC:372 in './cuda_kmeans.cu' info: converted 8 CUDA->HIP refs ( error:3 init:0 version:0 device:0 context:0 module:0 memory:0 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:1 include_cuda_main_header:0 type:2 literal:0 numeric_literal:1 define:1 extern_shared:0 kernel_launch:0 ) warn:0 LOC:79 in './kmeans.h' info: TOTAL-converted 36 CUDA->HIP refs ( error:3 init:0 version:0 device:4 context:0 module:0 memory:13 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:1 include:1 include_cuda_main_header:0 type:3 literal:0 numeric_literal:6 define:1 extern_shared:2 kernel_launch:2 ) warn:0 LOC:843 kernels (1 total) : compute_delta(1) hipMemcpy 5 hipFree 4 hipMalloc 4 hipMemcpyHostToDevice 3 hipError_t 2 hipDeviceSynchronize 2 hipLaunchKernelGGL 2 hipGetErrorString 2 HIP_DYNAMIC_SHARED 2 hipMemcpyDeviceToHost 2 hipGetDevice 1 hipDeviceProp_t 1 hipGetDeviceProperties 1 hipSuccess 1 hipGetLastError 1 ``` * Compile and execute ```bash= make -f Makefile.hip sbatch sub.sh ``` * The output file ```bash= Performing **** Regular Kmeans (CUDA version) **** Input file: Image_data/color17695.bin numObjs = 17695 numCoords = 9 numClusters = 128 threshold = 0.0010 Loop iterations = 131 I/O time = 0.0081 sec Computation timing = 0.2000 sec ``` ### Exercise: Madgraph 4 GPU This code developed in the context of porting the MadGraph5_aMC@NLO event generator software onto GPU hardware. MadGraph5_aMC@NLO is able to generate code for various physics processes in different programming languages (Fortran, C, C++). ```bash= cd ${rootdir}/porting/codes/ mkdir madgraph4gpu cd madgraph4gpu wget https://github.com/madgraph5/madgraph4gpu/archive/master.zip unzip master.zip cd madgraph4gpu-master/ ls epoch0 epoch1 epoch2 README.md test tools cd epoch1/ cp -r cuda hip ``` #### Hipify ```bash= hipconvertinplace-perl.sh hip/ info: converted 7 CUDA->HIP refs ( error:0 init:0 version:0 device:0 context:0 module:0 memory:3 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:4 include:0 include_cuda_main_header:0 type:0 literal:0 numeric_literal:0 define:0 extern_shared:0 kernel_launch:0 ) warn:0 LOC:841 in 'hip/gg_tt/SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cu' ... 
info: TOTAL-converted 294 CUDA->HIP refs ( error:11 init:0 version:0 device:3 context:0 module:0 memory:59 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:14 library:6 device_library:16 device_function:35 include:0 include_cuda_main_header:1 type:26 literal:0 numeric_literal:31 define:79 extern_shared:0 kernel_launch:13 ) warn:2 LOC:15920 warning: unconverted cudaTearDown : 2 kernels (2 total) : sigmaKin(3) gProc::sigmaKin(2) hipMemcpy 23 hipMemcpyToSymbol 19 hipMemcpyDeviceToHost 18 hipLaunchKernelGGL 18 hipFree 14 hipMalloc 11 hipPeekAtLastError 9 hipMemcpyHostToDevice 6 hipDeviceReset 5 hipSuccess 5 HIP_SYMBOL 5 hipError_t 4 hipGetErrorString 4 hipDoubleComplex 3 hip_runtime 3 hipHostFree 3 hipHostMalloc 3 hipMemcpy3D 3 hipFloatComplex 3 hipMemcpy2D 3 HIPRAND_STATUS_SUCCESS 2 HIPRAND_RNG_PSEUDO_PHILOX4_32_10 2 HIPRAND_RNG_PSEUDO_MT19937 2 HIPRAND_RNG_PSEUDO_MTGP32 2 HIPRAND_RNG_PSEUDO_XORWOW 2 HIPRAND_RNG_PSEUDO_MRG32K3A 2 hipComplex 1 hip_complex 1 hipCsubf 1 hipCmulf 1 hipCaddf 1 hipCdiv 1 hipCrealf 1 hipCsub 1 hipCreal 1 hipCimag 1 hipCmul 1 hipCadd 1 hipCimagf 1 hipCdivf 1 ``` * The warning is about a cuda variable deployed by the developers, so it is safe. The _hiprand_ is not working on our environment this moment, thus the utilization of this code requires the actual AMD hardware, however, it is ported. It requires tuning and checking. ### CUDA Fortran * As we discussed already, there is no straight forward approach with CUDA Fortran. * The HIP functions are callable from C and with `extern C` are callable from Fortran * Procedure: * Port CUDA Fortran code to HIP kernels in C++. The hipfort helps to call some HIP calls from Fortran. * Wrap the kernel launch in C function * Call the C function from Fortran through Fortran 2003 C binding, using pointers etc. ### Exercise: SAXPY CUDA Fortran We have the following example. SAXPY code in CUDA Fortran. In this case, to hipify the code, we follow this procedure. ```bash= $ cd ${rootdir}/porting/codes/cuda_fortran_saxpy/cuda $ ls main.cuf cat main.cuf ``` ```fortran= module mathOps contains attributes(global) subroutine saxpy(x, y, a) implicit none real :: x(:), y(:) real, value :: a integer :: i, n n = size(x) i = blockDim%x * (blockIdx%x - 1) + threadIdx%x if (i <= n) y(i) = y(i) + a*x(i) end subroutine saxpy end module mathOps program testSaxpy use mathOps use cudafor implicit none integer, parameter :: N = 40000 real :: x(N), y(N), a real, device :: x_d(N), y_d(N) type(dim3) :: grid, tBlock tBlock = dim3(256,1,1) grid = dim3(ceiling(real(N)/tBlock%x),1,1) x = 1.0; y = 2.0; a = 2.0 x_d = x y_d = y call saxpy<<<grid, tBlock>>>(x_d, y_d, a) y = y_d write(*,*) 'Max error: ', maxval(abs(y-4.0)) end program testSaxpy ``` * Compile the code and submit ```bash= ./compile.sh sbatch sub.sh ``` * Check the out* and error* files ```bash= cat out_* Max error: 0.000000 cat error* real 0m0.404s ... 
``` #### Create the HIP kernel * Original kernel ```fortran= i = blockDim%x * (blockIdx%x - 1) + threadIdx%x if (i <= n) y(i) = y(i) + a*x(i) ``` * HIP kernel ```cpp= __global__ void saxpy(float *y, float *x, float a, int n) { size_t i = blockDim.x * blockIdx.x + threadIdx.x; if (i < n) y[i] = y[i] + a*x[i]; } ``` ##### Observations * \__global__ means that the function will be executed on the GPU and it will be called from the host * In Fortran the variables such as *blockDim%x* are used in C/C++ as *blockDim.x*. This means that you have to change all these variables but a find and replace through __sed__ could be easy * Using arrays also is different for example `y(i)` becomes `y[i]` which again __sed__ could help * Overall we need to be careful that we do not do any mistake, always check the results #### Wrap the kernel launch in C function ```c= extern "C" { void launch(float **dout, float **da, float db, int N) { dim3 tBlock(256,1,1); dim3 grid(ceil((float)N/tBlock.x),1,1); hipLaunchKernelGGL((saxpy), grid, tBlock, 0, 0, *dout, *da, db, N); } } ``` #### The new Fortran 2003 code ```fortran= program testSaxpy use iso_c_binding use hipfort use hipfort_check implicit none interface subroutine launch(y,x,b,N) bind(c) use iso_c_binding implicit none type(c_ptr) :: y,x integer, value :: N real, value :: b end subroutine end interface type(c_ptr) :: dx = c_null_ptr type(c_ptr) :: dy = c_null_ptr integer, parameter :: N = 40000 integer, parameter :: bytes_per_element = 4 integer(c_size_t), parameter :: Nbytes = N*bytes_per_element real, allocatable,target,dimension(:) :: x, y real, parameter :: a=2.0 real :: x_d(N), y_d(N) call hipCheck(hipMalloc(dx,Nbytes)) call hipCheck(hipMalloc(dy,Nbytes)) allocate(x(N)) allocate(y(N)) x = 1.0;y = 2.0 call hipCheck(hipMemcpy(dx, c_loc(x), Nbytes, hipMemcpyHostToDevice)) call hipCheck(hipMemcpy(dy, c_loc(y), Nbytes, hipMemcpyHostToDevice)) call launch(dy, dx, a, N) call hipCheck(hipDeviceSynchronize()) call hipCheck(hipMemcpy(c_loc(y), dy, Nbytes, hipMemcpyDeviceToHost)) write(*,*) 'Max error: ', maxval(abs(y-4.0)) call hipCheck(hipFree(dx)) call hipCheck(hipFree(dy)) deallocate(x) deallocate(y) end program testSaxpy ``` #### Makefile * hipfort provides a Makefile called _Makefile.hipfort_ ```cmake= export HIPFORT_HOME=${ROCM_PATH}/hipfort/ include ${HIPFORT_HOME}/bin/Makefile.hipfort OUTPUT_DIR ?= $(PWD) APP = $(OUTPUT_DIR)/saxpy .DEFAULT_GOAL := all all: $(APP) $(APP): $(OUTPUT_DIR)/main.o $(OUTPUT_DIR)/hipsaxpy.o $(FC) $^ $(LINKOPTS) -o $(APP) $(OUTPUT_DIR)/main.o: main.f03 $(FC) -c $^ -o $(OUTPUT_DIR)/main.o $(OUTPUT_DIR)/hipsaxpy.o: hipsaxpy.cpp $(CXX) --x cu -c $^ -o $(OUTPUT_DIR)/hipsaxpy.o clean: rm -f $(APP) *.o *.mod *~ ``` __Tip:__ Not sure how safe it is but if all your cpp files had HIP calls under NVIDIA system, you could define `export HIPCC_COMPILE_FLAGS_APPEND="--x cu"` and not to modify the Makefile. Be careful as this can break something else. * Compile and submit ```bash= module load hip/4.0.0c make submit sub.sh ``` ## Gromacs GROMACS is a molecular dynamics package mainly designed for simulations of proteins, lipids, and nucleic acids. 
Do not follow these instructions as it could take a long time, they are documented to help you in your case ### Download Gromacs and uncompress: ```bash= wget https://ftp.gromacs.org/gromacs/gromacs-2021.tar.gz tar zxvf gromacs-2021.tar.gz cd gromacs-2021 ls admin api AUTHORS build cmake CMakeLists.txt computed_checksum COPYING CPackInit.cmake CTestConfig.cmake docs INSTALL python_packaging README scripts share src tests ``` ### Hipify Let's hipify the application automatically with the __hipconvertinplace-perl.sh__ script ```bash= cd src hipconvertinplace-perl.sh . info: converted 10 CUDA->HIP refs ( error:0 init:0 version:0 device:3 context:0 module:0 memory:4 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:0 include_cuda_main_header:0 type:1 literal:0 numeric_literal:2 define:0 extern_shared:0 kernel_launch:0 ) warn:0 LOC:89 in './gromacs/gpu_utils/tests/devicetransfers.cu' info: converted 13 CUDA->HIP refs ( error:0 init:0 version:0 device:0 context:0 module:0 memory:2 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:0 include_cuda_main_header:0 type:2 literal:0 numeric_literal:8 define:1 extern_shared:0 kernel_launch:0 ) warn:0 LOC:126 in './gromacs/gpu_utils/pinning.cu' info: converted 12 CUDA->HIP refs ( error:0 init:0 version:0 device:0 context:0 module:0 memory:6 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:0 include_cuda_main_header:0 type:3 literal:0 numeric_literal:0 define:3 extern_shared:0 kernel_launch:0 ) warn:0 LOC:113 in './gromacs/gpu_utils/pmalloc_cuda.cu' warn:0 LOC:113 in './gromacs/gpu_utils/pmalloc_cuda.cu' warning: ./gromacs/gpu_utils/gpu_utils.cu:#69 : static bool cudaProfilerRun = ((getenv("NVPROF_ID") != nullptr)); warning: ./gromacs/gpu_utils/gpu_utils.cu:#90 : isPinned = (memoryAttributes.type == cudaMemoryTypeHost); warning: ./gromacs/gpu_utils/gpu_utils.cu:#114 : if (cudaProfilerRun) warning: ./gromacs/gpu_utils/gpu_utils.cu:#126 : if (cudaProfilerRun) warning: ./gromacs/gpu_utils/gpu_utils.cu:#143 : if (cudaProfilerRun) warning: ./gromacs/gpu_utils/gpu_utils.cu:#154 : * \param[in] cudaCallName name of CUDA peer access call warning: ./gromacs/gpu_utils/gpu_utils.cu:#160 : const char* cudaCallName) warning: ./gromacs/gpu_utils/gpu_utils.cu:#165 : gmx::formatString("%s from GPU %d to GPU %d failed", cudaCallName, gpuA, gpuB); warning: ./gromacs/gpu_utils/gpu_utils.cu:#175 : gpuA, gpuB, cudaCallName, gmx::getDeviceErrorString(stat).c_str()); info: converted 32 CUDA->HIP refs ( error:1 init:0 version:0 device:6 context:0 module:0 memory:0 virtual_memory:0 addressing:2 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:4 graphics:0 profiler:4 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:1 
include_cuda_main_header:0 type:7 literal:0 numeric_literal:7 define:0 extern_shared:0 kernel_launch:0 ) warn:9 LOC:239 in './gromacs/gpu_utils/gpu_utils.cu' info: converted 13 CUDA->HIP refs ( error:0 init:0 version:0 device:1 context:0 module:0 memory:0 virtual_memory:0 addressing:0 stream:5 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:0 include_cuda_main_header:0 type:4 literal:0 numeric_literal:2 define:1 extern_shared:0 kernel_launch:0 ) warn:0 LOC:103 in './gromacs/gpu_utils/device_stream.cu' warning: ./gromacs/hardware/device_management.cu:#137 : // it is enough to check for cudaErrorDevicesUnavailable only here because warning: ./gromacs/hardware/device_management.cu:#139 : if (cu_err == cudaErrorDevicesUnavailable) ... info: converted 3 CUDA->HIP refs ( error:0 init:0 version:0 device:0 context:0 module:0 memory:0 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:3 include:0 include_cuda_main_header:0 type:0 literal:0 numeric_literal:0 define:0 extern_shared:0 kernel_launch:0 ) warn:0 LOC:320 in './gromacs/ewald/pme_spread.cu' warning: ./gromacs/ewald/pme_solve.cu:260: unsupported device function "__shfl_down_sync": virxx += __shfl_down_sync(activeMask, virxx, 1, width); warning: ./gromacs/ewald/pme_solve.cu:261: unsupported device function "__shfl_up_sync": viryy += __shfl_up_sync(activeMask, viryy, 1, width); ... info: converted 1 CUDA->HIP refs ( error:0 init:0 version:0 device:0 context:0 module:0 memory:0 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:0 include:0 include_cuda_main_header:0 type:0 literal:0 numeric_literal:0 define:1 extern_shared:0 kernel_launch:0 ) warn:0 LOC:2709 in './external/googletest/googletest/include/gtest/internal/gtest-port.h' info: TOTAL-converted 377 CUDA->HIP refs ( error:9 init:0 version:5 device:27 context:0 module:0 memory:29 virtual_memory:0 addressing:2 stream:18 event:27 external_resource_interop:0 stream_memory:0 execution:3 graph:0 occupancy:0 texture:3 surface:0 peer:4 graphics:0 profiler:4 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:18 device_library:0 device_function:38 include:1 include_cuda_main_header:5 type:81 literal:0 numeric_literal:85 define:11 extern_shared:7 kernel_launch:0 ) warn:45 LOC:869044 warning: unconverted cudaProfilerRun : 4 warning: unconverted cudaCallName : 4 warning: unconverted cudaErrorDevicesUnavailable : 2 warning: unconverted cudaMemoryTypeHost : 1 kernels (0 total) : hipError_t 47 hipSuccess 46 hipEventDestroy 8 hipMemcpyAsync 8 hipSetDevice 8 HIP_DYNAMIC_SHARED 7 hipFuncSetCacheConfig 6 hipErrorInvalidValue 5 ... hipStream_t 4 hipHostMalloc 4 hipGetDevice 4 hipStreamQuery 4 hip_runtime 4 hipStreamSynchronize 4 ... hipHostMallocDefault 2 hipProfilerStop 2 hipPointerAttribute_t 2 hipDeviceEnablePeerAccess 2 ... 
hipDeviceGetStreamPriorityRange 1 hipGetErrorName 1 hipPeekAtLastError 1 hipErrorInvalidDeviceFunction 1 HIPFFT_SUCCESS 1 ``` #### Issues 1. __Description:__ We should check all the warnings as these declare that something was not translated, it does not mean for sure that this is a mistake. If a developer declares a variable called cudaXXX the tool will report a warning but it is expected. ```bash= warning: ./gromacs/gpu_utils/gpu_utils.cu:#69 : static bool cudaProfilerRun = ((getenv("NVPROF_ID") != nullptr)); warning: ./gromacs/gpu_utils/gpu_utils.cu:#90 : isPinned = (memoryAttributes.type == cudaMemoryTypeHost); warning: ./gromacs/gpu_utils/gpu_utils.cu:#114 : if (cudaProfilerRun) warning: ./gromacs/gpu_utils/gpu_utils.cu:#126 : if (cudaProfilerRun) warning: ./gromacs/gpu_utils/gpu_utils.cu:#143 : if (cudaProfilerRun) warning: ./gromacs/gpu_utils/gpu_utils.cu:#154 : * \param[in] cudaCallName name of CUDA peer access call warning: ./gromacs/gpu_utils/gpu_utils.cu:#160 : const char* cudaCallName) warning: ./gromacs/gpu_utils/gpu_utils.cu:#165 : gmx::formatString("%s from GPU %d to GPU %d failed", cudaCallName, gpuA, gpuB); warning: ./gromacs/gpu_utils/gpu_utils.cu:#175 : gpuA, gpuB, cudaCallName, gmx::getDeviceErrorString(stat).c_str()); ``` ```bash= warning: ./gromacs/hardware/device_management.cu:#137 : // it is enough to check for cudaErrorDevicesUnavailable only here because warning: ./gromacs/hardware/device_management.cu:#139 : if (cu_err == cudaErrorDevicesUnavailable) ``` ```bash= warn:45 LOC:869044 warning: unconverted cudaProfilerRun : 4 warning: unconverted cudaCallName : 4 warning: unconverted cudaErrorDevicesUnavailable : 2 warning: unconverted cudaMemoryTypeHost : 1 ``` #### Solution: Check the files with the warnings: * For example the warning of cudaProfilerRun is not actually a serious issue as it is a variable declared by the developers ```bash static bool cudaProfilerRun ``` * Similar for ```bash const char* cudaCallName ``` * About _cudaErrorDevicesUnavailable_ we can see from the web page https://rocmdocs.amd.com/en/latest/Programming_Guides/CUDAAPIHIPTEXTURE.html that there is no specific HIP call. ```bash= if (cu_err == cudaErrorDevicesUnavailable) { return DeviceStatus::Unavailable; } ``` However, if we try to track the call path from the code, we can see: ```bash= cu_err == cudaErrorDevicesUnavailable) └── cu_err = checkCompiledTargetCompatibility(deviceInfo.id, deviceInfo.prop); └── static cudaError_t checkCompiledTargetCompatibility(int deviceId, const cudaDeviceProp& deviceProp) { cudaFuncAttributes attributes; cudaError_t stat = cudaFuncGetAttributes(&attributes, dummy_kernel); ... return stat; } ``` Thus, the returned value is from the call to __cudaFuncAttributes__. From the PDF of HIP API which is more updated from the web site, we can see for example, for v4.0, see here: https://github.com/RadeonOpenCompute/ROCm/blob/master/HIP-API_Guide_v4.0.pdf where we can find that ```bash= 4.8.2.11 hipFuncGetAttributes() Returns hipSuccess, hipErrorInvalidValue, hipErrorInvalidDeviceFunction ``` Thus we can replace in the file _.gromacs/hardware/device_management.cu_ manually the __cudaErrorDevicesUnavailable__ with __hipErrorInvalidDeviceFunction__ * About the warning: ```bash warning: unconverted cudaMemoryTypeHost : 1 ``` We can see from the HIP programming guide PDF that there is __hipMemoryTypeHost__ and we are not sure why it was not converted, it could be a bug, so we manually do the appropriate modification in the file ./gromacs/gpu_utils/gpu_utils.cu 2. 
__Description:__ The Warp-Level primitives are not supported by HIP (yet) ```bash warning: ./gromacs/ewald/pme_solve.cu:260: unsupported device function "__shfl_down_sync": virxx += __shfl_down_sync(activeMask, virxx, 1, width); ``` Github issue: https://github.com/ROCm-Developer-Tools/HIP/issues/1491 __Solution:__ Change the calls of __shfl\_*\_sync__ to __shfl\_*__ for example __shfl_down_sync to __shfl_down 3. __Description:__ Gromacs uses FFT __Solution:__ The hipFFT seems not to be able to compile on NVIDIA systems, it is possible with the hip/4.0.0 but the explanation is different. 4. __Description:__ Error: _src/gromacs/utility/cuda_version_information.cu(49): error: identifier "hipDriverGetVersion" is undefined_ Code: ```cpp= #include "gmxpre.h" #include "cuda_version_information.h" #include "gromacs/utility/stringutil.h" namespace gmx { std::string getCudaDriverVersionString() { int cuda_driver = 0; if (hipDriverGetVersion(&cuda_driver) != hipSuccess) { return "N/A"; } return formatString("%d.%d", cuda_driver / 1000, cuda_driver % 100); } std::string getCudaRuntimeVersionString() { int cuda_runtime = 0; if (hipRuntimeGetVersion(&cuda_runtime) != hipSuccess) { return "N/A"; } return formatString("%d.%d", cuda_runtime / 1000, cuda_runtime % 100); } } // namespace gmx ``` __Solution:__ Add `#include "hip/hip_runtime.h"` ##### Compilation This is an example it does not mean that this is the best way ```bash= CXX=/appl/opt/rocm/rocm-4.0.0/hip/bin/hipcc cmake -DGMX_GPU=CUDA .. make ``` ## CMAKE One cmake file, called `cmake/gmxManageNvccConfig.cmake` from Gromacs is the following: ```cmake= # # This file is part of the GROMACS molecular simulation package. # # Copyright (c) 2012,2013,2014,2015,2016 by the GROMACS development team. # Copyright (c) 2017,2018,2019,2020, by the GROMACS development team, led by # Mark Abraham, David van der Spoel, Berk Hess, and Erik Lindahl, # and including many others, as listed in the AUTHORS file in the # top-level source directory and at http://www.gromacs.org. ... # set up host compiler and its options if(CUDA_HOST_COMPILER_CHANGED) set(CUDA_HOST_COMPILER_OPTIONS "") if(APPLE AND CMAKE_C_COMPILER_ID MATCHES "GNU") # Some versions of gcc-4.8 and gcc-4.9 have produced errors # (in particular on OS X) if we do not use # -D__STRICT_ANSI__. It is harmless, so we might as well add # it for all versions. list(APPEND CUDA_HOST_COMPILER_OPTIONS "-D__STRICT_ANSI__") endif() work_around_glibc_2_23() set(CUDA_HOST_COMPILER_OPTIONS "${CUDA_HOST_COMPILER_OPTIONS}" CACHE STRING "Options for nvcc host compiler (do not edit!).") mark_as_advanced(CUDA_HOST_COMPILER CUDA_HOST_COMPILER_OPTIONS) endif() if (GMX_CUDA_TARGET_SM OR GMX_CUDA_TARGET_COMPUTE) set(GMX_CUDA_NVCC_GENCODE_FLAGS) set(_target_sm_list ${GMX_CUDA_TARGET_SM}) foreach(_target ${_target_sm_list}) list(APPEND GMX_CUDA_NVCC_GENCODE_FLAGS "-gencode;arch=compute_${_target},code=sm_${_target}") endforeach() set(_target_compute_list ${GMX_CUDA_TARGET_COMPUTE}) foreach(_target ${_target_compute_list}) list(APPEND GMX_CUDA_NVCC_GENCODE_FLAGS "-gencode;arch=compute_${_target},code=compute_${_target}") endforeach() else() if(CUDA_VERSION VERSION_LESS "11.0") list (APPEND GMX_CUDA_NVCC_GENCODE_FLAGS "-gencode;arch=compute_30,code=sm_30") endif() list (APPEND GMX_CUDA_NVCC_GENCODE_FLAGS "-gencode;arch=compute_35,code=sm_35") ... 
list (APPEND GMX_CUDA_NVCC_GENCODE_FLAGS "-gencode;arch=compute_70,code=compute_70") if(NOT CUDA_VERSION VERSION_LESS "10.0") list (APPEND GMX_CUDA_NVCC_GENCODE_FLAGS "-gencode;arch=compute_75,code=compute_75") endif() if(NOT CUDA_VERSION VERSION_LESS "11.0") list (APPEND GMX_CUDA_NVCC_GENCODE_FLAGS "-gencode;arch=compute_80,code=compute_80") endif() endif() if((_cuda_nvcc_executable_or_flags_changed OR CUDA_HOST_COMPILER_CHANGED OR NOT GMX_NVCC_WORKS) AND NOT WIN32) message(STATUS "Check for working NVCC/C++ compiler combination with nvcc '${CUDA_NVCC_EXECUTABLE}'") execute_process(COMMAND ${CUDA_NVCC_EXECUTABLE} -ccbin ${CUDA_HOST_COMPILER} -c ${CUDA_NVCC_FLAGS} ${CUDA_NVCC_FLAGS_${_build_type}} ${CMAKE_SOURCE_DIR}/cmake/TestCUDA.cu RESULT_VARIABLE _cuda_test_res OUTPUT_VARIABLE _cuda_test_out ERROR_VARIABLE _cuda_test_err OUTPUT_STRIP_TRAILING_WHITESPACE) ... endif() # GMX_CHECK_NVCC macro(GMX_SET_CUDA_NVCC_FLAGS) set(CUDA_NVCC_FLAGS "${GMX_CUDA_NVCC_FLAGS};${CUDA_NVCC_FLAGS}") endmacro() function(gmx_cuda_add_library TARGET) add_definitions(-DHAVE_CONFIG_H) # Source files generated by NVCC can include gmxmpi.h, and so # need access to thread-MPI. include_directories(SYSTEM ${PROJECT_SOURCE_DIR}/src/external/thread_mpi/include) # Source files can also contain topology related files and need access to # the remaining external headers include_directories(SYSTEM ${PROJECT_SOURCE_DIR}/src/external) # Now add all the compilation options gmx_cuda_target_compile_options(CUDA_${TARGET}_CXXFLAGS) list(APPEND CMAKE_CXX_FLAGS ${CUDA_${TARGET}_CXXFLAGS}) foreach(build_type ${build_types_with_explicit_flags}) list(APPEND CMAKE_CXX_FLAGS_${build_type} ${CUDA_${TARGET}_CXXFLAGS_${build_type}}) endforeach() cuda_add_library(${TARGET} ${ARGN}) endfunction() ``` ### Hipify ```bash= hipify-cmakefile cmake/gmxManageNvccConfig.cmake > cmake/gmxManageHipConfig.cmake warning: cmake/gmxManageNvccConfig.cmake:#38 : unsupported macro/option : # - use the CUDA_HOST_COMPILER if defined by the user, otherwise warning: cmake/gmxManageNvccConfig.cmake:#39 : unsupported macro/option : # - check if nvcc works with CUDA_HOST_COMPILER and the generated nvcc and C++ flags warning: cmake/gmxManageNvccConfig.cmake:#42 : unsupported macro/option : # * CUDA_HOST_COMPILER_OPTIONS - the full host-compiler related option list passed to nvcc warning: cmake/gmxManageNvccConfig.cmake:#44 : unsupported macro/option : # Note that from CMake 2.8.10 FindCUDA defines CUDA_HOST_COMPILER internally, warning: cmake/gmxManageNvccConfig.cmake:#59 : unsupported macro/option : list(APPEND CUDA_HOST_COMPILER_OPTIONS "-D_FORCE_INLINES") warning: cmake/gmxManageNvccConfig.cmake:#60 : unsupported macro/option : set(CUDA_HOST_COMPILER_OPTIONS ${CUDA_HOST_COMPILER_OPTIONS} PARENT_SCOPE) warning: cmake/gmxManageNvccConfig.cmake:#64 : unsupported macro/option : gmx_check_if_changed(CUDA_HOST_COMPILER_CHANGED CUDA_HOST_COMPILER) warning: cmake/gmxManageNvccConfig.cmake:#67 : unsupported macro/option : if(CUDA_HOST_COMPILER_CHANGED) warning: cmake/gmxManageNvccConfig.cmake:#68 : unsupported macro/option : set(CUDA_HOST_COMPILER_OPTIONS "") warning: cmake/gmxManageNvccConfig.cmake:#75 : unsupported macro/option : list(APPEND CUDA_HOST_COMPILER_OPTIONS "-D__STRICT_ANSI__") warning: cmake/gmxManageNvccConfig.cmake:#80 : unsupported macro/option : set(CUDA_HOST_COMPILER_OPTIONS "${CUDA_HOST_COMPILER_OPTIONS}" warning: cmake/gmxManageNvccConfig.cmake:#83 : unsupported macro/option : mark_as_advanced(CUDA_HOST_COMPILER CUDA_HOST_COMPILER_OPTIONS) warning: 
cmake/gmxManageNvccConfig.cmake:#178 : unsupported macro/option : list(APPEND GMX_CUDA_NVCC_FLAGS "${CUDA_HOST_COMPILER_OPTIONS}")
warning: cmake/gmxManageNvccConfig.cmake:#210 : unsupported macro/option : if((_cuda_nvcc_executable_or_flags_changed OR CUDA_HOST_COMPILER_CHANGED OR NOT GMX_NVCC_WORKS) AND NOT WIN32)
warning: cmake/gmxManageNvccConfig.cmake:#212 : unsupported macro/option : execute_process(COMMAND ${CUDA_NVCC_EXECUTABLE} -ccbin ${CUDA_HOST_COMPILER} -c ${HIP_NVCC_FLAGS} ${HIP_NVCC_FLAGS_${_build_type}} ${CMAKE_SOURCE_DIR}/cmake/TestCUDA.cu
```

The new CMake file looks like:

```cmake=
if(HIP_VERSION VERSION_LESS "11.0")
    list (APPEND GMX_CUDA_NVCC_GENCODE_FLAGS "-gencode;arch=compute_30,code=sm_30")
endif()
list (APPEND GMX_CUDA_NVCC_GENCODE_FLAGS "-gencode;arch=compute_35,code=sm_35")
...
if(NOT HIP_VERSION VERSION_LESS "11.0")
    # Requesting sm or compute 35, 37, or 50 triggers deprecation messages with
    # nvcc 11.0, which we need to suppress for use in CI
    list (APPEND GMX_CUDA_NVCC_GENCODE_FLAGS "-Wno-deprecated-gpu-targets")
endif()
...
macro(GMX_SET_CUDA_NVCC_FLAGS)
    set(HIP_NVCC_FLAGS "${GMX_CUDA_NVCC_FLAGS};${HIP_NVCC_FLAGS}")
endmacro()
```

* The tool will be improved; you can probably also add more variables by editing the file /appl/opt/rocm/rocm-4.0.0c/hip/bin/hipify-cmakefile. There will also be many improvements regarding CMake: https://github.com/ROCm-Developer-Tools/HIP/issues/2158#issuecomment-737222202

## Exercises

### Vector Addition

* Hipify the code in the repository: https://github.com/csc-training/hip/tree/main/porting/codes/Vector_Addition

## Known issues

- Some HIP libraries need dependencies that are not available on the NVIDIA platform; this needs to be investigated.
- If your CUDA kernel launch includes a `dim3()` call, hipify will convert it incorrectly. [Issue in Github](https://github.com/ROCm-Developer-Tools/HIPIFY/issues/246). It was fixed on February 24th but is not yet installed on Puhti.
- In CUDA, `__CUDACC__` is defined by `nvcc`, but the HIP equivalent `__HIPCC__` is defined in `hip_runtime.h`. Thus, if code uses `__CUDACC__` without `#include <cuda_runtime_api.h>`, one needs to manually add `#include <hip/hip_runtime.h>` so that the automatically converted `__HIPCC__` gets defined. [Issue in Github](https://github.com/ROCm-Developer-Tools/HIP/issues/29)

## Feedback:

* **Q**: Why would rocBLAS not work for NVIDIA if it's written in HIP? **Answer**: hipBLAS is the HIP wrapper library and works on both platforms; rocBLAS is the ROCm (AMD) implementation, so it targets AMD hardware only.
* Start interactive jobs with `sinteractive -i -g 1`. This will ask for an interactive session and ask about the project. I do not recommend using this.
* In the submission scripts `sub.sh`, change the account line to `#SBATCH --account=project_2000745`
* **Q**: Would it be possible for CSC to prepare a HIP-enabled container that can be used for CUDA->HIP porting?
* **Q**: How are multidimensional arrays handled in `hipfort`? Would one need to provide a flattened array?