The environment settings:

```
MPICH_GPU_SUPPORT_ENABLED=1
CRAY_ACC_FORCE_EARLY_INIT=1 # force init at program start of all devices
FI_CXI_OPTIMIZED_MRS="false"
MPICH_ABORT_ON_ERROR=1
FI_CXI_RX_MATCH_MODE=hybrid
MPICH_SMP_SINGLE_COPY_MODE=NONE
MPICH_ALLTOALL_INTRA_ALGORITHM=pairwise
FI_CXI_EQ_ACK_BATCH_SIZE=1
FI_CXI_DEFAULT_CQ_SIZE=1024
FI_CXI_OFLOW_BUF_SIZE=20971520
MPICH_MAX_THREAD_SAFETY=multiple
MPICH_OFI_NIC_POLICY=GPU
ROCFFT_RTC_CACHE_PATH=/tmp
# temporary fix for communication issues
MPICH_ASYNC_PROGRESS=1
```

Notes on some variables:

- FI_CXI_DEFAULT_CQ_SIZE: From Intel docs:

  ```
  In case you experience hangs when running with the CXI provider, or see
  messages about Cassini Event Queue overflow, try increasing the
  FI_CXI_DEFAULT_CQ_SIZE cvar to values ranging from 16384 to 131072.
  This is a known issue with the CXI provider.

  When using 4th Generation Intel® Xeon® Scalable Processors nodes in SNC4
  mode, the default CPU pinning (and in turn the nic assignment) is not
  correct for multiples of 6 ranks and the default GPU pinning is not correct
  for multiples of 8 ranks. In such cases, it is recommended to explicitly
  specify CPU, GPU, and NIC pinning using cvars.
  ```

  The default value on Frontier is 128K (131072).

- FI_CXI_RX_MATCH_MODE:

  ```
  Evidently, the FI_CXI_RX_MATCH_MODE by default is 'hardware' on Frontier.
  Curious, have you tried 'hybrid' mode and did you face the same issues?
  'hybrid' might be better as the switch to software mode is done on a rank
  by rank basis.

  I'd stick with "software" for now, especially on Crusher. Long term
  "hybrid" will probably be the ideal solution, but the transition from
  hardware to software matching still needs more testing.
  ```

Debug vars:

```
#MPICH_DBG=1
#MPICH_DBG_LEVEL=VERBOSE
#MPICH_DBG_CLASS=ALL
```
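
Since these are plain environment variables, one way to apply them is to export them in the job script before the launch command. The following is a minimal sketch, assuming a POSIX shell. The function name `set_comm_env`, its two optional arguments, and the commented launch line are hypothetical and not part of the original setup. Only a few of the variables are shown; the rest would be exported the same way.

```shell
#!/bin/sh
# Hypothetical helper: export a subset of the settings listed above.
# $1 = receive match mode (hardware, software, or hybrid); defaults to hybrid
# $2 = default completion queue size; defaults to 1024
set_comm_env() {
    export MPICH_GPU_SUPPORT_ENABLED=1
    export MPICH_OFI_NIC_POLICY=GPU
    export FI_CXI_RX_MATCH_MODE="${1:-hybrid}"
    export FI_CXI_DEFAULT_CQ_SIZE="${2:-1024}"
}

# Apply the values used in these notes.
set_comm_env hybrid 1024

# Print the two tunables so they end up in the job log.
printf 'FI_CXI_RX_MATCH_MODE=%s\n' "$FI_CXI_RX_MATCH_MODE"
printf 'FI_CXI_DEFAULT_CQ_SIZE=%s\n' "$FI_CXI_DEFAULT_CQ_SIZE"

# The launch line would follow here (application name is a placeholder):
# srun ./my_app
```

Keeping the match mode and queue size as arguments makes it easy to compare runs, for example `set_comm_env software 131072` to try software matching with the larger queue size mentioned in the notes.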