# Bare Metal, Docker and Singularity Performance Tests

-----

## Test Environment

* Hardware
    * CPU
        * 2 × Intel Xeon Silver 4114
    * Memory
        * 4 × 32 GB SK HYNIX HMA84GR7CJR4N-XN
    * Storage
        * 2 × 480 GB SAMSUNG MZ7LH480HBHQ0D3 (RAID-1)
        * 2 × 1 TB SEAGATE ST1000NX0443 (RAID-1)
    * Network
        * BROADCOM BCM57416 Ethernet
        * NVIDIA Mellanox ConnectX-5
* Software
    * CentOS Linux 8.4.2105
        * Fully Up-to-date
        * SELinux disabled
    * Compiler Toolchain
        * Package group "Development Tools" installed
            * gcc 8.4.1 20200928 (Red Hat 8.4.1-1)
            * g++ 8.4.1 20200928 (Red Hat 8.4.1-1)
            * gfortran 8.4.1 20200928 (Red Hat 8.4.1-1)
        * go 1.16.6
    * NVIDIA MLNX OFED 5.4-1.0.3.0
    * Open MPI 4.0.6
    * UCX 1.11.0
    * Docker 20.10.7
    * Singularity 3.8.1

-----

## Test Item 1: High Performance Linpack (HPL)

* OpenBLAS 0.3.17

Build notes:

```bash
# Single-threaded (Choose this for MPI applications)
make TARGET=HASWELL DYNAMIC_ARCH=0 CC=gcc FC=gfortran USE_THREAD=0 USE_LOCKING=1 USE_OPENMP=0 NO_WARMUP=1 NO_AFFINITY=1 COMMON_OPT=-O3 PREFIX=/opt/openblas-0.3.17/gnu-8.4.0

# Multi-threaded
make TARGET=HASWELL DYNAMIC_ARCH=0 CC=gcc FC=gfortran USE_THREAD=1 USE_LOCKING=1 USE_OPENMP=0 NO_WARMUP=1 NO_AFFINITY=1 COMMON_OPT=-O3 PREFIX=/opt/openblas-0.3.17/gnu-8.4.0
```

* HPL 2.3

Build notes:

```diff
diff -Naurd setup/Make.Linux_ATHLON_FBLAS Make.Linux_Skylake_FBLAS
--- setup/Make.Linux_ATHLON_FBLAS 1970-01-01 13:00:00.000000000 +0800
+++ Make.Linux_Skylake_FBLAS 2021-08-11 16:54:41.157664767 +0800
@@ -61,13 +61,13 @@
 # - Platform identifier ------------------------------------------------
 # ----------------------------------------------------------------------
 #
-ARCH = Linux_ATHLON_FBLAS
+ARCH = Linux_Skylake_FBLAS
 #
 # ----------------------------------------------------------------------
 # - HPL Directory Structure / HPL library ------------------------------
 # ----------------------------------------------------------------------
 #
-TOPdir = $(HOME)/hpl
+TOPdir = /home/alice/hpl-src/hpl-2.3
 INCdir = $(TOPdir)/include
 BINdir = $(TOPdir)/bin/$(ARCH)
 LIBdir = $(TOPdir)/lib/$(ARCH)
@@ -81,9 +81,9 @@
 # header files, MPlib is defined to be the name of the library to be
 # used. The variable MPdir is only used for defining MPinc and MPlib.
 #
-MPdir = /usr/local/mpi
+MPdir = /opt/openmpi-4.0.6/gnu-8.4.0
 MPinc = -I$(MPdir)/include
-MPlib = $(MPdir)/lib/libmpich.a
+MPlib = $(MPdir)/lib/libmpi.so
 #
 # ----------------------------------------------------------------------
 # - Linear Algebra library (BLAS or VSIPL) -----------------------------
@@ -92,9 +92,9 @@
 # header files, LAlib is defined to be the name of the library to be
 # used. The variable LAdir is only used for defining LAinc and LAlib.
 #
-LAdir = $(HOME)/netlib/ARCHIVES/Linux_ATHLON
+LAdir = /opt/openblas-0.3.17/gnu-8.4.0
 LAinc =
-LAlib = $(LAdir)/libf77blas.a $(LAdir)/libatlas.a
+LAlib = $(LAdir)/lib/libopenblas.a
 #
 # ----------------------------------------------------------------------
 # - F77 / C interface --------------------------------------------------
@@ -156,7 +156,7 @@
 # *) call the BLAS Fortran 77 interface,
 # *) not display detailed timing information.
 #
-HPL_OPTS =
+HPL_OPTS = -DHPL_DETAILED_TIMING
 #
 # ----------------------------------------------------------------------
 #
@@ -166,15 +166,15 @@
 # - Compilers / linkers - Optimization flags ---------------------------
 # ----------------------------------------------------------------------
 #
-CC = /usr/bin/gcc
+CC = mpicc
 CCNOOPT = $(HPL_DEFS)
-CCFLAGS = $(HPL_DEFS) -fomit-frame-pointer -O3 -funroll-loops -W -Wall
+CCFLAGS = $(HPL_DEFS) -march=haswell -mtune=haswell -O3 -Wall -Wextra
 #
-LINKER = /usr/bin/g77
+LINKER = mpicc
 LINKFLAGS = $(CCFLAGS)
 #
 ARCHIVER = ar
-ARFLAGS = r
+ARFLAGS = rcs
 RANLIB = echo
 #
 # ----------------------------------------------------------------------
```

```bash
make arch=Linux_Skylake_FBLAS
```

For a dual-socket Intel Xeon Silver 4114 system, the theoretical peak double-precision floating-point performance is **576.0** GFLOPS, calculated as follows:

2 (Socket) × 10 (Core) × 1.8 (Frequency in GHz) × 1 (FMA Unit) × (2 × 8) (Double-precision FLOPS per Cycle) = **576.0** (GFLOPS)

The HPL runs below use only 16 of the 20 cores, so the efficiency figures are computed against the corresponding 16-core peak: 16 (Core) × 1.8 (Frequency in GHz) × 1 (FMA Unit) × (2 × 8) (Double-precision FLOPS per Cycle) = **460.8** (GFLOPS)

References:

* [Intel Xeon Silver 4114 (Intel)](https://ark.intel.com/content/www/us/en/ark/products/123550/intel-xeon-silver-4114-processor-13-75m-cache-2-20-ghz.html)
* [Intel Xeon Silver 4114 (Wikichip)](https://en.wikichip.org/wiki/intel/xeon_silver/4114)

### AVX512-enabled Build of OpenBLAS and HPL

OpenBLAS and HPL were built to target and tune for `skylake-avx512`.

Note: HPL was run on bare metal with only 16 cores.

```text
================================================================================
HPLinpack 2.3 -- High-Performance Linpack benchmark -- December 2, 2018
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      : 117120
NB     : 64
PMAP   : Row-major process mapping
P      : 4
Q      : 4
PFACT  : Right
NBMIN  : 4
NDIV   : 2
RFACT  : Crout
BCAST  : 1ringM
DEPTH  : 1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      117120    64     4     4            3587.95             2.9851e+02
HPL_pdgesv() start time Wed Aug 4 02:37:46 2021
HPL_pdgesv() end time   Wed Aug 4 03:37:34 2021

--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
Max aggregated wall time rfact . . . :               4.54
+ Max aggregated wall time pfact . . :               1.58
+ Max aggregated wall time mxswp . . :               0.64
Max aggregated wall time update  . . :            3581.40
+ Max aggregated wall time laswp . . :              81.90
Max aggregated wall time up tr sv  . :               0.96
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 1.35396531e-03 ...... PASSED
================================================================================

Finished 1 tests with the following results:
    1 tests completed and passed residual checks,
    0 tests completed and failed residual checks,
    0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
```

HPL Efficiency: 298.51 ÷ 460.8 × 100% = **64.78%**

### AVX2-enabled Build of OpenBLAS and HPL

OpenBLAS and HPL were built to target and tune for `haswell`.

Note: HPL was run on bare metal with only 16 cores.

```text
================================================================================
HPLinpack 2.3 -- High-Performance Linpack benchmark -- December 2, 2018
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      : 117120
NB     : 64
PMAP   : Row-major process mapping
P      : 4
Q      : 4
PFACT  : Right
NBMIN  : 4
NDIV   : 2
RFACT  : Crout
BCAST  : 1ringM
DEPTH  : 1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      117120    64     4     4            2770.75             3.8656e+02
HPL_pdgesv() start time Wed Aug 11 17:27:50 2021
HPL_pdgesv() end time   Wed Aug 11 18:14:01 2021

--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
Max aggregated wall time rfact . . . :               4.46
+ Max aggregated wall time pfact . . :               1.32
+ Max aggregated wall time mxswp . . :               0.43
Max aggregated wall time update  . . :            2766.19
+ Max aggregated wall time laswp . . :             137.95
Max aggregated wall time up tr sv  . :               0.83
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 1.42427217e-03 ...... PASSED
================================================================================

Finished 1 tests with the following results:
    1 tests completed and passed residual checks,
    0 tests completed and failed residual checks,
    0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
```

HPL Efficiency: 386.56 ÷ 460.8 × 100% = **83.89%**

### Configuration

```text
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
file         device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
117120       Ns
1            # of NBs
64           NBs
0            PMAP process mapping (0=Row-,1=Column-major)
5            # of process grids (P x Q)
4 4 4 4 4    Ps
4 4 4 4 4    Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
```

### Results

Because the AVX2-enabled build outperformed the AVX512-enabled build on bare metal (386.56 vs. 298.51 GFLOPS), OpenBLAS and HPL were built to target and tune for `haswell` instead of `skylake-avx512` for the runs below.

Note: HPL was run with only 16 cores.

HPL Run Type | HPL Performance (GFLOPS)
-------------|-------------------------
Bare Metal   | **386.37 (Baseline)** \*
\-           | 388.20
\-           | 386.00
\-           | 386.02
\-           | 385.85
\-           | 385.78
Docker       | **387.72 (+0.35%)** \*
\-           | ~~339.40~~ †
\-           | 388.55
\-           | 388.58
\-           | 388.02
\-           | 386.63
\-           | 386.82
Singularity  | **387.36 (+0.26%)** \*
\-           | ~~338.80~~ †
\-           | 387.54
\-           | 387.27
\-           | 387.29
\-           | 387.38
\-           | 387.34

\* Average of the 5 consecutive runs listed below it.

† After a bare metal host reboot, a value in this range could no longer be reproduced and was therefore discarded as an outlier.

-----

## Test Item 2: Weather Research and Forecasting (WRF) Model

* WRF 4.2.2
* WPS 4.2

The "NCEP GDAS/FNL 0.25-Degree Global Tropospheric Analyses" dataset was used as the global model data for this simulation.
### Configuration

The following namelist file was used:

```text
&share
 wrf_core = 'ARW'
 max_dom = 3
 start_date = '2021-07-21_00:00:00', '2021-07-21_00:00:00', '2021-07-21_00:00:00', '2021-07-21_00:00:00'
 end_date = '2021-07-25_00:00:00', '2021-07-25_00:00:00', '2021-07-25_00:00:00', '2021-07-25_00:00:00'
 interval_seconds = 21600
 io_form_geogrid = 2
 opt_output_from_geogrid_path = './'
/

&geogrid
 parent_id = 1, 1, 2, 3
 parent_grid_ratio = 1, 3, 5, 3
 i_parent_start = 1, 33, 63, 45
 j_parent_start = 1, 23, 41, 33
 e_we = 121, 172, 216, 331
 e_sn = 105, 130, 226, 442
 geog_data_res = 'gmted2010_30s+modis_15s_lake+modis_fpar+modis_lai+30s', 'gmted2010_30s+modis_15s_lake+modis_fpar+modis_lai+30s', 'gmted2010_30s+modis_15s_lake+modis_fpar+modis_lai+30s', 'gmted2010_30s+modis_15s_lake+modis_fpar+modis_lai+30s'
 dx = 45000
 dy = 45000
 map_proj = 'lambert'
 ref_lat = 27.5
 ref_lon = 121.0
 truelat1 = 30.0
 truelat2 = 25.0
 stand_lon = 121.0
 geog_data_path = '/mnt/geographic-data'
 opt_geogrid_tbl_path = './geogrid/'
/

&ungrib
 out_format = 'WPS'
 prefix = 'FILE'
/

&metgrid
 fg_name = 'FILE'
 io_form_metgrid = 2
 opt_output_from_metgrid_path = './'
 opt_metgrid_tbl_path = './metgrid/'
/

&time_control
 run_days = 0
 run_hours = 96
 run_minutes = 0
 run_seconds = 0
 start_year = 2021, 2021, 2021, 2021
 start_month = 7, 7, 7, 7
 start_day = 21, 21, 21, 21
 start_hour = 0, 0, 0, 0
 start_minute = 0, 0, 0, 0
 start_second = 0, 0, 0, 0
 end_year = 2021, 2021, 2021, 2021
 end_month = 7, 7, 7, 7
 end_day = 25, 25, 25, 25
 end_hour = 0, 0, 0, 0
 end_minute = 0, 0, 0, 0
 end_second = 0, 0, 0, 0
 interval_seconds = 21600
 input_from_file = .true., .true., .true., .true.
 history_interval = 60, 60, 60, 60
 frames_per_outfile = 24, 24, 24, 24
 restart = .false.
 restart_interval = 10080
 io_form_history = 2
 io_form_restart = 2
 io_form_input = 2
 io_form_boundary = 2
 auxinput4_inname = 'wrflowinp_d<domain>'
 auxinput4_interval = 360, 360, 360, 360
 io_form_auxinput4 = 2
/

&domains
 time_step = 180
 time_step_fract_num = 0
 time_step_fract_den = 1
 max_dom = 3
 e_we = 121, 172, 216, 331
 e_sn = 105, 130, 226, 442
 e_vert = 30, 30, 30, 30
 p_top_requested = 5000
 num_metgrid_levels = 34
 num_metgrid_soil_levels = 4
 dx = 45000, 15000, 3000, 1000
 dy = 45000, 15000, 3000, 1000
 grid_id = 1, 2, 3, 4
 parent_id = 1, 1, 2, 3
 i_parent_start = 1, 33, 63, 45
 j_parent_start = 1, 23, 41, 33
 parent_grid_ratio = 1, 3, 5, 3
 parent_time_step_ratio = 1, 3, 5, 3
 feedback = 0
 smooth_option = 2
 eta_levels = 1.0, 0.994, 0.982, 0.967, 0.949, 0.928, 0.906, 0.881, 0.855, 0.827, 0.798, 0.766, 0.734, 0.7, 0.665, 0.628, 0.59, 0.551, 0.511, 0.47, 0.427, 0.384, 0.339, 0.294, 0.247, 0.2, 0.151, 0.102, 0.051, 0.0
/

&physics
 mp_physics = 7, 7, 7, 7
 cu_physics = 1, 1, 1, 0
 ra_lw_physics = 5, 5, 5, 5
 ra_sw_physics = 5, 5, 5, 5
 bl_pbl_physics = 1, 1, 1, 1
 sf_sfclay_physics = 1, 1, 1, 1
 sf_surface_physics = 2, 2, 2, 2
 radt = 15, 15, 15, 15
 bldt = 0, 0, 0, 0
 cudt = 0, 0, 0, 0
 icloud = 3
 num_land_cat = 21
 sf_urban_physics = 0, 0, 0, 0
 cu_rad_feedback = .false., .false., .false., .false.
 grav_settling = 2, 2, 2, 2
 gsfcgce_2ice = 0
 gsfcgce_hail = 0
 iz0tlnd = 1
 kfeta_trigger = 2
 num_soil_layers = 4
 rdlai2d = .true.
 shcu_physics = 3, 3, 3, 3
 sst_update = 1
 swint_opt = 1
 topo_wind = 1, 1, 1, 1
 usemonalb = .true.
 ysu_topdown_pblmix = 1
/

&fdda
 grid_fdda = 1, 1, 0, 0
 gfdda_inname = 'wrffdda_d<domain>'
 gfdda_begin_h = 0, 0, 0, 0
 gfdda_end_h = 96, 96, 96, 96
 gfdda_interval_m = 360, 360, 360, 360
 fgdt = 0, 0, 0, 0
 if_no_pbl_nudging_uv = 0, 0, 0, 0
 if_no_pbl_nudging_t = 1, 1, 1, 1
 if_no_pbl_nudging_q = 1, 1, 1, 1
 if_zfac_uv = 1, 1, 1, 1
 k_zfac_uv = 7, 7, 7, 7
 if_zfac_t = 1, 1, 1, 1
 k_zfac_t = 7, 7, 7, 7
 if_zfac_q = 1, 1, 1, 1
 k_zfac_q = 7, 7, 7, 7
 guv = 0.0003, 0.0003, 0.0003, 0.0003
 gt = 0.0003, 0.0003, 0.0003, 0.0003
 gq = 0.0003, 0.0003, 0.0003, 0.0003
 if_ramping = 1
 dtramp_min = 60
 io_form_gfdda = 2
/

&dynamics
 hybrid_opt = 2
 w_damping = 0
 diff_opt = 1, 1, 1, 1
 km_opt = 4, 4, 4, 4
 diff_6th_opt = 2, 2, 2, 2
 diff_6th_factor = 0.12, 0.12, 0.12, 0.12
 base_temp = 290.0
 damp_opt = 3
 zdamp = 5000.0, 5000.0, 5000.0, 5000.0
 dampcoef = 0.2, 0.2, 0.2, 0.2
 khdif = 0.0, 0.0, 0.0, 0.0
 kvdif = 0.0, 0.0, 0.0, 0.0
 non_hydrostatic = .true., .true., .true., .true.
 moist_adv_opt = 1, 1, 1, 1
 scalar_adv_opt = 1, 1, 1, 1
 gwd_opt = 1, 1, 1, 1
 epssm = 0.2, 0.2, 0.2, 0.2
/

&bdy_control
 spec_bdy_width = 5
 spec_zone = 1
 relax_zone = 4
 specified = .true., .false., .false., .false.
 nested = .false., .true., .true., .true.
/

&grib2
/

&namelist_quilt
 nio_tasks_per_group = 0
 nio_groups = 1
/
```

### Results

Note: WRF was run with only 16 cores.

Note: For Singularity, the `--containall` command-line option must be specified; otherwise the WRF model sometimes crashes with error messages such as:

```text
malloc(): largebin double linked list corrupted (bk)
Program received signal SIGABRT: Process abort signal.
```

```text
malloc(): unsorted double linked list corrupted
Program received signal SIGABRT: Process abort signal.
```

WRF Run Type | WRF Wall Time (HH:MM:SS)
-------------|-------------------------
Bare Metal   | 07:03:46 (**Baseline**)
Docker       | 07:31:16 (**+6.49%**)
Singularity  | 06:40:48 (**-5.42%**)
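
The percentages in the WRF table can be cross-checked with a few lines of Python. This is only a minimal sketch, not part of the original test procedure: the helper names `to_seconds` and `relative_change` are hypothetical, and the wall times are copied from the table above.

```python
# Hypothetical helpers for re-deriving the WRF percentages from the table.

def to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS wall time into seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def relative_change(value: float, baseline: float) -> float:
    """Signed difference of value from baseline, in percent of baseline."""
    return (value - baseline) / baseline * 100.0


# Wall times as reported in the results table.
wall_times = {
    "Bare Metal": "07:03:46",
    "Docker": "07:31:16",
    "Singularity": "06:40:48",
}

baseline = to_seconds(wall_times["Bare Metal"])
for run_type, hms in wall_times.items():
    change = relative_change(to_seconds(hms), baseline)
    print(f"{run_type:<12} {hms} ({change:+.2f}%)")
```

For wall time, a negative percentage means the run finished faster than bare metal. The same `relative_change` helper applies to the HPL averages, where a positive percentage means higher throughput.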