# Bare Metal, Docker and Singularity Performance Tests
-----
## Test Environment
* Hardware
    * CPU
        * 2 × Intel Xeon Silver 4114
    * Memory
        * 4 × 32 GB SK HYNIX HMA84GR7CJR4N-XN
    * Storage
        * 2 × 480 GB SAMSUNG MZ7LH480HBHQ0D3 (RAID-1)
        * 2 × 1 TB SEAGATE ST1000NX0443 (RAID-1)
    * Network
        * BROADCOM BCM57416 Ethernet
        * NVIDIA Mellanox ConnectX-5
* Software
    * CentOS Linux 8.4.2105
        * Fully Up-to-date
        * SELinux disabled
    * Compiler Toolchain
        * Package group "Development Tools" installed
        * gcc 8.4.1 20200928 (Red Hat 8.4.1-1)
        * g++ 8.4.1 20200928 (Red Hat 8.4.1-1)
        * gfortran 8.4.1 20200928 (Red Hat 8.4.1-1)
        * go 1.16.6
    * NVIDIA MLNX OFED 5.4-1.0.3.0
    * Open MPI 4.0.6
    * UCX 1.11.0
    * Docker 20.10.7
    * Singularity 3.8.1
-----
## Test Item 1: High Performance Linpack (HPL)
* OpenBLAS 0.3.17
Build notes:
```bash
# Single-threaded (Choose this for MPI applications)
make TARGET=HASWELL DYNAMIC_ARCH=0 CC=gcc FC=gfortran USE_THREAD=0 USE_LOCKING=1 USE_OPENMP=0 NO_WARMUP=1 NO_AFFINITY=1 COMMON_OPT=-O3 PREFIX=/opt/openblas-0.3.17/gnu-8.4.0
# Multi-threaded
make TARGET=HASWELL DYNAMIC_ARCH=0 CC=gcc FC=gfortran USE_THREAD=1 USE_LOCKING=1 USE_OPENMP=0 NO_WARMUP=1 NO_AFFINITY=1 COMMON_OPT=-O3 PREFIX=/opt/openblas-0.3.17/gnu-8.4.0
```
* HPL 2.3
Build notes:
```diff
diff -Naurd setup/Make.Linux_ATHLON_FBLAS Make.Linux_Skylake_FBLAS
--- setup/Make.Linux_ATHLON_FBLAS 1970-01-01 13:00:00.000000000 +0800
+++ Make.Linux_Skylake_FBLAS 2021-08-11 16:54:41.157664767 +0800
@@ -61,13 +61,13 @@
# - Platform identifier ------------------------------------------------
# ----------------------------------------------------------------------
#
-ARCH = Linux_ATHLON_FBLAS
+ARCH = Linux_Skylake_FBLAS
#
# ----------------------------------------------------------------------
# - HPL Directory Structure / HPL library ------------------------------
# ----------------------------------------------------------------------
#
-TOPdir = $(HOME)/hpl
+TOPdir = /home/alice/hpl-src/hpl-2.3
INCdir = $(TOPdir)/include
BINdir = $(TOPdir)/bin/$(ARCH)
LIBdir = $(TOPdir)/lib/$(ARCH)
@@ -81,9 +81,9 @@
# header files, MPlib is defined to be the name of the library to be
# used. The variable MPdir is only used for defining MPinc and MPlib.
#
-MPdir = /usr/local/mpi
+MPdir = /opt/openmpi-4.0.6/gnu-8.4.0
MPinc = -I$(MPdir)/include
-MPlib = $(MPdir)/lib/libmpich.a
+MPlib = $(MPdir)/lib/libmpi.so
#
# ----------------------------------------------------------------------
# - Linear Algebra library (BLAS or VSIPL) -----------------------------
@@ -92,9 +92,9 @@
# header files, LAlib is defined to be the name of the library to be
# used. The variable LAdir is only used for defining LAinc and LAlib.
#
-LAdir = $(HOME)/netlib/ARCHIVES/Linux_ATHLON
+LAdir = /opt/openblas-0.3.17/gnu-8.4.0
LAinc =
-LAlib = $(LAdir)/libf77blas.a $(LAdir)/libatlas.a
+LAlib = $(LAdir)/lib/libopenblas.a
#
# ----------------------------------------------------------------------
# - F77 / C interface --------------------------------------------------
@@ -156,7 +156,7 @@
# *) call the BLAS Fortran 77 interface,
# *) not display detailed timing information.
#
-HPL_OPTS =
+HPL_OPTS = -DHPL_DETAILED_TIMING
#
# ----------------------------------------------------------------------
#
@@ -166,15 +166,15 @@
# - Compilers / linkers - Optimization flags ---------------------------
# ----------------------------------------------------------------------
#
-CC = /usr/bin/gcc
+CC = mpicc
CCNOOPT = $(HPL_DEFS)
-CCFLAGS = $(HPL_DEFS) -fomit-frame-pointer -O3 -funroll-loops -W -Wall
+CCFLAGS = $(HPL_DEFS) -march=haswell -mtune=haswell -O3 -Wall -Wextra
#
-LINKER = /usr/bin/g77
+LINKER = mpicc
LINKFLAGS = $(CCFLAGS)
#
ARCHIVER = ar
-ARFLAGS = r
+ARFLAGS = rcs
RANLIB = echo
#
# ----------------------------------------------------------------------
```
```bash
make arch=Linux_Skylake_FBLAS
```
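For a dual-socket Intel Xeon Silver 4114 system, the theoretical peak double-precision floating-point performance is **576.0** GFLOPS:

2 (sockets) × 10 (cores per socket) × 1.8 (frequency in GHz) × 1 (FMA unit) × (2 × 8) (double-precision FLOPs per cycle) = **576.0** GFLOPS

The HPL runs in this section use only 16 of the 20 cores, so the efficiency figures are computed against the corresponding 16-core peak:

16 (cores) × 1.8 (frequency in GHz) × 1 (FMA unit) × (2 × 8) (double-precision FLOPs per cycle) = **460.8** GFLOPS
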
References:
* [Intel Xeon Silver 4114 (Intel)](https://ark.intel.com/content/www/us/en/ark/products/123550/intel-xeon-silver-4114-processor-13-75m-cache-2-20-ghz.html)
* [Intel Xeon Silver 4114 (Wikichip)](https://en.wikichip.org/wiki/intel/xeon_silver/4114)
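
The peak calculation above, and the 460.8 GFLOPS denominator used in the efficiency figures in this section, can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function and parameter names are illustrative, and the 1.8 GHz frequency and single FMA unit are taken from the formula above rather than measured.

```python
def peak_gflops(cores, ghz=1.8, fma_units=1, flops_per_fma_cycle=2 * 8):
    """Theoretical peak in GFLOPS: cores x GHz x FMA units x FLOPs per cycle."""
    return cores * ghz * fma_units * flops_per_fma_cycle


def efficiency_percent(measured_gflops, peak):
    """Measured HPL rate as a percentage of the theoretical peak."""
    return measured_gflops / peak * 100.0


full_node = peak_gflops(cores=2 * 10)  # both sockets, all 20 cores
used_cores = peak_gflops(cores=16)     # the 16 cores used in the HPL runs

print(f"full node    : {full_node:.1f} GFLOPS")
print(f"16 cores     : {used_cores:.1f} GFLOPS")
print(f"AVX-512 build: {efficiency_percent(298.51, used_cores):.2f}%")
print(f"AVX2 build   : {efficiency_percent(386.56, used_cores):.2f}%")
```

The two measured rates (298.51 and 386.56 GFLOPS) are the bare metal results reported in the subsections that follow.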
### AVX512-enabled Build of OpenBLAS and HPL
OpenBLAS and HPL were built to target and tune for `skylake-avx512`.
Note: HPL was run on bare metal with only 16 cores.
```text
================================================================================
HPLinpack 2.3 -- High-Performance Linpack benchmark -- December 2, 2018
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 117120
NB : 64
PMAP : Row-major process mapping
P : 4
Q : 4
PFACT : Right
NBMIN : 4
NDIV : 2
RFACT : Crout
BCAST : 1ringM
DEPTH : 1
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 117120 64 4 4 3587.95 2.9851e+02
HPL_pdgesv() start time Wed Aug 4 02:37:46 2021
HPL_pdgesv() end time Wed Aug 4 03:37:34 2021
--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
Max aggregated wall time rfact . . . : 4.54
+ Max aggregated wall time pfact . . : 1.58
+ Max aggregated wall time mxswp . . : 0.64
Max aggregated wall time update . . : 3581.40
+ Max aggregated wall time laswp . . : 81.90
Max aggregated wall time up tr sv . : 0.96
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 1.35396531e-03 ...... PASSED
================================================================================
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
```
HPL Efficiency:
298.51 ÷ 460.8 × 100% = **64.78%**
### AVX2-enabled Build of OpenBLAS and HPL
OpenBLAS and HPL were built to target and tune for `haswell`.
Note: HPL was run on bare metal with only 16 cores.
```text
================================================================================
HPLinpack 2.3 -- High-Performance Linpack benchmark -- December 2, 2018
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 117120
NB : 64
PMAP : Row-major process mapping
P : 4
Q : 4
PFACT : Right
NBMIN : 4
NDIV : 2
RFACT : Crout
BCAST : 1ringM
DEPTH : 1
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 117120 64 4 4 2770.75 3.8656e+02
HPL_pdgesv() start time Wed Aug 11 17:27:50 2021
HPL_pdgesv() end time Wed Aug 11 18:14:01 2021
--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
Max aggregated wall time rfact . . . : 4.46
+ Max aggregated wall time pfact . . : 1.32
+ Max aggregated wall time mxswp . . : 0.43
Max aggregated wall time update . . : 2766.19
+ Max aggregated wall time laswp . . : 137.95
Max aggregated wall time up tr sv . : 0.83
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 1.42427217e-03 ...... PASSED
================================================================================
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
```
HPL Efficiency:
386.56 ÷ 460.8 × 100% = **83.89%**
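
As a sanity check, the `Gflops` column in both outputs above can be recomputed from `N` and `Time`. The sketch below assumes the standard HPL operation count of 2/3·N³ + 3/2·N² floating-point operations; the function name is illustrative, and small differences are expected because `Time` is printed with only two decimals.

```python
def hpl_gflops(n, seconds):
    """Rate in GFLOPS, assuming 2/3*N^3 + 3/2*N^2 floating-point operations."""
    flops = (2.0 / 3.0) * n**3 + (3.0 / 2.0) * n**2
    return flops / seconds / 1e9


# N and wall times copied from the two bare metal outputs above.
n = 117120
wall_times = {"AVX-512 build": 3587.95, "AVX2 build": 2770.75}

for name, seconds in wall_times.items():
    print(f"{name}: {hpl_gflops(n, seconds):.2f} GFLOPS")
```

Both results should agree with the reported 2.9851e+02 and 3.8656e+02 to within rounding.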
### Configuration
```text
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
file device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
117120 Ns
1 # of NBs
64 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
5 # of process grids (P x Q)
4 4 4 4 4 Ps
4 4 4 4 4 Qs
16.0 threshold
1 # of panel fact
2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
1 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
```
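
The choice of `N` and `NB` in this input file can be related to the installed memory. The sketch below estimates the footprint of the dense N × N double-precision matrix and checks that `N` is a multiple of `NB`. It assumes that the 4 × 32 GB DIMMs amount to 128 GiB; the original does not state how `N` was chosen, so treat this as an illustration of a common sizing rule of thumb rather than the author's method.

```python
GIB = 2**30


def matrix_gib(n, bytes_per_element=8):
    """Size of a dense N x N double-precision matrix in GiB."""
    return n * n * bytes_per_element / GIB


n, nb = 117120, 64
total_memory_gib = 4 * 32  # assumption: 4 x 32 GB DIMMs treated as 128 GiB

print(f"matrix size      : {matrix_gib(n):.1f} GiB")
print(f"share of memory  : {matrix_gib(n) / total_memory_gib * 100:.1f}%")
print(f"N multiple of NB : {n % nb == 0}")
```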
### Results
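Because the AVX2 build outperformed the AVX-512 build on bare metal (386.56 vs. 298.51 GFLOPS), OpenBLAS and HPL were built to target and tune for `haswell` instead of `skylake-avx512` for the comparison below.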
Note: HPL was run with only 16 cores.
HPL Run Type | HPL Performance (GFLOPS)
-------------|-------------------------
Bare Metal | **386.37 (Baseline)** \*
\- | 388.20
\- | 386.00
\- | 386.02
\- | 385.85
\- | 385.78
Docker | **387.72 (+0.35%)** \*
\- | ~~339.40~~ †
\- | 388.55
\- | 388.58
\- | 388.02
\- | 386.63
\- | 386.82
Singularity | **387.36 (+0.26%)** \*
\- | ~~338.80~~ †
\- | 387.54
\- | 387.27
\- | 387.29
\- | 387.38
\- | 387.34

\* Average of the 5 consecutive runs listed beneath it; struck-through values are excluded.

† After the bare metal host was rebooted, a value in this range could no longer be reproduced, so it was discarded as an outlier.
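
The averages and percentages in the table can be recomputed from the individual runs. The following sketch uses the values from the table, with the two discarded outliers left out; the variable and function names are illustrative.

```python
from statistics import mean

# Per-run HPL results in GFLOPS, copied from the table (outliers excluded).
runs = {
    "Bare Metal": [388.20, 386.00, 386.02, 385.85, 385.78],
    "Docker": [388.55, 388.58, 388.02, 386.63, 386.82],
    "Singularity": [387.54, 387.27, 387.29, 387.38, 387.34],
}


def percent_vs_baseline(value, baseline):
    """Relative difference to the baseline, in percent."""
    return (value - baseline) / baseline * 100.0


averages = {name: mean(values) for name, values in runs.items()}
baseline = averages["Bare Metal"]

for name, avg in averages.items():
    delta = percent_vs_baseline(avg, baseline)
    print(f"{name:<12} {avg:.2f} GFLOPS ({delta:+.2f}%)")
```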
-----
## Test Item 2: Weather Research and Forecasting (WRF) Model
* WRF 4.2.2
* WPS 4.2
The "NCEP GDAS/FNL 0.25-Degree Global Tropospheric Analyses" dataset was used as the global model data for this simulation.
### Configuration
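The following namelist settings were used. The `&share`, `&geogrid`, `&ungrib` and `&metgrid` groups belong to WPS (`namelist.wps`); the remaining groups belong to WRF (`namelist.input`):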
```text
&share
wrf_core = 'ARW'
max_dom = 3
start_date = '2021-07-21_00:00:00', '2021-07-21_00:00:00', '2021-07-21_00:00:00',
'2021-07-21_00:00:00'
end_date = '2021-07-25_00:00:00', '2021-07-25_00:00:00', '2021-07-25_00:00:00',
'2021-07-25_00:00:00'
interval_seconds = 21600
io_form_geogrid = 2
opt_output_from_geogrid_path = './'
/
&geogrid
parent_id = 1, 1, 2, 3
parent_grid_ratio = 1, 3, 5, 3
i_parent_start = 1, 33, 63, 45
j_parent_start = 1, 23, 41, 33
e_we = 121, 172, 216, 331
e_sn = 105, 130, 226, 442
geog_data_res = 'gmted2010_30s+modis_15s_lake+modis_fpar+modis_lai+30s',
'gmted2010_30s+modis_15s_lake+modis_fpar+modis_lai+30s',
'gmted2010_30s+modis_15s_lake+modis_fpar+modis_lai+30s',
'gmted2010_30s+modis_15s_lake+modis_fpar+modis_lai+30s'
dx = 45000
dy = 45000
map_proj = 'lambert'
ref_lat = 27.5
ref_lon = 121.0
truelat1 = 30.0
truelat2 = 25.0
stand_lon = 121.0
geog_data_path = '/mnt/geographic-data'
opt_geogrid_tbl_path = './geogrid/'
/
&ungrib
out_format = 'WPS'
prefix = 'FILE'
/
&metgrid
fg_name = 'FILE'
io_form_metgrid = 2
opt_output_from_metgrid_path = './'
opt_metgrid_tbl_path = './metgrid/'
/
&time_control
run_days = 0
run_hours = 96
run_minutes = 0
run_seconds = 0
start_year = 2021, 2021, 2021, 2021
start_month = 7, 7, 7, 7
start_day = 21, 21, 21, 21
start_hour = 0, 0, 0, 0
start_minute = 0, 0, 0, 0
start_second = 0, 0, 0, 0
end_year = 2021, 2021, 2021, 2021
end_month = 7, 7, 7, 7
end_day = 25, 25, 25, 25
end_hour = 0, 0, 0, 0
end_minute = 0, 0, 0, 0
end_second = 0, 0, 0, 0
interval_seconds = 21600
input_from_file = .true., .true., .true., .true.
history_interval = 60, 60, 60, 60
frames_per_outfile = 24, 24, 24, 24
restart = .false.
restart_interval = 10080
io_form_history = 2
io_form_restart = 2
io_form_input = 2
io_form_boundary = 2
auxinput4_inname = 'wrflowinp_d<domain>'
auxinput4_interval = 360, 360, 360, 360
io_form_auxinput4 = 2
/
&domains
time_step = 180
time_step_fract_num = 0
time_step_fract_den = 1
max_dom = 3
e_we = 121, 172, 216, 331
e_sn = 105, 130, 226, 442
e_vert = 30, 30, 30, 30
p_top_requested = 5000
num_metgrid_levels = 34
num_metgrid_soil_levels = 4
dx = 45000, 15000, 3000, 1000
dy = 45000, 15000, 3000, 1000
grid_id = 1, 2, 3, 4
parent_id = 1, 1, 2, 3
i_parent_start = 1, 33, 63, 45
j_parent_start = 1, 23, 41, 33
parent_grid_ratio = 1, 3, 5, 3
parent_time_step_ratio = 1, 3, 5, 3
feedback = 0
smooth_option = 2
eta_levels = 1.0, 0.994, 0.982, 0.967, 0.949, 0.928, 0.906, 0.881, 0.855,
0.827, 0.798, 0.766, 0.734, 0.7, 0.665, 0.628, 0.59, 0.551,
0.511, 0.47, 0.427, 0.384, 0.339, 0.294, 0.247, 0.2, 0.151,
0.102, 0.051, 0.0
/
&physics
mp_physics = 7, 7, 7, 7
cu_physics = 1, 1, 1, 0
ra_lw_physics = 5, 5, 5, 5
ra_sw_physics = 5, 5, 5, 5
bl_pbl_physics = 1, 1, 1, 1
sf_sfclay_physics = 1, 1, 1, 1
sf_surface_physics = 2, 2, 2, 2
radt = 15, 15, 15, 15
bldt = 0, 0, 0, 0
cudt = 0, 0, 0, 0
icloud = 3
num_land_cat = 21
sf_urban_physics = 0, 0, 0, 0
cu_rad_feedback = .false., .false., .false., .false.
grav_settling = 2, 2, 2, 2
gsfcgce_2ice = 0
gsfcgce_hail = 0
iz0tlnd = 1
kfeta_trigger = 2
num_soil_layers = 4
rdlai2d = .true.
shcu_physics = 3, 3, 3, 3
sst_update = 1
swint_opt = 1
topo_wind = 1, 1, 1, 1
usemonalb = .true.
ysu_topdown_pblmix = 1
/
&fdda
grid_fdda = 1, 1, 0, 0
gfdda_inname = 'wrffdda_d<domain>'
gfdda_begin_h = 0, 0, 0, 0
gfdda_end_h = 96, 96, 96, 96
gfdda_interval_m = 360, 360, 360, 360
fgdt = 0, 0, 0, 0
if_no_pbl_nudging_uv = 0, 0, 0, 0
if_no_pbl_nudging_t = 1, 1, 1, 1
if_no_pbl_nudging_q = 1, 1, 1, 1
if_zfac_uv = 1, 1, 1, 1
k_zfac_uv = 7, 7, 7, 7
if_zfac_t = 1, 1, 1, 1
k_zfac_t = 7, 7, 7, 7
if_zfac_q = 1, 1, 1, 1
k_zfac_q = 7, 7, 7, 7
guv = 0.0003, 0.0003, 0.0003, 0.0003
gt = 0.0003, 0.0003, 0.0003, 0.0003
gq = 0.0003, 0.0003, 0.0003, 0.0003
if_ramping = 1
dtramp_min = 60
io_form_gfdda = 2
/
&dynamics
hybrid_opt = 2
w_damping = 0
diff_opt = 1, 1, 1, 1
km_opt = 4, 4, 4, 4
diff_6th_opt = 2, 2, 2, 2
diff_6th_factor = 0.12, 0.12, 0.12, 0.12
base_temp = 290.0
damp_opt = 3
zdamp = 5000.0, 5000.0, 5000.0, 5000.0
dampcoef = 0.2, 0.2, 0.2, 0.2
khdif = 0.0, 0.0, 0.0, 0.0
kvdif = 0.0, 0.0, 0.0, 0.0
non_hydrostatic = .true., .true., .true., .true.
moist_adv_opt = 1, 1, 1, 1
scalar_adv_opt = 1, 1, 1, 1
gwd_opt = 1, 1, 1, 1
epssm = 0.2, 0.2, 0.2, 0.2
/
&bdy_control
spec_bdy_width = 5
spec_zone = 1
relax_zone = 4
specified = .true., .false., .false., .false.
nested = .false., .true., .true., .true.
/
&grib2
/
&namelist_quilt
nio_tasks_per_group = 0
nio_groups = 1
/
```
### Results
Note: WRF was run with only 16 cores.
Note: With Singularity, the `--containall` command-line option must be specified; without it, the WRF model occasionally crashes with error messages such as:
```text
malloc(): largebin double linked list corrupted (bk)
Program received signal SIGABRT: Process abort signal.
```
```text
malloc(): unsorted double linked list corrupted
Program received signal SIGABRT: Process abort signal.
```
WRF Run Type | WRF Wall Time (HH:MM:SS)
-------------|-------------------------
Bare Metal | 07:03:46 (**Baseline**)
Docker | 07:31:16 (**+6.49%**)
Singularity | 06:40:48 (**-5.42%**)
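
The percentages in the table follow from converting each wall time to seconds and comparing it with the bare metal baseline. A minimal sketch of that calculation, with illustrative function names:

```python
def to_seconds(hhmmss):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def percent_vs_baseline(value, baseline):
    """Relative difference to the baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# Wall times copied from the table above.
wall_times = {
    "Bare Metal": "07:03:46",
    "Docker": "07:31:16",
    "Singularity": "06:40:48",
}

baseline = to_seconds(wall_times["Bare Metal"])
for name, wall_time in wall_times.items():
    delta = percent_vs_baseline(to_seconds(wall_time), baseline)
    print(f"{name:<12} {wall_time} ({delta:+.2f}%)")
```

Since this is a wall time, a negative percentage means the run finished faster than the baseline.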