# Benchmark (2x 2080Ti; Leadtek WS800)

## Benchmark details

* Hardware
  * **Leadtek WS800** (Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz; 64GB RAM)
  * 2x Gigabyte GeForce RTX 2080 Ti
* Software
  * Official TensorFlow Docker image (```tensorflow/tensorflow:latest-gpu-py3-jupyter```), which includes:
    * TensorFlow ```v1.14.0```
    * CUDA ```v10.0.130```
    * cuDNN ```v7.4.1.5-1+cuda10.0```
* Model
  * ResNet152; batch size = 48 per GPU (batch size 56 runs out of memory; see the note in the command below)
* Task
  * Image classification
* Dataset
  * ImageNet2012 (synthetic images)
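The exact container setup is not recorded here, but the environment above can be reproduced roughly as follows. This is only a sketch: it assumes the NVIDIA container runtime is installed on the host, and it clones the `tf_cnn_benchmarks` scripts from the official `tensorflow/benchmarks` repository into `/tf/benchmarks` (the path that appears in the logs below).

```bash
# Sketch of the environment setup (not the exact commands used for this run).
docker pull tensorflow/tensorflow:latest-gpu-py3-jupyter
docker run --runtime=nvidia -it --rm tensorflow/tensorflow:latest-gpu-py3-jupyter bash
# On Docker >= 19.03 with nvidia-container-toolkit, `--gpus all` can replace `--runtime=nvidia`.

# Inside the container: fetch the benchmark scripts (git may need to be installed first).
git clone https://github.com/tensorflow/benchmarks.git /tf/benchmarks
cd /tf/benchmarks/scripts/tf_cnn_benchmarks
# A TensorFlow-version-matched branch (e.g. cnn_tf_v1.14_compatible) may be preferable to master.
```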
## Benchmark: 2x 2080Ti

```bash
# the running command:
python3 tf_cnn_benchmarks.py --num_gpus=2 \
    --batch_size=48 \
    --model=resnet152 \
    --nodistortions \
    --all_reduce_spec=nccl

# Out-of-memory info appears if `batch_size=56`.
```

```bash
# benchmark output:
...
Initializing graph
W0621 05:38:04.082254 140088826672960 deprecation_wrapper.py:119] From /tf/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2211: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.
W0621 05:38:04.856455 140088826672960 deprecation.py:323] From /tf/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2266: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-06-21 05:38:06.247899: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:02:00.0
2019-06-21 05:38:06.249580: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:03:00.0
2019-06-21 05:38:06.249622: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-06-21 05:38:06.249651: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-06-21 05:38:06.249662: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-06-21 05:38:06.249673: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-06-21 05:38:06.249695: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-06-21 05:38:06.249721: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-06-21 05:38:06.249735: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-06-21 05:38:06.256085: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1
2019-06-21 05:38:06.256211: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-21 05:38:06.256223: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 1
2019-06-21 05:38:06.256232: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N N
2019-06-21 05:38:06.256240: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1:   N N
2019-06-21 05:38:06.260160: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10283 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:02:00.0, compute capability: 7.5)
2019-06-21 05:38:06.261258: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10233 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:03:00.0, compute capability: 7.5)
2019-06-21 05:38:08.061918: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
I0621 05:38:10.374477 140088826672960 session_manager.py:500] Running local_init_op.
I0621 05:38:10.648975 140088826672960 session_manager.py:502] Done running local_init_op.
Running warm up
2019-06-21 05:38:19.633276: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-06-21 05:38:20.332730: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-06-21 05:38:23.064493: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.42GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:38:23.064584: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.42GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:38:23.112912: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.31GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:38:23.112954: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.31GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:38:23.128051: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.46GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:38:23.128091: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.46GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:38:23.146236: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.46GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:38:23.146279: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.46GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:38:23.159671: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:38:23.159709: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Done warm up
Step  Img/sec                                     total_loss
1     images/sec: 212.7 +/- 0.0 (jitter = 0.0)    8.850
10    images/sec: 220.0 +/- 1.3 (jitter = 3.4)    8.880
20    images/sec: 220.3 +/- 0.7 (jitter = 1.7)    8.885
30    images/sec: 220.3 +/- 0.6 (jitter = 2.6)    8.724
40    images/sec: 220.3 +/- 0.5 (jitter = 2.8)    8.968
50    images/sec: 220.0 +/- 0.5 (jitter = 2.6)    8.608
60    images/sec: 219.7 +/- 0.4 (jitter = 2.2)    9.054
70    images/sec: 219.7 +/- 0.4 (jitter = 2.0)    8.669
80    images/sec: 219.5 +/- 0.3 (jitter = 2.1)    9.031
90    images/sec: 219.3 +/- 0.3 (jitter = 2.4)    9.006
100   images/sec: 219.1 +/- 0.3 (jitter = 2.6)    8.748
----------------------------------------------------------------
total images/sec: 219.01
----------------------------------------------------------------
```

## Benchmark: 1x 2080Ti

```bash
# the running command:
CUDA_VISIBLE_DEVICES=0 python3 tf_cnn_benchmarks.py --num_gpus=1 \
    --batch_size=48 \
    --model=resnet152 \
    --nodistortions \
    --all_reduce_spec=nccl
```

```
Initializing graph
W0621 05:40:20.000820 140495014573888 deprecation_wrapper.py:119] From /tf/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2211: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.
W0621 05:40:20.654015 140495014573888 deprecation.py:323] From /tf/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2266: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-06-21 05:40:21.869804: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:02:00.0
2019-06-21 05:40:21.869853: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-06-21 05:40:21.869865: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-06-21 05:40:21.869885: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-06-21 05:40:21.869896: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-06-21 05:40:21.869906: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-06-21 05:40:21.869917: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-06-21 05:40:21.869927: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-06-21 05:40:21.872965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-06-21 05:40:21.873005: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-21 05:40:21.873013: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0
2019-06-21 05:40:21.873032: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N
2019-06-21 05:40:21.876564: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10283 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:02:00.0, compute capability: 7.5)
2019-06-21 05:40:23.402122: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
I0621 05:40:24.162094 140495014573888 session_manager.py:500] Running local_init_op.
I0621 05:40:24.292371 140495014573888 session_manager.py:502] Done running local_init_op.
Running warm up
2019-06-21 05:40:28.721099: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-06-21 05:40:28.961160: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-06-21 05:40:30.724337: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.42GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:40:30.724383: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.42GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:40:30.773864: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.31GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:40:30.773889: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.31GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:40:30.788836: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.46GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:40:30.788893: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.46GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:40:30.807218: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.46GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:40:30.807245: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.46GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:40:30.820749: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:40:30.820775: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Done warm up
Step  Img/sec                                     total_loss
1     images/sec: 119.8 +/- 0.0 (jitter = 0.0)    8.790
10    images/sec: 120.1 +/- 0.1 (jitter = 0.1)    8.814
20    images/sec: 119.9 +/- 0.1 (jitter = 0.3)    8.780
30    images/sec: 119.7 +/- 0.1 (jitter = 0.4)    8.617
40    images/sec: 119.6 +/- 0.1 (jitter = 0.5)    9.012
50    images/sec: 119.4 +/- 0.1 (jitter = 0.6)    8.780
60    images/sec: 119.3 +/- 0.1 (jitter = 0.7)    9.066
70    images/sec: 119.1 +/- 0.1 (jitter = 0.8)    8.706
80    images/sec: 119.0 +/- 0.1 (jitter = 1.0)    9.015
90    images/sec: 118.8 +/- 0.1 (jitter = 0.9)    9.017
100   images/sec: 118.7 +/- 0.1 (jitter = 1.0)    8.880
----------------------------------------------------------------
total images/sec: 118.69
----------------------------------------------------------------
```

## Conclusion

```python
In [1]: 219.01 / 118.69
Out[1]: 1.845227062094532
```

i.e. the two-GPU configuration achieves roughly a **1.84x** speed-up over a single 2080 Ti for this image classification task.
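For context, the same numbers can also be read as multi-GPU scaling efficiency, i.e. the measured two-GPU throughput relative to perfect linear scaling of the single-GPU result (this is simply the ratio above divided by the number of GPUs):

```python
In [2]: round(219.01 / (2 * 118.69), 3)
Out[2]: 0.923
```

In other words, the two-GPU run reaches about 92% of ideal linear scaling on this workload.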