# Benchmark (2x RTX 2080 Ti; Leadtek WS800)
## Benchmark details
* Hardware
  * **Leadtek WS800** (Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz; 64 GB RAM)
  * 2x Gigabyte GeForce RTX 2080 Ti
* Software
  * Official TensorFlow Docker image (```tensorflow/tensorflow:latest-gpu-py3-jupyter```), which includes:
    * TensorFlow ```v1.14.0```
    * CUDA ```v10.0.130```
    * cuDNN ```v7.4.1.5-1+cuda10.0```
  * A sketch of how this container might be launched is shown after this list.
* Model
  * ResNet-152; batch size = 48 per GPU (```batch_size=56``` runs out of memory; see the note in the command below).
* Task
  * Image classification
* Dataset
  * ImageNet2012 (synthetic images)
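For reference, the environment above can likely be reproduced along these lines. This is a minimal sketch, not the exact commands used: it assumes nvidia-docker2 provides the `--runtime=nvidia` flag and that the `tf_cnn_benchmarks` scripts come from the `tensorflow/benchmarks` repository, mounted at `/tf/benchmarks` to match the paths visible in the logs below.
```bash
# Clone the TensorFlow benchmark scripts and start the official TF GPU container.
# Assumes nvidia-docker2 is installed; mount point /tf/benchmarks matches the logs below.
git clone https://github.com/tensorflow/benchmarks.git
docker run --runtime=nvidia -it --rm \
  -v "$(pwd)/benchmarks":/tf/benchmarks \
  tensorflow/tensorflow:latest-gpu-py3-jupyter bash

# Inside the container:
cd /tf/benchmarks/scripts/tf_cnn_benchmarks
```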
## Benchmark: 2x 2080Ti
```bash
# the running command:
python3 tf_cnn_benchmarks.py --num_gpus=2 \
--batch_size=48 \
--model=resnet152 \
--nodistortions \
--all_reduce_spec=nccl
# Note: out-of-memory errors occur with `batch_size=56`, so 48 is used instead.
```
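When choosing the batch size, it can help to watch how close each 11 GB 2080 Ti gets to its memory limit while the benchmark runs. This was not part of the original run, but the standard `nvidia-smi` query interface offers one way to do it:
```bash
# Poll per-GPU memory usage once per second while the benchmark is running.
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 1
```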
```bash
# benchmark output:
...
Initializing graph
W0621 05:38:04.082254 140088826672960 deprecation_wrapper.py:119] From /tf/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2211: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.
W0621 05:38:04.856455 140088826672960 deprecation.py:323] From /tf/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2266: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-06-21 05:38:06.247899: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:02:00.0
2019-06-21 05:38:06.249580: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:03:00.0
2019-06-21 05:38:06.249622: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-06-21 05:38:06.249651: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-06-21 05:38:06.249662: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-06-21 05:38:06.249673: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-06-21 05:38:06.249695: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-06-21 05:38:06.249721: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-06-21 05:38:06.249735: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-06-21 05:38:06.256085: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1
2019-06-21 05:38:06.256211: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-21 05:38:06.256223: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0 1
2019-06-21 05:38:06.256232: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N N
2019-06-21 05:38:06.256240: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1: N N
2019-06-21 05:38:06.260160: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10283 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:02:00.0, compute capability: 7.5)
2019-06-21 05:38:06.261258: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10233 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:03:00.0, compute capability: 7.5)
2019-06-21 05:38:08.061918: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
I0621 05:38:10.374477 140088826672960 session_manager.py:500] Running local_init_op.
I0621 05:38:10.648975 140088826672960 session_manager.py:502] Done running local_init_op.
Running warm up
2019-06-21 05:38:19.633276: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-06-21 05:38:20.332730: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-06-21 05:38:23.064493: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.42GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:38:23.064584: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.42GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:38:23.112912: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.31GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:38:23.112954: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.31GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:38:23.128051: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.46GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:38:23.128091: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.46GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:38:23.146236: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.46GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:38:23.146279: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.46GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:38:23.159671: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:38:23.159709: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Done warm up
Step Img/sec total_loss
1 images/sec: 212.7 +/- 0.0 (jitter = 0.0) 8.850
10 images/sec: 220.0 +/- 1.3 (jitter = 3.4) 8.880
20 images/sec: 220.3 +/- 0.7 (jitter = 1.7) 8.885
30 images/sec: 220.3 +/- 0.6 (jitter = 2.6) 8.724
40 images/sec: 220.3 +/- 0.5 (jitter = 2.8) 8.968
50 images/sec: 220.0 +/- 0.5 (jitter = 2.6) 8.608
60 images/sec: 219.7 +/- 0.4 (jitter = 2.2) 9.054
70 images/sec: 219.7 +/- 0.4 (jitter = 2.0) 8.669
80 images/sec: 219.5 +/- 0.3 (jitter = 2.1) 9.031
90 images/sec: 219.3 +/- 0.3 (jitter = 2.4) 9.006
100 images/sec: 219.1 +/- 0.3 (jitter = 2.6) 8.748
----------------------------------------------------------------
total images/sec: 219.01
----------------------------------------------------------------
```
## Benchmark: 1x 2080Ti
```bash
# the running command:
CUDA_VISIBLE_DEVICES=0 python3 tf_cnn_benchmarks.py --num_gpus=1 \
--batch_size=48 \
--model=resnet152 \
--nodistortions \
--all_reduce_spec=nccl
```
```bash
# benchmark output:
Initializing graph
W0621 05:40:20.000820 140495014573888 deprecation_wrapper.py:119] From /tf/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2211: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.
W0621 05:40:20.654015 140495014573888 deprecation.py:323] From /tf/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2266: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-06-21 05:40:21.869804: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:02:00.0
2019-06-21 05:40:21.869853: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-06-21 05:40:21.869865: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-06-21 05:40:21.869885: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-06-21 05:40:21.869896: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-06-21 05:40:21.869906: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-06-21 05:40:21.869917: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-06-21 05:40:21.869927: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-06-21 05:40:21.872965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-06-21 05:40:21.873005: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-21 05:40:21.873013: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0
2019-06-21 05:40:21.873032: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N
2019-06-21 05:40:21.876564: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10283 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:02:00.0, compute capability: 7.5)
2019-06-21 05:40:23.402122: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
I0621 05:40:24.162094 140495014573888 session_manager.py:500] Running local_init_op.
I0621 05:40:24.292371 140495014573888 session_manager.py:502] Done running local_init_op.
Running warm up
2019-06-21 05:40:28.721099: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-06-21 05:40:28.961160: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-06-21 05:40:30.724337: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.42GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:40:30.724383: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.42GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:40:30.773864: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.31GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:40:30.773889: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.31GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:40:30.788836: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.46GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:40:30.788893: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.46GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:40:30.807218: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.46GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:40:30.807245: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.46GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:40:30.820749: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-06-21 05:40:30.820775: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Done warm up
Step Img/sec total_loss
1 images/sec: 119.8 +/- 0.0 (jitter = 0.0) 8.790
10 images/sec: 120.1 +/- 0.1 (jitter = 0.1) 8.814
20 images/sec: 119.9 +/- 0.1 (jitter = 0.3) 8.780
30 images/sec: 119.7 +/- 0.1 (jitter = 0.4) 8.617
40 images/sec: 119.6 +/- 0.1 (jitter = 0.5) 9.012
50 images/sec: 119.4 +/- 0.1 (jitter = 0.6) 8.780
60 images/sec: 119.3 +/- 0.1 (jitter = 0.7) 9.066
70 images/sec: 119.1 +/- 0.1 (jitter = 0.8) 8.706
80 images/sec: 119.0 +/- 0.1 (jitter = 1.0) 9.015
90 images/sec: 118.8 +/- 0.1 (jitter = 0.9) 9.017
100 images/sec: 118.7 +/- 0.1 (jitter = 1.0) 8.880
----------------------------------------------------------------
total images/sec: 118.69
----------------------------------------------------------------
```
## Conclusion
```python
In [1]: 219.01 / 118.69
Out[1]: 1.845227062094532
```
i.e., adding the second GPU gives roughly a **1.85x** speed-up for this image classification task.
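As a quick sanity check on the same numbers, the speed-up can also be expressed as a per-GPU scaling efficiency (speed-up divided by the number of GPUs); a small sketch:
```python
# Scaling efficiency: measured 2-GPU throughput relative to ideal
# linear scaling of the single-GPU throughput.
two_gpu = 219.01   # total images/sec with 2x 2080 Ti
one_gpu = 118.69   # total images/sec with 1x 2080 Ti
num_gpus = 2

speedup = two_gpu / one_gpu
efficiency = speedup / num_gpus
print(f"speed-up:   {speedup:.2f}x")    # ~1.85x
print(f"efficiency: {efficiency:.1%}")  # ~92.3%
```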