# 2080Ti and V100 Benchmarks

Experiments are performed using synthetic ImageNet data.

* Hardware
    * LeadTek GS4820A
    * 4X GeForce RTX 2080 Ti + 4X Tesla V100
* Software
    * The environment is packaged as a Docker image, `honghu/keras:tf-cu10.0-dnn7.4-py3-avx2-19.01`, which is [available on DockerHub](https://hub.docker.com/r/honghu/keras/).

We run the benchmarks using the official TensorFlow repo:

```bash
git clone --branch cnn_tf_v1.12_compatible https://github.com/tensorflow/benchmarks.git
cd benchmarks/scripts/tf_cnn_benchmarks
```

### 4X GeForce RTX 2080 Ti (FP16)

* Command Line

```bash
export CUDA_VISIBLE_DEVICES=4,5,6,7
python tf_cnn_benchmarks.py --num_gpus=4 --batch_size=40 \
    --model=resnet152 \
    --data_format=NCHW \
    --variable_update=replicated \
    --nodistortions \
    --use_fp16
```

* Output

```bash
...
TensorFlow: 1.12
Model: resnet152
Dataset: imagenet (synthetic)
Mode: BenchmarkMode.TRAIN
SingleSess: False
Batch size: 160 global
            40.0 per device
Num batches: 100
Num epochs: 0.01
Devices: ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']
Data format: NCHW
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating training model
Initializing graph
W0114 10:15:47.726546 140230479963968 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2157: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-01-14 10:15:53.902021: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2019-01-14 10:15:53.916469: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-14 10:15:53.916481: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2019-01-14 10:15:53.916488: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N N N N
2019-01-14 10:15:53.916493: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: N N N N
2019-01-14 10:15:53.916498: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: N N N N
2019-01-14 10:15:53.916502: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: N N N N
2019-01-14 10:15:53.918515: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10168 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:0c:00.0, compute capability: 7.5)
2019-01-14 10:15:53.923486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10168 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:0d:00.0, compute capability: 7.5)
2019-01-14 10:15:53.930171: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10168 MB memory) -> physical GPU (device: 2, name: GeForce RTX 2080 Ti, pci bus id: 0000:0e:00.0, compute capability: 7.5)
2019-01-14 10:15:53.937095: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10168 MB memory) -> physical GPU (device: 3, name: GeForce RTX 2080 Ti, pci bus id: 0000:0f:00.0, compute capability: 7.5)
I0114 10:16:04.913285 140230479963968 tf_logging.py:115] Running local_init_op.
I0114 10:16:08.349156 140230479963968 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 432.3 +/- 0.0 (jitter = 0.0) 10.080
10 images/sec: 431.3 +/- 0.8 (jitter = 1.8) 9.924
20 images/sec: 431.6 +/- 0.6 (jitter = 2.6) 9.975
30 images/sec: 431.0 +/- 0.5 (jitter = 1.8) 9.900
40 images/sec: 431.5 +/- 0.5 (jitter = 2.1) 9.846
50 images/sec: 431.4 +/- 0.5 (jitter = 2.2) 9.870
60 images/sec: 431.0 +/- 0.4 (jitter = 2.5) 9.968
70 images/sec: 431.0 +/- 0.5 (jitter = 3.2) 9.862
80 images/sec: 431.3 +/- 0.5 (jitter = 3.6) 9.932
90 images/sec: 431.2 +/- 0.4 (jitter = 3.3) 9.967
100 images/sec: 431.0 +/- 0.4 (jitter = 3.0) 9.935
----------------------------------------------------------------
total images/sec: 430.91
----------------------------------------------------------------
```

### 4X Quadro RTX 8000 (FP16)

* Command Line

```bash
export CUDA_VISIBLE_DEVICES=4,5,6,7
python tf_cnn_benchmarks.py --num_gpus=4 --batch_size=40 \
    --model=resnet152 \
    --data_format=NCHW \
    --variable_update=replicated \
    --nodistortions \
    --use_fp16
```

* Output

```bash
...
TensorFlow: 1.12
Model: resnet152
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 160 global
            40.0 per device
Num batches: 100
Num epochs: 0.01
Devices: ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']
Data format: NCHW
Layout optimizer: False
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating model
W0306 08:43:44.548509 139684100622144 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:1761: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-03-06 08:43:51.797984: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0c:00.0
totalMemory: 47.43GiB freeMemory: 47.23GiB
2019-03-06 08:43:52.348729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0d:00.0
totalMemory: 47.43GiB freeMemory: 47.23GiB
2019-03-06 08:43:52.925093: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 2 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0e:00.0
totalMemory: 47.43GiB freeMemory: 47.23GiB
2019-03-06 08:43:53.639499: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 3 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0f:00.0
totalMemory: 47.43GiB freeMemory: 47.23GiB
2019-03-06 08:43:53.718676: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2019-03-06 08:43:56.796882: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-06 08:43:56.796930: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2019-03-06 08:43:56.796938: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y Y Y
2019-03-06 08:43:56.796943: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N Y Y
2019-03-06 08:43:56.796948: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: Y Y N Y
2019-03-06 08:43:56.796953: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: Y Y Y N
2019-03-06 08:43:56.818119: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 45863 MB memory) -> physical GPU (device: 0, name: Quadro RTX 8000, pci bus id: 0000:0c:00.0, compute capability: 7.5)
2019-03-06 08:43:56.820568: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 45863 MB memory) -> physical GPU (device: 1, name: Quadro RTX 8000, pci bus id: 0000:0d:00.0, compute capability: 7.5)
2019-03-06 08:43:56.821078: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 45863 MB memory) -> physical GPU (device: 2, name: Quadro RTX 8000, pci bus id: 0000:0e:00.0, compute capability: 7.5)
2019-03-06 08:43:56.821575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 45863 MB memory) -> physical GPU (device: 3, name: Quadro RTX 8000, pci bus id: 0000:0f:00.0, compute capability: 7.5)
I0306 08:44:08.992272 139684100622144 tf_logging.py:115] Running local_init_op.
I0306 08:44:13.005366 139684100622144 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 779.5 +/- 0.0 (jitter = 0.0) 9.891
10 images/sec: 746.1 +/- 8.7 (jitter = 22.5) 9.761
20 images/sec: 744.8 +/- 4.8 (jitter = 22.0) 9.710
30 images/sec: 748.3 +/- 4.7 (jitter = 20.6) 9.929
40 images/sec: 750.6 +/- 3.7 (jitter = 18.2) 9.950
50 images/sec: 751.1 +/- 3.2 (jitter = 17.5) 9.790
60 images/sec: 751.3 +/- 2.7 (jitter = 14.7) 9.806
70 images/sec: 751.1 +/- 2.4 (jitter = 15.8) 9.851
80 images/sec: 750.6 +/- 2.4 (jitter = 15.6) 9.820
90 images/sec: 750.2 +/- 2.2 (jitter = 16.7) 9.871
100 images/sec: 750.5 +/- 2.0 (jitter = 17.0) 9.786
----------------------------------------------------------------
total images/sec: 750.02
----------------------------------------------------------------
```

### 4X V100 PCIE, 3X 32GB + 1X 16GB (FP16)

* Command Line

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
python tf_cnn_benchmarks.py --num_gpus=4 --batch_size=40 \
    --model=resnet152 \
    --data_format=NCHW \
    --variable_update=replicated \
    --nodistortions \
    --use_fp16
```

* Output

```bash
...
TensorFlow: 1.12
Model: resnet152
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 160 global
            40.0 per device
Num batches: 100
Num epochs: 0.01
Devices: ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']
Data format: NCHW
Layout optimizer: False
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating model
W0306 08:50:32.981509 140223759439680 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:1761: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-03-06 08:50:41.077971: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:04:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-03-06 08:50:41.921124: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:06:00.0
totalMemory: 15.75GiB freeMemory: 15.34GiB
2019-03-06 08:50:42.780290: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 2 with properties:
name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:07:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-03-06 08:50:43.709868: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 3 with properties:
name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:08:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-03-06 08:50:43.782253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2019-03-06 08:50:45.838085: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-06 08:50:45.838139: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2019-03-06 08:50:45.838146: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y Y Y
2019-03-06 08:50:45.838150: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N Y Y
2019-03-06 08:50:45.838155: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: Y Y N Y
2019-03-06 08:50:45.838160: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: Y Y Y N
2019-03-06 08:50:45.846207: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30378 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:04:00.0, compute capability: 7.0)
2019-03-06 08:50:45.848779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 14846 MB memory) -> physical GPU (device: 1, name: Tesla V100-PCIE-16GB, pci bus id: 0000:06:00.0, compute capability: 7.0)
2019-03-06 08:50:45.849314: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 30378 MB memory) -> physical GPU (device: 2, name: Tesla V100-PCIE-32GB, pci bus id: 0000:07:00.0, compute capability: 7.0)
2019-03-06 08:50:45.849784: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 30378 MB memory) -> physical GPU (device: 3, name: Tesla V100-PCIE-32GB, pci bus id: 0000:08:00.0, compute capability: 7.0)
I0306 08:50:56.379636 140223759439680 tf_logging.py:115] Running local_init_op.
I0306 08:51:00.715517 140223759439680 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 736.6 +/- 0.0 (jitter = 0.0) 9.910
10 images/sec: 807.3 +/- 10.0 (jitter = 24.3) 9.756
20 images/sec: 810.1 +/- 8.2 (jitter = 28.5) 9.725
30 images/sec: 820.2 +/- 6.6 (jitter = 19.1) 9.924
40 images/sec: 814.2 +/- 6.2 (jitter = 33.7) 9.938
50 images/sec: 808.9 +/- 5.7 (jitter = 33.0) 9.798
60 images/sec: 811.7 +/- 5.1 (jitter = 36.4) 9.818
70 images/sec: 808.6 +/- 4.8 (jitter = 35.1) 9.846
80 images/sec: 808.5 +/- 4.4 (jitter = 33.0) 9.831
90 images/sec: 806.9 +/- 4.2 (jitter = 33.4) 9.909
100 images/sec: 805.3 +/- 4.0 (jitter = 33.5) 9.778
----------------------------------------------------------------
total images/sec: 804.73
----------------------------------------------------------------
```

### 4X Quadro RTX 8000 (FP16, batch size 256 per GPU)

* Command Line

```bash
export CUDA_VISIBLE_DEVICES=4,5,6,7
python tf_cnn_benchmarks.py --num_gpus=4 --batch_size=256 \
    --model=resnet152 \
    --data_format=NCHW \
    --variable_update=replicated \
    --nodistortions \
    --use_fp16
```

* Output

```bash
TensorFlow: 1.12
Model: resnet152
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 1024 global
            256.0 per device
Num batches: 100
Num epochs: 0.08
Devices: ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']
Data format: NCHW
Layout optimizer: False
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating model
W0306 08:55:52.772576 139757467600704 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:1761: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-03-06 08:56:01.631109: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0c:00.0
totalMemory: 47.43GiB freeMemory: 47.23GiB
2019-03-06 08:56:02.232898: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0d:00.0
totalMemory: 47.43GiB freeMemory: 47.23GiB
2019-03-06 08:56:03.182707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 2 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0e:00.0
totalMemory: 47.43GiB freeMemory: 47.23GiB
2019-03-06 08:56:04.382598: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 3 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0f:00.0
totalMemory: 47.43GiB freeMemory: 47.23GiB
2019-03-06 08:56:04.436815: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2019-03-06 08:56:06.766942: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-06 08:56:06.766985: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2019-03-06 08:56:06.766992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y Y Y
2019-03-06 08:56:06.766997: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N Y Y
2019-03-06 08:56:06.767001: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: Y Y N Y
2019-03-06 08:56:06.767005: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: Y Y Y N
2019-03-06 08:56:06.774166: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 45863 MB memory) -> physical GPU (device: 0, name: Quadro RTX 8000, pci bus id: 0000:0c:00.0, compute capability: 7.5)
2019-03-06 08:56:06.775358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 45863 MB memory) -> physical GPU (device: 1, name: Quadro RTX 8000, pci bus id: 0000:0d:00.0, compute capability: 7.5)
2019-03-06 08:56:06.775978: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 45863 MB memory) -> physical GPU (device: 2, name: Quadro RTX 8000, pci bus id: 0000:0e:00.0, compute capability: 7.5)
2019-03-06 08:56:06.776465: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 45863 MB memory) -> physical GPU (device: 3, name: Quadro RTX 8000, pci bus id: 0000:0f:00.0, compute capability: 7.5)
I0306 08:56:19.953084 139757467600704 tf_logging.py:115] Running local_init_op.
I0306 08:56:28.303954 139757467600704 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 1013.0 +/- 0.0 (jitter = 0.0) 9.850
10 images/sec: 1007.5 +/- 1.2 (jitter = 3.1) 9.880
20 images/sec: 1005.7 +/- 0.8 (jitter = 4.1) 9.823
30 images/sec: 1003.9 +/- 0.8 (jitter = 4.4) 9.780
40 images/sec: 1002.4 +/- 0.7 (jitter = 4.9) 9.705
50 images/sec: 1000.6 +/- 0.8 (jitter = 5.1) 9.689
60 images/sec: 999.8 +/- 0.7 (jitter = 5.0) 9.654
70 images/sec: 999.0 +/- 0.7 (jitter = 5.6) 9.625
80 images/sec: 998.0 +/- 0.7 (jitter = 5.5) 9.599
90 images/sec: 996.9 +/- 0.7 (jitter = 6.6) 9.580
100 images/sec: 996.3 +/- 0.7 (jitter = 6.3) 9.559
----------------------------------------------------------------
total images/sec: 996.13
----------------------------------------------------------------
```

### 4X Quadro RTX 8000 (FP16, batch size 96 per GPU)

* Command Line

```bash
export CUDA_VISIBLE_DEVICES=4,5,6,7
python tf_cnn_benchmarks.py --num_gpus=4 --batch_size=96 \
    --model=resnet152 \
    --data_format=NCHW \
    --variable_update=replicated \
    --nodistortions \
    --use_fp16
```

* Output

```bash
TensorFlow: 1.12
Model: resnet152
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 384 global
            96.0 per device
Num batches: 100
Num epochs: 0.03
Devices: ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']
Data format: NCHW
Layout optimizer: False
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating model
W0306 09:02:11.327498 140704704161600 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:1761: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-03-06 09:02:18.744726: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0c:00.0
totalMemory: 47.43GiB freeMemory: 47.23GiB
2019-03-06 09:02:19.604581: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0d:00.0
totalMemory: 47.43GiB freeMemory: 47.23GiB
2019-03-06 09:02:20.804147: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 2 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0e:00.0
totalMemory: 47.43GiB freeMemory: 47.23GiB
2019-03-06 09:02:21.381372: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 3 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0f:00.0
totalMemory: 47.43GiB freeMemory: 47.23GiB
2019-03-06 09:02:21.416588: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2019-03-06 09:02:23.722670: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-06 09:02:23.722729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2019-03-06 09:02:23.722750: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y Y Y
2019-03-06 09:02:23.722766: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N Y Y
2019-03-06 09:02:23.722780: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: Y Y N Y
2019-03-06 09:02:23.722793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: Y Y Y N
2019-03-06 09:02:23.748770: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 45863 MB memory) -> physical GPU (device: 0, name: Quadro RTX 8000, pci bus id: 0000:0c:00.0, compute capability: 7.5)
2019-03-06 09:02:23.750335: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 45863 MB memory) -> physical GPU (device: 1, name: Quadro RTX 8000, pci bus id: 0000:0d:00.0, compute capability: 7.5)
2019-03-06 09:02:23.751031: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 45863 MB memory) -> physical GPU (device: 2, name: Quadro RTX 8000, pci bus id: 0000:0e:00.0, compute capability: 7.5)
2019-03-06 09:02:23.751535: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 45863 MB memory) -> physical GPU (device: 3, name: Quadro RTX 8000, pci bus id: 0000:0f:00.0, compute capability: 7.5)
I0306 09:02:34.783475 140704704161600 tf_logging.py:115] Running local_init_op.
I0306 09:02:39.586142 140704704161600 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 877.0 +/- 0.0 (jitter = 0.0) 9.871
10 images/sec: 876.2 +/- 1.8 (jitter = 1.9) 9.873
20 images/sec: 878.4 +/- 1.3 (jitter = 4.2) 9.847
30 images/sec: 876.8 +/- 1.1 (jitter = 4.9) 9.839
40 images/sec: 874.9 +/- 1.0 (jitter = 5.8) 9.806
50 images/sec: 874.3 +/- 0.9 (jitter = 6.1) 9.893
60 images/sec: 873.2 +/- 0.9 (jitter = 6.6) 9.823
70 images/sec: 872.3 +/- 0.8 (jitter = 7.2) 9.892
80 images/sec: 871.4 +/- 0.8 (jitter = 6.6) 9.790
90 images/sec: 870.2 +/- 0.8 (jitter = 7.7) 9.795
100 images/sec: 869.6 +/- 0.8 (jitter = 7.0) 9.746
----------------------------------------------------------------
total images/sec: 869.30
----------------------------------------------------------------
```

### 4X Quadro RTX 8000 (FP16, batch size 480 per GPU)

* Command Line

```bash
export CUDA_VISIBLE_DEVICES=4,5,6,7
python tf_cnn_benchmarks.py --num_gpus=4 --batch_size=480 \
    --model=resnet152 \
    --data_format=NCHW \
    --variable_update=replicated \
    --nodistortions \
    --use_fp16
```

* Output

```bash
TensorFlow: 1.12
Model: resnet152
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 1920 global
            480.0 per device
Num batches: 100
Num epochs: 0.15
Devices: ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']
Data format: NCHW
Layout optimizer: False
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating model
W0306 09:09:39.098187 140650919057216 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:1761: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-03-06 09:09:50.061734: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0c:00.0
totalMemory: 47.43GiB freeMemory: 47.23GiB
2019-03-06 09:09:50.927108: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0d:00.0
totalMemory: 47.43GiB freeMemory: 47.23GiB
2019-03-06 09:09:52.016024: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 2 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0e:00.0
totalMemory: 47.43GiB freeMemory: 47.23GiB
2019-03-06 09:09:52.936119: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 3 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0f:00.0
totalMemory: 47.43GiB freeMemory: 47.23GiB
2019-03-06 09:09:53.042688: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2019-03-06 09:09:55.918460: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-06 09:09:55.918506: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2019-03-06 09:09:55.918514: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y Y Y
2019-03-06 09:09:55.918519: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N Y Y
2019-03-06 09:09:55.918524: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: Y Y N Y
2019-03-06 09:09:55.918528: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: Y Y Y N
2019-03-06 09:09:55.926480: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 45863 MB memory) -> physical GPU (device: 0, name: Quadro RTX 8000, pci bus id: 0000:0c:00.0, compute capability: 7.5)
2019-03-06 09:09:55.927890: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 45863 MB memory) -> physical GPU (device: 1, name: Quadro RTX 8000, pci bus id: 0000:0d:00.0, compute capability: 7.5)
2019-03-06 09:09:55.928336: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 45863 MB memory) -> physical GPU (device: 2, name: Quadro RTX 8000, pci bus id: 0000:0e:00.0, compute capability: 7.5)
2019-03-06 09:09:55.928650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 45863 MB memory) -> physical GPU (device: 3, name: Quadro RTX 8000, pci bus id: 0000:0f:00.0, compute capability: 7.5)
I0306 09:10:10.047497 140650919057216 tf_logging.py:115] Running local_init_op.
I0306 09:10:24.649841 140650919057216 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 1070.7 +/- 0.0 (jitter = 0.0) 9.861
10 images/sec: 1074.4 +/- 1.2 (jitter = 2.9) 9.772
20 images/sec: 1073.4 +/- 0.9 (jitter = 4.8) 9.730
30 images/sec: 1071.3 +/- 0.9 (jitter = 5.6) 9.652
40 images/sec: 1069.7 +/- 0.8 (jitter = 6.0) 9.612
50 images/sec: 1067.9 +/- 0.8 (jitter = 6.4) 9.572
60 images/sec: 1066.7 +/- 0.8 (jitter = 6.3) 9.546
70 images/sec: 1065.6 +/- 0.8 (jitter = 6.3) 9.524
80 images/sec: 1064.6 +/- 0.7 (jitter = 5.6) 9.526
90 images/sec: 1063.8 +/- 0.7 (jitter = 5.6) 9.521
100 images/sec: 1063.1 +/- 0.7 (jitter = 5.4) 9.516
----------------------------------------------------------------
total images/sec: 1062.99
----------------------------------------------------------------
```

### 2X GeForce RTX 2080 Ti (FP16)

* Command Line

```bash
export CUDA_VISIBLE_DEVICES=4,5
python tf_cnn_benchmarks.py --num_gpus=2 --batch_size=40 \
    --model=resnet152 \
    --data_format=NCHW \
    --variable_update=replicated \
    --nodistortions \
    --use_fp16
```

* Output

```bash
TensorFlow: 1.12
Model: resnet152
Dataset: imagenet (synthetic)
Mode: BenchmarkMode.TRAIN
SingleSess: False
Batch size: 80 global
            40.0 per device
Num batches: 100
Num epochs: 0.01
Devices: ['/gpu:0', '/gpu:1']
Data format: NCHW
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating training model
Initializing graph
W0114 10:24:47.426653 140179035752256 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2157: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-01-14 10:24:50.765500: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1
2019-01-14 10:24:50.767682: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-14 10:24:50.767695: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1
2019-01-14 10:24:50.767701: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N N
2019-01-14 10:24:50.767706: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: N N
2019-01-14 10:24:50.768625: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10168 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:0c:00.0, compute capability: 7.5)
2019-01-14 10:24:50.773317: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10168 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:0d:00.0, compute capability: 7.5)
I0114 10:24:56.086123 140179035752256 tf_logging.py:115] Running local_init_op.
I0114 10:24:57.728647 140179035752256 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 336.0 +/- 0.0 (jitter = 0.0) 9.968
10 images/sec: 333.8 +/- 2.0 (jitter = 3.9) 9.874
20 images/sec: 331.7 +/- 1.4 (jitter = 5.7) 9.975
30 images/sec: 331.1 +/- 1.0 (jitter = 4.7) 9.926
40 images/sec: 332.1 +/- 0.9 (jitter = 5.2) 9.879
50 images/sec: 331.2 +/- 0.8 (jitter = 5.3) 9.827
60 images/sec: 331.1 +/- 0.7 (jitter = 4.8) 10.041
70 images/sec: 330.6 +/- 0.6 (jitter = 5.3) 9.767
80 images/sec: 330.7 +/- 0.6 (jitter = 5.6) 9.959
90 images/sec: 330.7 +/- 0.6 (jitter = 5.3) 10.099
100 images/sec: 330.5 +/- 0.5 (jitter = 5.6) 9.871
----------------------------------------------------------------
total images/sec: 330.38
----------------------------------------------------------------
```

### 1X GeForce RTX 2080 Ti (FP16)

* Command Line

```bash
export CUDA_VISIBLE_DEVICES=4
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=40 \
    --model=resnet152 \
    --data_format=NCHW \
    --variable_update=replicated \
    --nodistortions \
    --use_fp16
```

* Output

```bash
...
TensorFlow: 1.12
Model: resnet152
Dataset: imagenet (synthetic)
Mode: BenchmarkMode.TRAIN
SingleSess: False
Batch size: 40 global
            40.0 per device
Num batches: 100
Num epochs: 0.00
Devices: ['/gpu:0']
Data format: NCHW
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating training model
Initializing graph
W0114 10:27:58.245692 140021949130560 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2157: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-01-14 10:27:59.638065: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-01-14 10:27:59.638142: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-14 10:27:59.638150: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-01-14 10:27:59.638156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-01-14 10:27:59.638413: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10168 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:0c:00.0, compute capability: 7.5)
I0114 10:28:01.636178 140021949130560 tf_logging.py:115] Running local_init_op.
I0114 10:28:01.808087 140021949130560 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 183.9 +/- 0.0 (jitter = 0.0) 9.857
10 images/sec: 187.8 +/- 0.5 (jitter = 1.2) 10.075
20 images/sec: 189.2 +/- 0.8 (jitter = 2.6) 10.146
30 images/sec: 188.8 +/- 0.5 (jitter = 2.0) 9.800
40 images/sec: 189.2 +/- 0.5 (jitter = 2.6) 10.111
50 images/sec: 189.4 +/- 0.4 (jitter = 2.5) 9.862
60 images/sec: 189.2 +/- 0.3 (jitter = 2.2) 10.208
70 images/sec: 189.0 +/- 0.3 (jitter = 2.1) 9.681
80 images/sec: 189.0 +/- 0.3 (jitter = 2.1) 10.123
90 images/sec: 188.9 +/- 0.3 (jitter = 2.2) 10.175
100 images/sec: 188.8 +/- 0.3 (jitter = 2.1) 9.838
----------------------------------------------------------------
total images/sec: 188.72
----------------------------------------------------------------
```

### 4X GeForce RTX 2080 Ti (FP32)

* Command Line

```bash
export CUDA_VISIBLE_DEVICES=4,5,6,7
python tf_cnn_benchmarks.py --num_gpus=4 --batch_size=40 \
    --model=resnet152 \
    --data_format=NCHW \
    --variable_update=replicated \
    --nodistortions \
    --nouse_fp16
```

* Output

```bash
...
TensorFlow: 1.12
Model: resnet152
Dataset: imagenet (synthetic)
Mode: BenchmarkMode.TRAIN
SingleSess: False
Batch size: 160 global
            40.0 per device
Num batches: 100
Num epochs: 0.01
Devices: ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']
Data format: NCHW
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating training model
Initializing graph
W0114 10:31:28.515695 140609529079616 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2157: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-01-14 10:31:33.855940: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2019-01-14 10:31:33.873226: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-14 10:31:33.873242: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2019-01-14 10:31:33.873249: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N N N N
2019-01-14 10:31:33.873253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: N N N N
2019-01-14 10:31:33.873257: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: N N N N
2019-01-14 10:31:33.873261: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: N N N N
2019-01-14 10:31:33.877902: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10168 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:0c:00.0, compute capability: 7.5)
2019-01-14 10:31:33.883196: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10168 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:0d:00.0, compute capability: 7.5)
2019-01-14 10:31:33.888513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10168 MB memory) -> physical GPU (device: 2, name: GeForce RTX 2080 Ti, pci bus id: 0000:0e:00.0, compute capability: 7.5)
2019-01-14 10:31:33.894109: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10168 MB memory) -> physical GPU (device: 3, name: GeForce RTX 2080 Ti, pci bus id: 0000:0f:00.0, compute capability: 7.5)
I0114 10:31:43.655935 140609529079616 tf_logging.py:115] Running local_init_op.
I0114 10:31:46.790667 140609529079616 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 282.5 +/- 0.0 (jitter = 0.0) 9.868
10 images/sec: 283.4 +/- 0.4 (jitter = 1.5) 9.824
20 images/sec: 283.7 +/- 0.3 (jitter = 1.7) 9.978
30 images/sec: 283.5 +/- 0.3 (jitter = 2.0) 9.960
40 images/sec: 283.4 +/- 0.2 (jitter = 2.0) 9.980
50 images/sec: 283.1 +/- 0.2 (jitter = 1.8) 9.801
60 images/sec: 282.9 +/- 0.2 (jitter = 1.6) 9.911
70 images/sec: 282.7 +/- 0.2 (jitter = 1.7) 9.863
80 images/sec: 282.5 +/- 0.2 (jitter = 1.7) 9.933
90 images/sec: 282.4 +/- 0.2 (jitter = 1.7) 9.830
100 images/sec: 282.3 +/- 0.2 (jitter = 1.6) 9.798
----------------------------------------------------------------
total images/sec: 282.24
----------------------------------------------------------------
```

### 2X GeForce RTX 2080 Ti (FP32)

* Command Line

```bash
export CUDA_VISIBLE_DEVICES=4,5
python tf_cnn_benchmarks.py --num_gpus=2 --batch_size=40 \
    --model=resnet152 \
    --data_format=NCHW \
    --variable_update=replicated \
    --nodistortions \
    --nouse_fp16
```

* Output

```bash
...
TensorFlow: 1.12
Model: resnet152
Dataset: imagenet (synthetic)
Mode: BenchmarkMode.TRAIN
SingleSess: False
Batch size: 80 global
            40.0 per device
Num batches: 100
Num epochs: 0.01
Devices: ['/gpu:0', '/gpu:1']
Data format: NCHW
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating training model
Initializing graph
W0114 10:34:25.444249 140291245438784 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2157: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-01-14 10:34:28.001151: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1
2019-01-14 10:34:28.003420: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-14 10:34:28.003436: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1
2019-01-14 10:34:28.003443: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N N
2019-01-14 10:34:28.003448: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: N N
2019-01-14 10:34:28.004899: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10168 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:0c:00.0, compute capability: 7.5)
2019-01-14 10:34:28.011071: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10168 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:0d:00.0, compute capability: 7.5)
I0114 10:34:33.320369 140291245438784 tf_logging.py:115] Running local_init_op.
I0114 10:34:34.740233 140291245438784 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 205.7 +/- 0.0 (jitter = 0.0) 9.789
10 images/sec: 206.1 +/- 0.3 (jitter = 1.3) 9.812
20 images/sec: 205.9 +/- 0.4 (jitter = 1.5) 9.996
30 images/sec: 205.9 +/- 0.3 (jitter = 1.5) 9.851
40 images/sec: 205.6 +/- 0.3 (jitter = 1.3) 10.102
50 images/sec: 205.5 +/- 0.2 (jitter = 1.3) 9.877
60 images/sec: 205.3 +/- 0.2 (jitter = 1.5) 9.866
70 images/sec: 205.2 +/- 0.2 (jitter = 1.4) 9.916
80 images/sec: 205.1 +/- 0.2 (jitter = 1.5) 9.897
90 images/sec: 205.1 +/- 0.2 (jitter = 1.5) 9.799
100 images/sec: 205.0 +/- 0.2 (jitter = 1.5) 9.787
----------------------------------------------------------------
total images/sec: 204.94
----------------------------------------------------------------
```

### 1X GeForce RTX 2080 Ti (FP32)

* Command Line

```bash
export CUDA_VISIBLE_DEVICES=4
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=40 \
    --model=resnet152 \
    --data_format=NCHW \
    --variable_update=replicated \
    --nodistortions \
    --nouse_fp16
```

* Output

```bash
...
TensorFlow: 1.12
Model: resnet152
Dataset: imagenet (synthetic)
Mode: BenchmarkMode.TRAIN
SingleSess: False
Batch size: 40 global
            40.0 per device
Num batches: 100
Num epochs: 0.00
Devices: ['/gpu:0']
Data format: NCHW
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating training model
Initializing graph
W0114 10:38:45.863283 140694985090880 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2157: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-01-14 10:38:47.438329: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-01-14 10:38:47.438394: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-14 10:38:47.438402: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-01-14 10:38:47.438408: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-01-14 10:38:47.439365: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10168 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:0c:00.0, compute capability: 7.5)
I0114 10:38:49.174678 140694985090880 tf_logging.py:115] Running local_init_op.
I0114 10:38:49.365254 140694985090880 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 115.1 +/- 0.0 (jitter = 0.0) 9.865
10 images/sec: 113.0 +/- 0.6 (jitter = 1.2) 9.741
20 images/sec: 112.8 +/- 0.4 (jitter = 1.5) 10.067
30 images/sec: 112.9 +/- 0.3 (jitter = 1.2) 9.834
40 images/sec: 112.9 +/- 0.2 (jitter = 1.1) 10.052
50 images/sec: 113.0 +/- 0.2 (jitter = 0.9) 9.889
60 images/sec: 113.0 +/- 0.2 (jitter = 1.0) 9.771
70 images/sec: 112.8 +/- 0.2 (jitter = 1.2) 9.697
80 images/sec: 112.6 +/- 0.2 (jitter = 1.3) 9.946
90 images/sec: 112.5 +/- 0.1 (jitter = 1.3) 9.611
100 images/sec: 112.3 +/- 0.1 (jitter = 1.6) 9.870
----------------------------------------------------------------
total images/sec: 112.24
----------------------------------------------------------------
```

### 4X V100 (FP16)

* Command Line

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
python tf_cnn_benchmarks.py --num_gpus=4 --batch_size=40 \
    --model=resnet152 \
    --data_format=NCHW \
    --variable_update=replicated \
    --nodistortions \
    --use_fp16
```

* Output

```bash
...
TensorFlow: 1.12
Model: resnet152
Dataset: imagenet (synthetic)
Mode: BenchmarkMode.TRAIN
SingleSess: False
Batch size: 160 global
            40.0 per device
Num batches: 100
Num epochs: 0.01
Devices: ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']
Data format: NCHW
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating training model
Initializing graph
W0114 10:47:45.852792 140483332290368 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2157: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-01-14 10:47:52.202071: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2019-01-14 10:47:52.204629: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-14 10:47:52.204641: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2019-01-14 10:47:52.204647: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y Y Y
2019-01-14 10:47:52.204669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N Y Y
2019-01-14 10:47:52.204673: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: Y Y N Y
2019-01-14 10:47:52.204678: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: Y Y Y N
2019-01-14 10:47:52.207206: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30378 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:04:00.0, compute capability: 7.0)
2019-01-14 10:47:52.212262: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30378 MB memory) -> physical GPU (device: 1, name: Tesla V100-PCIE-32GB, pci bus id: 0000:06:00.0, compute capability: 7.0)
2019-01-14 10:47:52.217073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 14846 MB memory) -> physical GPU (device: 2, name: Tesla V100-PCIE-16GB, pci bus id: 0000:07:00.0, compute capability: 7.0)
2019-01-14 10:47:52.220718: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 30378 MB memory) -> physical GPU (device: 3, name: Tesla V100-PCIE-32GB, pci bus id: 0000:08:00.0, compute capability: 7.0)
I0114 10:48:03.024006 140483332290368 tf_logging.py:115] Running local_init_op.
I0114 10:48:06.137128 140483332290368 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 698.3 +/- 0.0 (jitter = 0.0) 10.070
10 images/sec: 719.6 +/- 4.9 (jitter = 13.5) 9.957
20 images/sec: 720.6 +/- 3.1 (jitter = 13.6) 9.992
30 images/sec: 711.8 +/- 3.6 (jitter = 24.0) 9.895
40 images/sec: 711.0 +/- 3.5 (jitter = 24.3) 9.861
50 images/sec: 710.3 +/- 3.0 (jitter = 23.3) 9.889
60 images/sec: 710.1 +/- 2.7 (jitter = 22.5) 9.974
70 images/sec: 711.8 +/- 2.5 (jitter = 21.8) 9.875
80 images/sec: 710.3 +/- 2.5 (jitter = 21.8) 9.925
90 images/sec: 710.7 +/- 2.3 (jitter = 21.3) 9.980
100 images/sec: 710.8 +/- 2.2 (jitter = 20.1) 9.956
----------------------------------------------------------------
total images/sec: 710.49
----------------------------------------------------------------
```

### 2X V100 (FP16)

* Command Line

```bash
export CUDA_VISIBLE_DEVICES=0,1
python tf_cnn_benchmarks.py --num_gpus=2 --batch_size=40 \
    --model=resnet152 \
    --data_format=NCHW \
    --variable_update=replicated \
    --nodistortions \
    --use_fp16
```

* Output

```bash
...
TensorFlow: 1.12
Model: resnet152
Dataset: imagenet (synthetic)
Mode: BenchmarkMode.TRAIN
SingleSess: False
Batch size: 80 global
            40.0 per device
Num batches: 100
Num epochs: 0.01
Devices: ['/gpu:0', '/gpu:1']
Data format: NCHW
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating training model
Initializing graph
W0114 10:50:57.990253 140520091227968 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2157: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-01-14 10:51:01.123257: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1
2019-01-14 10:51:01.124415: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-14 10:51:01.124431: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1
2019-01-14 10:51:01.124438: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y
2019-01-14 10:51:01.124443: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N
2019-01-14 10:51:01.126344: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30378 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:04:00.0, compute capability: 7.0)
2019-01-14 10:51:01.133181: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30378 MB memory) -> physical GPU (device: 1, name: Tesla V100-PCIE-32GB, pci bus id: 0000:06:00.0, compute capability: 7.0)
I0114 10:51:07.232126 140520091227968 tf_logging.py:115] Running local_init_op.
I0114 10:51:08.641624 140520091227968 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 451.3 +/- 0.0 (jitter = 0.0) 9.972
10 images/sec: 442.4 +/- 3.0 (jitter = 10.1) 9.875
20 images/sec: 437.3 +/- 2.6 (jitter = 11.4) 9.991
30 images/sec: 437.7 +/- 1.9 (jitter = 10.6) 9.940
40 images/sec: 435.9 +/- 1.7 (jitter = 11.5) 9.903
50 images/sec: 435.1 +/- 1.7 (jitter = 11.4) 9.852
60 images/sec: 435.0 +/- 1.5 (jitter = 11.6) 10.081
70 images/sec: 434.4 +/- 1.3 (jitter = 10.8) 9.750
80 images/sec: 433.5 +/- 1.3 (jitter = 11.2) 9.976
90 images/sec: 434.0 +/- 1.3 (jitter = 11.1) 10.096
100 images/sec: 433.9 +/- 1.3 (jitter = 12.0) 9.894
----------------------------------------------------------------
total images/sec: 433.70
----------------------------------------------------------------
```

### 1X V100 (FP16)

* Command Line

```bash
export CUDA_VISIBLE_DEVICES=0
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=40 \
    --model=resnet152 \
    --data_format=NCHW \
    --variable_update=replicated \
    --nodistortions \
    --use_fp16
```

* Output

```bash
...
TensorFlow: 1.12
Model: resnet152
Dataset: imagenet (synthetic)
Mode: BenchmarkMode.TRAIN
SingleSess: False
Batch size: 40 global
            40.0 per device
Num batches: 100
Num epochs: 0.00
Devices: ['/gpu:0']
Data format: NCHW
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating training model
Initializing graph
W0114 10:52:54.713240 139707288196928 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2157: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-01-14 10:52:56.445768: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-01-14 10:52:56.445845: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-14 10:52:56.445854: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-01-14 10:52:56.445861: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-01-14 10:52:56.446789: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30378 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:04:00.0, compute capability: 7.0)
I0114 10:52:59.471376 139707288196928 tf_logging.py:115] Running local_init_op.
I0114 10:52:59.798634 139707288196928 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 233.3 +/- 0.0 (jitter = 0.0) 9.861
10 images/sec: 242.1 +/- 1.7 (jitter = 3.4) 10.049
20 images/sec: 239.6 +/- 1.6 (jitter = 3.6) 10.145
30 images/sec: 239.3 +/- 1.1 (jitter = 2.7) 9.811
40 images/sec: 239.2 +/- 1.1 (jitter = 3.4) 10.123
50 images/sec: 239.0 +/- 0.9 (jitter = 2.7) 9.817
60 images/sec: 239.2 +/- 0.8 (jitter = 3.3) 10.192
70 images/sec: 239.8 +/- 0.8 (jitter = 3.3) 9.654
80 images/sec: 239.6 +/- 0.9 (jitter = 3.6) 10.112
90 images/sec: 239.9 +/- 0.9 (jitter = 3.7) 10.224
100 images/sec: 240.1 +/- 0.8 (jitter = 3.8) 9.839
----------------------------------------------------------------
total images/sec: 239.86
----------------------------------------------------------------
```

### 4X V100 (FP32)

* Command Line

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
python tf_cnn_benchmarks.py --num_gpus=4 --batch_size=40 \
    --model=resnet152 \
    --data_format=NCHW \
    --variable_update=replicated \
    --nodistortions \
    --nouse_fp16
```

* Output

```bash
...
Model: resnet152
Dataset: imagenet (synthetic)
Mode: BenchmarkMode.TRAIN
SingleSess: False
Batch size: 160 global
            40.0 per device
Num batches: 100
Num epochs: 0.01
Devices: ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']
Data format: NCHW
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating training model
Initializing graph
W0114 10:55:41.717653 140648141563712 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2157: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-01-14 10:55:47.053235: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2019-01-14 10:55:47.061802: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-14 10:55:47.061839: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2019-01-14 10:55:47.061854: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y Y Y
2019-01-14 10:55:47.061866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N Y Y
2019-01-14 10:55:47.061883: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: Y Y N Y
2019-01-14 10:55:47.061898: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: Y Y Y N
2019-01-14 10:55:47.065978: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30378 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:04:00.0, compute capability: 7.0)
2019-01-14 10:55:47.071735: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30378 MB memory) -> physical GPU (device: 1, name: Tesla V100-PCIE-32GB, pci bus id: 0000:06:00.0, compute capability: 7.0)
2019-01-14 10:55:47.077161: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 14846 MB memory) -> physical GPU (device: 2, name: Tesla V100-PCIE-16GB, pci bus id: 0000:07:00.0, compute capability: 7.0)
2019-01-14 10:55:47.083051: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 30378 MB memory) -> physical GPU (device: 3, name: Tesla V100-PCIE-32GB, pci bus id: 0000:08:00.0, compute capability: 7.0)
I0114 10:55:57.681849 140648141563712 tf_logging.py:115] Running local_init_op.
I0114 10:56:00.601779 140648141563712 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 419.4 +/- 0.0 (jitter = 0.0) 9.840
10 images/sec: 420.1 +/- 1.5 (jitter = 5.6) 9.836
20 images/sec: 418.6 +/- 1.2 (jitter = 6.6) 9.980
30 images/sec: 417.4 +/- 1.0 (jitter = 6.2) 9.957
40 images/sec: 416.5 +/- 0.8 (jitter = 5.8) 9.959
50 images/sec: 417.1 +/- 0.7 (jitter = 5.6) 9.787
60 images/sec: 417.5 +/- 0.7 (jitter = 5.5) 9.930
70 images/sec: 417.6 +/- 0.6 (jitter = 5.7) 9.836
80 images/sec: 417.6 +/- 0.6 (jitter = 5.6) 9.950
90 images/sec: 417.9 +/- 0.5 (jitter = 5.3) 9.857
100 images/sec: 418.0 +/- 0.5 (jitter = 5.1) 9.797
----------------------------------------------------------------
total images/sec: 417.89
----------------------------------------------------------------
```

### 2X V100 (FP32)

* Command Line

```bash
export CUDA_VISIBLE_DEVICES=0,1
python tf_cnn_benchmarks.py --num_gpus=2 --batch_size=40 \
    --model=resnet152 \
    --data_format=NCHW \
    --variable_update=replicated \
    --nodistortions \
    --nouse_fp16
```

* Output

```bash
...
TensorFlow: 1.12
Model: resnet152
Dataset: imagenet (synthetic)
Mode: BenchmarkMode.TRAIN
SingleSess: False
Batch size: 80 global
            40.0 per device
Num batches: 100
Num epochs: 0.01
Devices: ['/gpu:0', '/gpu:1']
Data format: NCHW
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating training model
Initializing graph
W0114 10:59:46.353019 140055177267008 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2157: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-01-14 10:59:48.894146: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1
2019-01-14 10:59:48.895102: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-14 10:59:48.895117: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1
2019-01-14 10:59:48.895124: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y
2019-01-14 10:59:48.895129: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N
2019-01-14 10:59:48.897894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30378 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:04:00.0, compute capability: 7.0)
2019-01-14 10:59:48.902274: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30378 MB memory) -> physical GPU (device: 1, name: Tesla V100-PCIE-32GB, pci bus id: 0000:06:00.0, compute capability: 7.0)
I0114 10:59:54.778629 140055177267008 tf_logging.py:115] Running local_init_op.
I0114 10:59:56.147655 140055177267008 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 240.6 +/- 0.0 (jitter = 0.0) 9.819
10 images/sec: 238.5 +/- 0.7 (jitter = 1.6) 9.804
20 images/sec: 238.2 +/- 0.5 (jitter = 2.1) 10.034
30 images/sec: 238.4 +/- 0.3 (jitter = 2.0) 9.844
40 images/sec: 238.3 +/- 0.3 (jitter = 2.0) 10.095
50 images/sec: 237.9 +/- 0.3 (jitter = 2.5) 9.891
60 images/sec: 238.0 +/- 0.3 (jitter = 2.4) 9.858
70 images/sec: 237.8 +/- 0.3 (jitter = 2.6) 9.949
80 images/sec: 237.6 +/- 0.3 (jitter = 2.5) 9.899
90 images/sec: 237.5 +/- 0.2 (jitter = 2.7) 9.777
100 images/sec: 237.5 +/- 0.2 (jitter = 2.6) 9.791
----------------------------------------------------------------
total images/sec: 237.43
----------------------------------------------------------------
```

### 1X V100 (FP32)

* Command Line

```bash
export CUDA_VISIBLE_DEVICES=0
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=40 \
    --model=resnet152 \
    --data_format=NCHW \
    --variable_update=replicated \
    --nodistortions \
    --nouse_fp16
```

* Output

```bash
...
TensorFlow: 1.12
Model: resnet152
Dataset: imagenet (synthetic)
Mode: BenchmarkMode.TRAIN
SingleSess: False
Batch size: 40 global
            40.0 per device
Num batches: 100
Num epochs: 0.00
Devices: ['/gpu:0']
Data format: NCHW
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating training model
Initializing graph
W0114 11:01:48.743038 139865749051200 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2157: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-01-14 11:01:50.330534: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-01-14 11:01:50.331216: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-14 11:01:50.331783: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-01-14 11:01:50.331805: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-01-14 11:01:50.332710: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30378 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:04:00.0, compute capability: 7.0)
I0114 11:01:52.607747 139865749051200 tf_logging.py:115] Running local_init_op.
I0114 11:01:52.804631 139865749051200 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 122.4 +/- 0.0 (jitter = 0.0) 9.924
10 images/sec: 123.3 +/- 0.6 (jitter = 2.2) 9.732
20 images/sec: 124.4 +/- 0.4 (jitter = 1.6) 10.058
30 images/sec: 124.6 +/- 0.3 (jitter = 1.0) 9.818
40 images/sec: 124.9 +/- 0.2 (jitter = 0.9) 10.044
50 images/sec: 125.1 +/- 0.2 (jitter = 1.0) 9.893
60 images/sec: 125.0 +/- 0.2 (jitter = 1.1) 9.798
70 images/sec: 125.1 +/- 0.2 (jitter = 1.1) 9.733
80 images/sec: 125.1 +/- 0.2 (jitter = 1.1) 9.947
90 images/sec: 125.1 +/- 0.1 (jitter = 1.1) 9.631
100 images/sec: 125.1 +/- 0.1 (jitter = 1.2) 9.861
----------------------------------------------------------------
total images/sec: 125.05
----------------------------------------------------------------
```

# Summary

#### GeForce RTX 2080 Ti

| Model | Batch Size (per GPU) | Data | Precision | #GPUs | Images/sec | Speed-up | Efficiency |
|------------|:------------:|----------------------|:-----------:|:-------:|:-------------:|:---:|:---:|
| ResNet152 | 40 | ImageNet (Synthetic) | FP16+32 | 1 | 188.72 | 1x | - |
| ResNet152 | 40 | ImageNet (Synthetic) | FP16+32 | 2 | 330.38 | 1.75x | 88% |
| ResNet152 | 40 | ImageNet (Synthetic) | FP16+32 | 4 | 430.91 | 2.28x | 57% |

| Model | Batch Size (per GPU) | Data | Precision | #GPUs | Images/sec | Speed-up | Efficiency |
|------------|:------------:|----------------------|:-----------:|:-------:|:-------------:|:---:|:---:|
| ResNet152 | 40 | ImageNet (Synthetic) | FP32 | 1 | 112.24 | 1x | - |
| ResNet152 | 40 | ImageNet (Synthetic) | FP32 | 2 | 204.94 | 1.83x | 91% |
| ResNet152 | 40 | ImageNet (Synthetic) | FP32 | 4 | 282.24 | 2.51x | 63% |

#### Tesla V100

| Model | Batch Size (per GPU) | Data | Precision | #GPUs | Images/sec | Speed-up | Efficiency |
|------------|:------------:|----------------------|:-----------:|:-------:|:-------------:|:---:|:---:|
| ResNet152 | 40 | ImageNet (Synthetic) | FP16+32 | 1 | 239.86 | 1x | - |
| ResNet152 | 40 | ImageNet (Synthetic) | FP16+32 | 2 | 433.70 | 1.81x | 90% |
| ResNet152 | 40 | ImageNet (Synthetic) | FP16+32 | 4 | 710.49 | 2.96x | 74% |

| Model | Batch Size (per GPU) | Data | Precision | #GPUs | Images/sec | Speed-up | Efficiency |
|------------|:------------:|----------------------|:-----------:|:-------:|:-------------:|:---:|:---:|
| ResNet152 | 40 | ImageNet (Synthetic) | FP32 | 1 | 125.05 | 1x | - |
| ResNet152 | 40 | ImageNet (Synthetic) | FP32 | 2 | 237.43 | 1.90x | 95% |
| ResNet152 | 40 | ImageNet (Synthetic) | FP32 | 4 | 417.89 | 3.34x | 84% |
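
The Speed-up and Efficiency columns can be recomputed from the `total images/sec` figures. The sketch below assumes that speed-up is the N-GPU throughput divided by the 1-GPU throughput, and that efficiency is speed-up divided by N. These definitions are inferred from the numbers in the tables rather than stated by the benchmark tool, and the `scaling` helper and `RESULTS` dictionary are hypothetical names introduced for illustration.

```python
# Minimal sketch: recompute speed-up and scaling efficiency from throughput.
# Assumed definitions (inferred from the tables, not from tf_cnn_benchmarks):
#   speed-up   = images/sec on N GPUs / images/sec on 1 GPU
#   efficiency = speed-up / N


def scaling(throughputs):
    """Return (num_gpus, images_per_sec, speedup, efficiency) rows.

    `throughputs` maps GPU count -> total images/sec and must contain key 1,
    which is used as the single-GPU baseline.
    """
    base = throughputs[1]
    rows = []
    for n in sorted(throughputs):
        speedup = throughputs[n] / base
        efficiency = speedup / n
        rows.append((n, throughputs[n], speedup, efficiency))
    return rows


# Throughput values copied from the summary tables.
RESULTS = {
    ("RTX 2080 Ti", "FP16+32"): {1: 188.72, 2: 330.38, 4: 430.91},
    ("RTX 2080 Ti", "FP32"): {1: 112.24, 2: 204.94, 4: 282.24},
    ("Tesla V100", "FP16+32"): {1: 239.86, 2: 433.70, 4: 710.49},
    ("Tesla V100", "FP32"): {1: 125.05, 2: 237.43, 4: 417.89},
}

for (gpu, precision), throughputs in RESULTS.items():
    for n, ips, speedup, efficiency in scaling(throughputs):
        print(
            f"{gpu:<12} {precision:<8} {n} GPU(s): {ips:7.2f} images/sec, "
            f"speed-up {speedup:.2f}x, efficiency {efficiency:.0%}"
        )
```

For example, four V100s in FP32 give 417.89 / 125.05 ≈ 3.34x, which is about 84% of ideal linear scaling.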