# RTX 2080 Ti, Quadro RTX 8000, and V100 Benchmarks
All experiments use synthetic ImageNet data.
* Hardware
  * LeadTek GS4820A
  * 4X GeForce RTX 2080 Ti + 4X Tesla V100
* Software
  * The environment is packaged as a Docker image, `honghu/keras:tf-cu10.0-dnn7.4-py3-avx2-19.01`, which is [available on DockerHub](https://hub.docker.com/r/honghu/keras/).

We run the benchmarks using the official TensorFlow benchmarks repo:
```bash
git clone --branch cnn_tf_v1.12_compatible https://github.com/tensorflow/benchmarks.git
cd benchmarks/scripts/tf_cnn_benchmarks
```
### 4X GeForce RTX 2080 Ti (FP16)
* Command Line
```bash
export CUDA_VISIBLE_DEVICES=4,5,6,7
python tf_cnn_benchmarks.py --num_gpus=4 --batch_size=40 \
--model=resnet152 \
--data_format=NCHW \
--variable_update=replicated \
--nodistortions \
--use_fp16
```
* Output
```bash
...
TensorFlow: 1.12
Model: resnet152
Dataset: imagenet (synthetic)
Mode: BenchmarkMode.TRAIN
SingleSess: False
Batch size: 160 global
40.0 per device
Num batches: 100
Num epochs: 0.01
Devices: ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']
Data format: NCHW
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating training model
Initializing graph
W0114 10:15:47.726546 140230479963968 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2157: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-01-14 10:15:53.902021: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2019-01-14 10:15:53.916469: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-14 10:15:53.916481: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2019-01-14 10:15:53.916488: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N N N N
2019-01-14 10:15:53.916493: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: N N N N
2019-01-14 10:15:53.916498: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: N N N N
2019-01-14 10:15:53.916502: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: N N N N
2019-01-14 10:15:53.918515: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10168 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:0c:00.0, compute capability: 7.5)
2019-01-14 10:15:53.923486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10168 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:0d:00.0, compute capability: 7.5)
2019-01-14 10:15:53.930171: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10168 MB memory) -> physical GPU (device: 2, name: GeForce RTX 2080 Ti, pci bus id: 0000:0e:00.0, compute capability: 7.5)
2019-01-14 10:15:53.937095: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10168 MB memory) -> physical GPU (device: 3, name: GeForce RTX 2080 Ti, pci bus id: 0000:0f:00.0, compute capability: 7.5)
I0114 10:16:04.913285 140230479963968 tf_logging.py:115] Running local_init_op.
I0114 10:16:08.349156 140230479963968 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 432.3 +/- 0.0 (jitter = 0.0) 10.080
10 images/sec: 431.3 +/- 0.8 (jitter = 1.8) 9.924
20 images/sec: 431.6 +/- 0.6 (jitter = 2.6) 9.975
30 images/sec: 431.0 +/- 0.5 (jitter = 1.8) 9.900
40 images/sec: 431.5 +/- 0.5 (jitter = 2.1) 9.846
50 images/sec: 431.4 +/- 0.5 (jitter = 2.2) 9.870
60 images/sec: 431.0 +/- 0.4 (jitter = 2.5) 9.968
70 images/sec: 431.0 +/- 0.5 (jitter = 3.2) 9.862
80 images/sec: 431.3 +/- 0.5 (jitter = 3.6) 9.932
90 images/sec: 431.2 +/- 0.4 (jitter = 3.3) 9.967
100 images/sec: 431.0 +/- 0.4 (jitter = 3.0) 9.935
----------------------------------------------------------------
total images/sec: 430.91
----------------------------------------------------------------
```
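When collecting results from several runs, it can help to pull the summary figure out of the log programmatically. The following is a minimal sketch and not part of `tf_cnn_benchmarks`: `parse_total_images_per_sec` is a hypothetical helper, and it assumes the log contains a `total images/sec:` line formatted as in the output above.

```python
import re
from typing import Optional

# Matches the summary line printed at the end of a run,
# e.g. "total images/sec: 430.91".
_TOTAL_RE = re.compile(
    r"^total images/sec:\s*([0-9]+(?:\.[0-9]+)?)\s*$", re.MULTILINE
)


def parse_total_images_per_sec(log_text: str) -> Optional[float]:
    """Return the last reported total throughput, or None if no summary line exists."""
    matches = _TOTAL_RE.findall(log_text)
    return float(matches[-1]) if matches else None


# Tail of the 4X GeForce RTX 2080 Ti (FP16) log shown above.
sample_log = """\
100 images/sec: 431.0 +/- 0.4 (jitter = 3.0) 9.935
----------------------------------------------------------------
total images/sec: 430.91
----------------------------------------------------------------
"""

total = parse_total_images_per_sec(sample_log)
print(total)
```

The per-step lines also contain `images/sec:`, so the pattern is anchored on the word `total` at the start of a line to avoid matching them.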
### 4X Quadro RTX 8000 (FP16)
* Command Line
```bash
export CUDA_VISIBLE_DEVICES=4,5,6,7
python tf_cnn_benchmarks.py --num_gpus=4 --batch_size=40 \
--model=resnet152 \
--data_format=NCHW \
--variable_update=replicated \
--nodistortions \
--use_fp16
```
* Output
```bash
...
TensorFlow: 1.12
Model: resnet152
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 160 global
40.0 per device
Num batches: 100
Num epochs: 0.01
Devices: ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']
Data format: NCHW
Layout optimizer: False
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating model
W0306 08:43:44.548509 139684100622144 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:1761: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-03-06 08:43:51.797984: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0c:00.0
totalMemory: 47.43GiB freeMemory: 47.23GiB
2019-03-06 08:43:52.348729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0d:00.0
totalMemory: 47.43GiB freeMemory: 47.23GiB
2019-03-06 08:43:52.925093: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 2 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0e:00.0
totalMemory: 47.43GiB freeMemory: 47.23GiB
2019-03-06 08:43:53.639499: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 3 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0f:00.0
totalMemory: 47.43GiB freeMemory: 47.23GiB
2019-03-06 08:43:53.718676: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2019-03-06 08:43:56.796882: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-06 08:43:56.796930: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2019-03-06 08:43:56.796938: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y Y Y
2019-03-06 08:43:56.796943: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N Y Y
2019-03-06 08:43:56.796948: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: Y Y N Y
2019-03-06 08:43:56.796953: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: Y Y Y N
2019-03-06 08:43:56.818119: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 45863 MB memory) -> physical GPU (device: 0, name: Quadro RTX 8000, pci bus id: 0000:0c:00.0, compute capability: 7.5)
2019-03-06 08:43:56.820568: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 45863 MB memory) -> physical GPU (device: 1, name: Quadro RTX 8000, pci bus id: 0000:0d:00.0, compute capability: 7.5)
2019-03-06 08:43:56.821078: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 45863 MB memory) -> physical GPU (device: 2, name: Quadro RTX 8000, pci bus id: 0000:0e:00.0, compute capability: 7.5)
2019-03-06 08:43:56.821575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 45863 MB memory) -> physical GPU (device: 3, name: Quadro RTX 8000, pci bus id: 0000:0f:00.0, compute capability: 7.5)
I0306 08:44:08.992272 139684100622144 tf_logging.py:115] Running local_init_op.
I0306 08:44:13.005366 139684100622144 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 779.5 +/- 0.0 (jitter = 0.0) 9.891
10 images/sec: 746.1 +/- 8.7 (jitter = 22.5) 9.761
20 images/sec: 744.8 +/- 4.8 (jitter = 22.0) 9.710
30 images/sec: 748.3 +/- 4.7 (jitter = 20.6) 9.929
40 images/sec: 750.6 +/- 3.7 (jitter = 18.2) 9.950
50 images/sec: 751.1 +/- 3.2 (jitter = 17.5) 9.790
60 images/sec: 751.3 +/- 2.7 (jitter = 14.7) 9.806
70 images/sec: 751.1 +/- 2.4 (jitter = 15.8) 9.851
80 images/sec: 750.6 +/- 2.4 (jitter = 15.6) 9.820
90 images/sec: 750.2 +/- 2.2 (jitter = 16.7) 9.871
100 images/sec: 750.5 +/- 2.0 (jitter = 17.0) 9.786
----------------------------------------------------------------
total images/sec: 750.02
----------------------------------------------------------------
```
### 4X Tesla V100 PCIe (3X 32GB + 1X 16GB) (FP16)
* Command Line
```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
python tf_cnn_benchmarks.py --num_gpus=4 --batch_size=40 \
--model=resnet152 \
--data_format=NCHW \
--variable_update=replicated \
--nodistortions \
--use_fp16
```
* Output
```bash
...
TensorFlow: 1.12
Model: resnet152
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 160 global
40.0 per device
Num batches: 100
Num epochs: 0.01
Devices: ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']
Data format: NCHW
Layout optimizer: False
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating model
W0306 08:50:32.981509 140223759439680 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:1761: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-03-06 08:50:41.077971: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:04:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-03-06 08:50:41.921124: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:06:00.0
totalMemory: 15.75GiB freeMemory: 15.34GiB
2019-03-06 08:50:42.780290: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 2 with properties:
name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:07:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-03-06 08:50:43.709868: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 3 with properties:
name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:08:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-03-06 08:50:43.782253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2019-03-06 08:50:45.838085: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-06 08:50:45.838139: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2019-03-06 08:50:45.838146: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y Y Y
2019-03-06 08:50:45.838150: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N Y Y
2019-03-06 08:50:45.838155: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: Y Y N Y
2019-03-06 08:50:45.838160: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: Y Y Y N
2019-03-06 08:50:45.846207: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30378 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:04:00.0, compute capability: 7.0)
2019-03-06 08:50:45.848779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 14846 MB memory) -> physical GPU (device: 1, name: Tesla V100-PCIE-16GB, pci bus id: 0000:06:00.0, compute capability: 7.0)
2019-03-06 08:50:45.849314: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 30378 MB memory) -> physical GPU (device: 2, name: Tesla V100-PCIE-32GB, pci bus id: 0000:07:00.0, compute capability: 7.0)
2019-03-06 08:50:45.849784: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 30378 MB memory) -> physical GPU (device: 3, name: Tesla V100-PCIE-32GB, pci bus id: 0000:08:00.0, compute capability: 7.0)
I0306 08:50:56.379636 140223759439680 tf_logging.py:115] Running local_init_op.
I0306 08:51:00.715517 140223759439680 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 736.6 +/- 0.0 (jitter = 0.0) 9.910
10 images/sec: 807.3 +/- 10.0 (jitter = 24.3) 9.756
20 images/sec: 810.1 +/- 8.2 (jitter = 28.5) 9.725
30 images/sec: 820.2 +/- 6.6 (jitter = 19.1) 9.924
40 images/sec: 814.2 +/- 6.2 (jitter = 33.7) 9.938
50 images/sec: 808.9 +/- 5.7 (jitter = 33.0) 9.798
60 images/sec: 811.7 +/- 5.1 (jitter = 36.4) 9.818
70 images/sec: 808.6 +/- 4.8 (jitter = 35.1) 9.846
80 images/sec: 808.5 +/- 4.4 (jitter = 33.0) 9.831
90 images/sec: 806.9 +/- 4.2 (jitter = 33.4) 9.909
100 images/sec: 805.3 +/- 4.0 (jitter = 33.5) 9.778
----------------------------------------------------------------
total images/sec: 804.73
----------------------------------------------------------------
```
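The three 4-GPU FP16 runs above share the same command line (batch size 40 per device), so their totals can be put side by side. The sketch below only does the arithmetic; `relative_to` is a hypothetical helper. Treat the ratios as rough: the 2080 Ti run dates from January and reports no peer access between GPUs (all `N` in the interconnect matrix), while the two March runs report `Y`, and the `benchmark_cnn.py` line numbers in the warnings suggest a different script revision.

```python
# Totals copied from the "total images/sec" lines of the three runs above.
totals = {
    "GeForce RTX 2080 Ti": 430.91,
    "Quadro RTX 8000": 750.02,
    "Tesla V100 PCIe": 804.73,
}


def relative_to(baseline: str, results: dict) -> dict:
    """Express every result as a multiple of the baseline entry."""
    base = results[baseline]
    return {name: value / base for name, value in results.items()}


ratios = relative_to("GeForce RTX 2080 Ti", totals)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```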
### 4X Quadro RTX 8000 (FP16, batch size 256)
* Command Line
```bash
export CUDA_VISIBLE_DEVICES=4,5,6,7
python tf_cnn_benchmarks.py --num_gpus=4 --batch_size=256 \
--model=resnet152 \
--data_format=NCHW \
--variable_update=replicated \
--nodistortions \
--use_fp16
```
* Output
```bash
TensorFlow: 1.12
Model: resnet152
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 1024 global
256.0 per device
Num batches: 100
Num epochs: 0.08
Devices: ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']
Data format: NCHW
Layout optimizer: False
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating model
W0306 08:55:52.772576 139757467600704 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:1761: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-03-06 08:56:01.631109: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0c:00.0
totalMemory: 47.43GiB freeMemory: 47.23GiB
2019-03-06 08:56:02.232898: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0d:00.0
totalMemory: 47.43GiB freeMemory: 47.23GiB
2019-03-06 08:56:03.182707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 2 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0e:00.0
totalMemory: 47.43GiB freeMemory: 47.23GiB
2019-03-06 08:56:04.382598: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 3 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0f:00.0
totalMemory: 47.43GiB freeMemory: 47.23GiB
2019-03-06 08:56:04.436815: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2019-03-06 08:56:06.766942: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-06 08:56:06.766985: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2019-03-06 08:56:06.766992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y Y Y
2019-03-06 08:56:06.766997: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N Y Y
2019-03-06 08:56:06.767001: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: Y Y N Y
2019-03-06 08:56:06.767005: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: Y Y Y N
2019-03-06 08:56:06.774166: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 45863 MB memory) -> physical GPU (device: 0, name: Quadro RTX 8000, pci bus id: 0000:0c:00.0, compute capability: 7.5)
2019-03-06 08:56:06.775358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 45863 MB memory) -> physical GPU (device: 1, name: Quadro RTX 8000, pci bus id: 0000:0d:00.0, compute capability: 7.5)
2019-03-06 08:56:06.775978: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 45863 MB memory) -> physical GPU (device: 2, name: Quadro RTX 8000, pci bus id: 0000:0e:00.0, compute capability: 7.5)
2019-03-06 08:56:06.776465: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 45863 MB memory) -> physical GPU (device: 3, name: Quadro RTX 8000, pci bus id: 0000:0f:00.0, compute capability: 7.5)
I0306 08:56:19.953084 139757467600704 tf_logging.py:115] Running local_init_op.
I0306 08:56:28.303954 139757467600704 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 1013.0 +/- 0.0 (jitter = 0.0) 9.850
10 images/sec: 1007.5 +/- 1.2 (jitter = 3.1) 9.880
20 images/sec: 1005.7 +/- 0.8 (jitter = 4.1) 9.823
30 images/sec: 1003.9 +/- 0.8 (jitter = 4.4) 9.780
40 images/sec: 1002.4 +/- 0.7 (jitter = 4.9) 9.705
50 images/sec: 1000.6 +/- 0.8 (jitter = 5.1) 9.689
60 images/sec: 999.8 +/- 0.7 (jitter = 5.0) 9.654
70 images/sec: 999.0 +/- 0.7 (jitter = 5.6) 9.625
80 images/sec: 998.0 +/- 0.7 (jitter = 5.5) 9.599
90 images/sec: 996.9 +/- 0.7 (jitter = 6.6) 9.580
100 images/sec: 996.3 +/- 0.7 (jitter = 6.3) 9.559
----------------------------------------------------------------
total images/sec: 996.13
----------------------------------------------------------------
```
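The `Num epochs` value in each log header changes with the batch size even though every run executes 100 batches. A plausible reading, stated here as an assumption rather than taken from the benchmark source, is that it equals the number of images processed divided by the ImageNet-1k training set size (1,281,167 images). The sketch below checks that reading against the headers in this document; `epochs_covered` is a hypothetical helper.

```python
# ImageNet-1k training set size; assumed to be the denominator the benchmark uses.
IMAGENET_TRAIN_IMAGES = 1_281_167


def epochs_covered(global_batch_size: int, num_batches: int,
                   dataset_size: int = IMAGENET_TRAIN_IMAGES) -> float:
    """Fraction of the dataset seen after num_batches steps."""
    return global_batch_size * num_batches / dataset_size


# Global batch sizes taken from the log headers in this document.
for global_batch in (40, 80, 160, 384, 1024, 1920):
    print(global_batch, f"{epochs_covered(global_batch, 100):.2f}")
```

Under this assumption, a global batch of 1024 gives about 0.08 epochs and a global batch of 160 gives about 0.01, which agrees with the values printed in the logs.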
### 4X Quadro RTX 8000 (FP16, batch size 96)
* Command Line
```bash
export CUDA_VISIBLE_DEVICES=4,5,6,7
python tf_cnn_benchmarks.py --num_gpus=4 --batch_size=96 \
--model=resnet152 \
--data_format=NCHW \
--variable_update=replicated \
--nodistortions \
--use_fp16
```
* Output
```bash
TensorFlow: 1.12
Model: resnet152
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 384 global
96.0 per device
Num batches: 100
Num epochs: 0.03
Devices: ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']
Data format: NCHW
Layout optimizer: False
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating model
W0306 09:02:11.327498 140704704161600 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:1761: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-03-06 09:02:18.744726: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0c:00.0
totalMemory: 47.43GiB freeMemory: 47.23GiB
2019-03-06 09:02:19.604581: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0d:00.0
totalMemory: 47.43GiB freeMemory: 47.23GiB
2019-03-06 09:02:20.804147: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 2 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0e:00.0
totalMemory: 47.43GiB freeMemory: 47.23GiB
2019-03-06 09:02:21.381372: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 3 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0f:00.0
totalMemory: 47.43GiB freeMemory: 47.23GiB
2019-03-06 09:02:21.416588: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2019-03-06 09:02:23.722670: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-06 09:02:23.722729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2019-03-06 09:02:23.722750: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y Y Y
2019-03-06 09:02:23.722766: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N Y Y
2019-03-06 09:02:23.722780: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: Y Y N Y
2019-03-06 09:02:23.722793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: Y Y Y N
2019-03-06 09:02:23.748770: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 45863 MB memory) -> physical GPU (device: 0, name: Quadro RTX 8000, pci bus id: 0000:0c:00.0, compute capability: 7.5)
2019-03-06 09:02:23.750335: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 45863 MB memory) -> physical GPU (device: 1, name: Quadro RTX 8000, pci bus id: 0000:0d:00.0, compute capability: 7.5)
2019-03-06 09:02:23.751031: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 45863 MB memory) -> physical GPU (device: 2, name: Quadro RTX 8000, pci bus id: 0000:0e:00.0, compute capability: 7.5)
2019-03-06 09:02:23.751535: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 45863 MB memory) -> physical GPU (device: 3, name: Quadro RTX 8000, pci bus id: 0000:0f:00.0, compute capability: 7.5)
I0306 09:02:34.783475 140704704161600 tf_logging.py:115] Running local_init_op.
I0306 09:02:39.586142 140704704161600 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 877.0 +/- 0.0 (jitter = 0.0) 9.871
10 images/sec: 876.2 +/- 1.8 (jitter = 1.9) 9.873
20 images/sec: 878.4 +/- 1.3 (jitter = 4.2) 9.847
30 images/sec: 876.8 +/- 1.1 (jitter = 4.9) 9.839
40 images/sec: 874.9 +/- 1.0 (jitter = 5.8) 9.806
50 images/sec: 874.3 +/- 0.9 (jitter = 6.1) 9.893
60 images/sec: 873.2 +/- 0.9 (jitter = 6.6) 9.823
70 images/sec: 872.3 +/- 0.8 (jitter = 7.2) 9.892
80 images/sec: 871.4 +/- 0.8 (jitter = 6.6) 9.790
90 images/sec: 870.2 +/- 0.8 (jitter = 7.7) 9.795
100 images/sec: 869.6 +/- 0.8 (jitter = 7.0) 9.746
----------------------------------------------------------------
total images/sec: 869.30
----------------------------------------------------------------
```
### 4X Quadro RTX 8000 (FP16, batch size 480)
* Command Line
```bash
export CUDA_VISIBLE_DEVICES=4,5,6,7
python tf_cnn_benchmarks.py --num_gpus=4 --batch_size=480 \
--model=resnet152 \
--data_format=NCHW \
--variable_update=replicated \
--nodistortions \
--use_fp16
```
* Output
```bash
TensorFlow: 1.12
Model: resnet152
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 1920 global
480.0 per device
Num batches: 100
Num epochs: 0.15
Devices: ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']
Data format: NCHW
Layout optimizer: False
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating model
W0306 09:09:39.098187 140650919057216 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:1761: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-03-06 09:09:50.061734: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0c:00.0
totalMemory: 47.43GiB freeMemory: 47.23GiB
2019-03-06 09:09:50.927108: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0d:00.0
totalMemory: 47.43GiB freeMemory: 47.23GiB
2019-03-06 09:09:52.016024: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 2 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0e:00.0
totalMemory: 47.43GiB freeMemory: 47.23GiB
2019-03-06 09:09:52.936119: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 3 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0f:00.0
totalMemory: 47.43GiB freeMemory: 47.23GiB
2019-03-06 09:09:53.042688: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2019-03-06 09:09:55.918460: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-06 09:09:55.918506: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2019-03-06 09:09:55.918514: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y Y Y
2019-03-06 09:09:55.918519: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N Y Y
2019-03-06 09:09:55.918524: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: Y Y N Y
2019-03-06 09:09:55.918528: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: Y Y Y N
2019-03-06 09:09:55.926480: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 45863 MB memory) -> physical GPU (device: 0, name: Quadro RTX 8000, pci bus id: 0000:0c:00.0, compute capability: 7.5)
2019-03-06 09:09:55.927890: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 45863 MB memory) -> physical GPU (device: 1, name: Quadro RTX 8000, pci bus id: 0000:0d:00.0, compute capability: 7.5)
2019-03-06 09:09:55.928336: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 45863 MB memory) -> physical GPU (device: 2, name: Quadro RTX 8000, pci bus id: 0000:0e:00.0, compute capability: 7.5)
2019-03-06 09:09:55.928650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 45863 MB memory) -> physical GPU (device: 3, name: Quadro RTX 8000, pci bus id: 0000:0f:00.0, compute capability: 7.5)
I0306 09:10:10.047497 140650919057216 tf_logging.py:115] Running local_init_op.
I0306 09:10:24.649841 140650919057216 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 1070.7 +/- 0.0 (jitter = 0.0) 9.861
10 images/sec: 1074.4 +/- 1.2 (jitter = 2.9) 9.772
20 images/sec: 1073.4 +/- 0.9 (jitter = 4.8) 9.730
30 images/sec: 1071.3 +/- 0.9 (jitter = 5.6) 9.652
40 images/sec: 1069.7 +/- 0.8 (jitter = 6.0) 9.612
50 images/sec: 1067.9 +/- 0.8 (jitter = 6.4) 9.572
60 images/sec: 1066.7 +/- 0.8 (jitter = 6.3) 9.546
70 images/sec: 1065.6 +/- 0.8 (jitter = 6.3) 9.524
80 images/sec: 1064.6 +/- 0.7 (jitter = 5.6) 9.526
90 images/sec: 1063.8 +/- 0.7 (jitter = 5.6) 9.521
100 images/sec: 1063.1 +/- 0.7 (jitter = 5.4) 9.516
----------------------------------------------------------------
total images/sec: 1062.99
----------------------------------------------------------------
```
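The four Quadro RTX 8000 runs differ only in `--batch_size`, so they form a small batch-size sweep. As a sketch of how to read it, the snippet below expresses each total as a gain over the batch-size-40 run; `gain_over_smallest` is a hypothetical helper and the figures are copied from the logs above.

```python
# Per-device batch size -> total images/sec, 4X Quadro RTX 8000, FP16.
sweep = {40: 750.02, 96: 869.30, 256: 996.13, 480: 1062.99}


def gain_over_smallest(results: dict) -> dict:
    """Throughput of each run relative to the run with the smallest batch size."""
    base = results[min(results)]
    return {batch: value / base for batch, value in sorted(results.items())}


gains = gain_over_smallest(sweep)
for batch, gain in gains.items():
    print(f"batch_size={batch}: {gain:.2f}x")
```

On these numbers the returns diminish: going from 40 to 96 images per device adds roughly 16%, while going from 256 to 480 adds only about 7% more.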
### 2X GeForce RTX 2080 Ti (FP16)
* Command Line
```bash
export CUDA_VISIBLE_DEVICES=4,5
python tf_cnn_benchmarks.py --num_gpus=2 --batch_size=40 \
--model=resnet152 \
--data_format=NCHW \
--variable_update=replicated \
--nodistortions \
--use_fp16
```
* Output
```bash
TensorFlow: 1.12
Model: resnet152
Dataset: imagenet (synthetic)
Mode: BenchmarkMode.TRAIN
SingleSess: False
Batch size: 80 global
40.0 per device
Num batches: 100
Num epochs: 0.01
Devices: ['/gpu:0', '/gpu:1']
Data format: NCHW
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating training model
Initializing graph
W0114 10:24:47.426653 140179035752256 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2157: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-01-14 10:24:50.765500: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1
2019-01-14 10:24:50.767682: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-14 10:24:50.767695: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1
2019-01-14 10:24:50.767701: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N N
2019-01-14 10:24:50.767706: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: N N
2019-01-14 10:24:50.768625: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10168 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:0c:00.0, compute capability: 7.5)
2019-01-14 10:24:50.773317: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10168 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:0d:00.0, compute capability: 7.5)
I0114 10:24:56.086123 140179035752256 tf_logging.py:115] Running local_init_op.
I0114 10:24:57.728647 140179035752256 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 336.0 +/- 0.0 (jitter = 0.0) 9.968
10 images/sec: 333.8 +/- 2.0 (jitter = 3.9) 9.874
20 images/sec: 331.7 +/- 1.4 (jitter = 5.7) 9.975
30 images/sec: 331.1 +/- 1.0 (jitter = 4.7) 9.926
40 images/sec: 332.1 +/- 0.9 (jitter = 5.2) 9.879
50 images/sec: 331.2 +/- 0.8 (jitter = 5.3) 9.827
60 images/sec: 331.1 +/- 0.7 (jitter = 4.8) 10.041
70 images/sec: 330.6 +/- 0.6 (jitter = 5.3) 9.767
80 images/sec: 330.7 +/- 0.6 (jitter = 5.6) 9.959
90 images/sec: 330.7 +/- 0.6 (jitter = 5.3) 10.099
100 images/sec: 330.5 +/- 0.5 (jitter = 5.6) 9.871
----------------------------------------------------------------
total images/sec: 330.38
----------------------------------------------------------------
```
### 1X GeForce RTX 2080 Ti (FP16)
* Command Line
```bash
export CUDA_VISIBLE_DEVICES=4
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=40 \
--model=resnet152 \
--data_format=NCHW \
--variable_update=replicated \
--nodistortions \
--use_fp16
```
* Output
```bash
...
TensorFlow: 1.12
Model: resnet152
Dataset: imagenet (synthetic)
Mode: BenchmarkMode.TRAIN
SingleSess: False
Batch size: 40 global
40.0 per device
Num batches: 100
Num epochs: 0.00
Devices: ['/gpu:0']
Data format: NCHW
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating training model
Initializing graph
W0114 10:27:58.245692 140021949130560 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2157: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-01-14 10:27:59.638065: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-01-14 10:27:59.638142: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-14 10:27:59.638150: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-01-14 10:27:59.638156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-01-14 10:27:59.638413: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10168 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:0c:00.0, compute capability: 7.5)
I0114 10:28:01.636178 140021949130560 tf_logging.py:115] Running local_init_op.
I0114 10:28:01.808087 140021949130560 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 183.9 +/- 0.0 (jitter = 0.0) 9.857
10 images/sec: 187.8 +/- 0.5 (jitter = 1.2) 10.075
20 images/sec: 189.2 +/- 0.8 (jitter = 2.6) 10.146
30 images/sec: 188.8 +/- 0.5 (jitter = 2.0) 9.800
40 images/sec: 189.2 +/- 0.5 (jitter = 2.6) 10.111
50 images/sec: 189.4 +/- 0.4 (jitter = 2.5) 9.862
60 images/sec: 189.2 +/- 0.3 (jitter = 2.2) 10.208
70 images/sec: 189.0 +/- 0.3 (jitter = 2.1) 9.681
80 images/sec: 189.0 +/- 0.3 (jitter = 2.1) 10.123
90 images/sec: 188.9 +/- 0.3 (jitter = 2.2) 10.175
100 images/sec: 188.8 +/- 0.3 (jitter = 2.1) 9.838
----------------------------------------------------------------
total images/sec: 188.72
----------------------------------------------------------------
```
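The `Num epochs: 0.00` line above is a rounding effect rather than an error. The sketch below assumes the value is batches times global batch size divided by the training-set size, and takes the ImageNet-1k training set as 1,281,167 images. Neither assumption is stated in the logs.

```python
# Assumption: ImageNet-1k training set size; not stated in the benchmark output.
IMAGENET_TRAIN_IMAGES = 1_281_167


def epochs_covered(num_batches, per_device_batch, num_gpus):
    """Fraction of one epoch processed (global batch = per-device batch * GPUs)."""
    global_batch = per_device_batch * num_gpus
    return num_batches * global_batch / IMAGENET_TRAIN_IMAGES


for gpus in (1, 2, 4):
    epochs = epochs_covered(100, 40, gpus)
    print(f"{gpus} GPU(s): global batch {40 * gpus}, epochs {epochs:.2f}")
```

Under these assumptions, 100 batches cover about 0.3% of an epoch on one GPU and about 1.2% on four. This is consistent with the `0.00` and `0.01` values printed in the logs.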
### 4X GeForce RTX 2080 Ti (FP32)
* Command Line
```bash
export CUDA_VISIBLE_DEVICES=4,5,6,7
python tf_cnn_benchmarks.py --num_gpus=4 --batch_size=40 \
--model=resnet152 \
--data_format=NCHW \
--variable_update=replicated \
--nodistortions \
--nouse_fp16
```
* Output
```bash
...
TensorFlow: 1.12
Model: resnet152
Dataset: imagenet (synthetic)
Mode: BenchmarkMode.TRAIN
SingleSess: False
Batch size: 160 global
40.0 per device
Num batches: 100
Num epochs: 0.01
Devices: ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']
Data format: NCHW
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating training model
Initializing graph
W0114 10:31:28.515695 140609529079616 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2157: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-01-14 10:31:33.855940: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2019-01-14 10:31:33.873226: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-14 10:31:33.873242: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2019-01-14 10:31:33.873249: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N N N N
2019-01-14 10:31:33.873253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: N N N N
2019-01-14 10:31:33.873257: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: N N N N
2019-01-14 10:31:33.873261: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: N N N N
2019-01-14 10:31:33.877902: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10168 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:0c:00.0, compute capability: 7.5)
2019-01-14 10:31:33.883196: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10168 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:0d:00.0, compute capability: 7.5)
2019-01-14 10:31:33.888513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10168 MB memory) -> physical GPU (device: 2, name: GeForce RTX 2080 Ti, pci bus id: 0000:0e:00.0, compute capability: 7.5)
2019-01-14 10:31:33.894109: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10168 MB memory) -> physical GPU (device: 3, name: GeForce RTX 2080 Ti, pci bus id: 0000:0f:00.0, compute capability: 7.5)
I0114 10:31:43.655935 140609529079616 tf_logging.py:115] Running local_init_op.
I0114 10:31:46.790667 140609529079616 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 282.5 +/- 0.0 (jitter = 0.0) 9.868
10 images/sec: 283.4 +/- 0.4 (jitter = 1.5) 9.824
20 images/sec: 283.7 +/- 0.3 (jitter = 1.7) 9.978
30 images/sec: 283.5 +/- 0.3 (jitter = 2.0) 9.960
40 images/sec: 283.4 +/- 0.2 (jitter = 2.0) 9.980
50 images/sec: 283.1 +/- 0.2 (jitter = 1.8) 9.801
60 images/sec: 282.9 +/- 0.2 (jitter = 1.6) 9.911
70 images/sec: 282.7 +/- 0.2 (jitter = 1.7) 9.863
80 images/sec: 282.5 +/- 0.2 (jitter = 1.7) 9.933
90 images/sec: 282.4 +/- 0.2 (jitter = 1.7) 9.830
100 images/sec: 282.3 +/- 0.2 (jitter = 1.6) 9.798
----------------------------------------------------------------
total images/sec: 282.24
----------------------------------------------------------------
```
### 2X GeForce RTX 2080 Ti (FP32)
* Command Line
```bash
export CUDA_VISIBLE_DEVICES=4,5
python tf_cnn_benchmarks.py --num_gpus=2 --batch_size=40 \
--model=resnet152 \
--data_format=NCHW \
--variable_update=replicated \
--nodistortions \
--nouse_fp16
```
* Output
```bash
...
TensorFlow: 1.12
Model: resnet152
Dataset: imagenet (synthetic)
Mode: BenchmarkMode.TRAIN
SingleSess: False
Batch size: 80 global
40.0 per device
Num batches: 100
Num epochs: 0.01
Devices: ['/gpu:0', '/gpu:1']
Data format: NCHW
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating training model
Initializing graph
W0114 10:34:25.444249 140291245438784 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2157: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-01-14 10:34:28.001151: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1
2019-01-14 10:34:28.003420: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-14 10:34:28.003436: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1
2019-01-14 10:34:28.003443: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N N
2019-01-14 10:34:28.003448: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: N N
2019-01-14 10:34:28.004899: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10168 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:0c:00.0, compute capability: 7.5)
2019-01-14 10:34:28.011071: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10168 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:0d:00.0, compute capability: 7.5)
I0114 10:34:33.320369 140291245438784 tf_logging.py:115] Running local_init_op.
I0114 10:34:34.740233 140291245438784 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 205.7 +/- 0.0 (jitter = 0.0) 9.789
10 images/sec: 206.1 +/- 0.3 (jitter = 1.3) 9.812
20 images/sec: 205.9 +/- 0.4 (jitter = 1.5) 9.996
30 images/sec: 205.9 +/- 0.3 (jitter = 1.5) 9.851
40 images/sec: 205.6 +/- 0.3 (jitter = 1.3) 10.102
50 images/sec: 205.5 +/- 0.2 (jitter = 1.3) 9.877
60 images/sec: 205.3 +/- 0.2 (jitter = 1.5) 9.866
70 images/sec: 205.2 +/- 0.2 (jitter = 1.4) 9.916
80 images/sec: 205.1 +/- 0.2 (jitter = 1.5) 9.897
90 images/sec: 205.1 +/- 0.2 (jitter = 1.5) 9.799
100 images/sec: 205.0 +/- 0.2 (jitter = 1.5) 9.787
----------------------------------------------------------------
total images/sec: 204.94
----------------------------------------------------------------
```
### 1X GeForce RTX 2080 Ti (FP32)
* Command Line
```bash
export CUDA_VISIBLE_DEVICES=4
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=40 \
--model=resnet152 \
--data_format=NCHW \
--variable_update=replicated \
--nodistortions \
--nouse_fp16
```
* Output
```bash
...
TensorFlow: 1.12
Model: resnet152
Dataset: imagenet (synthetic)
Mode: BenchmarkMode.TRAIN
SingleSess: False
Batch size: 40 global
40.0 per device
Num batches: 100
Num epochs: 0.00
Devices: ['/gpu:0']
Data format: NCHW
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating training model
Initializing graph
W0114 10:38:45.863283 140694985090880 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2157: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-01-14 10:38:47.438329: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-01-14 10:38:47.438394: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-14 10:38:47.438402: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-01-14 10:38:47.438408: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-01-14 10:38:47.439365: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10168 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:0c:00.0, compute capability: 7.5)
I0114 10:38:49.174678 140694985090880 tf_logging.py:115] Running local_init_op.
I0114 10:38:49.365254 140694985090880 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 115.1 +/- 0.0 (jitter = 0.0) 9.865
10 images/sec: 113.0 +/- 0.6 (jitter = 1.2) 9.741
20 images/sec: 112.8 +/- 0.4 (jitter = 1.5) 10.067
30 images/sec: 112.9 +/- 0.3 (jitter = 1.2) 9.834
40 images/sec: 112.9 +/- 0.2 (jitter = 1.1) 10.052
50 images/sec: 113.0 +/- 0.2 (jitter = 0.9) 9.889
60 images/sec: 113.0 +/- 0.2 (jitter = 1.0) 9.771
70 images/sec: 112.8 +/- 0.2 (jitter = 1.2) 9.697
80 images/sec: 112.6 +/- 0.2 (jitter = 1.3) 9.946
90 images/sec: 112.5 +/- 0.1 (jitter = 1.3) 9.611
100 images/sec: 112.3 +/- 0.1 (jitter = 1.6) 9.870
----------------------------------------------------------------
total images/sec: 112.24
----------------------------------------------------------------
```
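The 2080 Ti runs above differ only in `CUDA_VISIBLE_DEVICES`, `--num_gpus`, and the precision flag. The sketch below shows how that run matrix could be generated, using only the flags that appear in the commands above. The helper `build_command` is hypothetical, and nothing is executed here.

```python
import shlex


def build_command(device_ids, use_fp16, batch_size=40, model="resnet152"):
    """Return (env, argv) for one tf_cnn_benchmarks run; nothing is executed."""
    env = {"CUDA_VISIBLE_DEVICES": ",".join(str(d) for d in device_ids)}
    argv = [
        "python", "tf_cnn_benchmarks.py",
        f"--num_gpus={len(device_ids)}",
        f"--batch_size={batch_size}",
        f"--model={model}",
        "--data_format=NCHW",
        "--variable_update=replicated",
        "--nodistortions",
        "--use_fp16" if use_fp16 else "--nouse_fp16",
    ]
    return env, argv


# Device IDs 4-7 are the ones used for the RTX 2080 Ti runs in the commands above.
for use_fp16 in (True, False):
    for device_ids in ([4, 5, 6, 7], [4, 5], [4]):
        env, argv = build_command(device_ids, use_fp16)
        quoted = " ".join(shlex.quote(arg) for arg in argv)
        print(f"CUDA_VISIBLE_DEVICES={env['CUDA_VISIBLE_DEVICES']} {quoted}")
```

To launch a run rather than print it, `argv` could be passed to `subprocess.run` from the `tf_cnn_benchmarks` directory, with `env` merged into a copy of `os.environ`.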
### 4X V100 (FP16)
* Command Line
```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
python tf_cnn_benchmarks.py --num_gpus=4 --batch_size=40 \
--model=resnet152 \
--data_format=NCHW \
--variable_update=replicated \
--nodistortions \
--use_fp16
```
* Output
```bash
...
TensorFlow: 1.12
Model: resnet152
Dataset: imagenet (synthetic)
Mode: BenchmarkMode.TRAIN
SingleSess: False
Batch size: 160 global
40.0 per device
Num batches: 100
Num epochs: 0.01
Devices: ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']
Data format: NCHW
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating training model
Initializing graph
W0114 10:47:45.852792 140483332290368 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2157: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-01-14 10:47:52.202071: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2019-01-14 10:47:52.204629: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-14 10:47:52.204641: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2019-01-14 10:47:52.204647: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y Y Y
2019-01-14 10:47:52.204669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N Y Y
2019-01-14 10:47:52.204673: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: Y Y N Y
2019-01-14 10:47:52.204678: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: Y Y Y N
2019-01-14 10:47:52.207206: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30378 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:04:00.0, compute capability: 7.0)
2019-01-14 10:47:52.212262: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30378 MB memory) -> physical GPU (device: 1, name: Tesla V100-PCIE-32GB, pci bus id: 0000:06:00.0, compute capability: 7.0)
2019-01-14 10:47:52.217073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 14846 MB memory) -> physical GPU (device: 2, name: Tesla V100-PCIE-16GB, pci bus id: 0000:07:00.0, compute capability: 7.0)
2019-01-14 10:47:52.220718: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 30378 MB memory) -> physical GPU (device: 3, name: Tesla V100-PCIE-32GB, pci bus id: 0000:08:00.0, compute capability: 7.0)
I0114 10:48:03.024006 140483332290368 tf_logging.py:115] Running local_init_op.
I0114 10:48:06.137128 140483332290368 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 698.3 +/- 0.0 (jitter = 0.0) 10.070
10 images/sec: 719.6 +/- 4.9 (jitter = 13.5) 9.957
20 images/sec: 720.6 +/- 3.1 (jitter = 13.6) 9.992
30 images/sec: 711.8 +/- 3.6 (jitter = 24.0) 9.895
40 images/sec: 711.0 +/- 3.5 (jitter = 24.3) 9.861
50 images/sec: 710.3 +/- 3.0 (jitter = 23.3) 9.889
60 images/sec: 710.1 +/- 2.7 (jitter = 22.5) 9.974
70 images/sec: 711.8 +/- 2.5 (jitter = 21.8) 9.875
80 images/sec: 710.3 +/- 2.5 (jitter = 21.8) 9.925
90 images/sec: 710.7 +/- 2.3 (jitter = 21.3) 9.980
100 images/sec: 710.8 +/- 2.2 (jitter = 20.1) 9.956
----------------------------------------------------------------
total images/sec: 710.49
----------------------------------------------------------------
```
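The edge matrix in the log is one visible difference between the two sets of cards: the V100 runs print `Y` between every pair of distinct devices, while the 2080 Ti runs print `N` throughout. Reading `Y` as "peer-to-peer access can be enabled between this pair" is an assumption about TensorFlow's logging, not something the log itself states. The sketch below pulls the matrix out of a saved log; `parse_edge_matrix` is a hypothetical helper.

```python
import re

# Matches matrix rows such as "... gpu_device.cc:1001] 0: N Y Y Y".
ROW_RE = re.compile(r"\]\s*(\d+):\s+((?:[NY]\s*)+)$")


def parse_edge_matrix(log_text):
    """Map each device index to a list of booleans (True where the log prints 'Y')."""
    matrix = {}
    for line in log_text.splitlines():
        match = ROW_RE.search(line)
        if match:
            flags = match.group(2).split()
            matrix[int(match.group(1))] = [flag == "Y" for flag in flags]
    return matrix


# Sample copied from the 4X V100 output above.
sample = """\
2019-01-14 10:47:52.204641: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2019-01-14 10:47:52.204647: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y Y Y
2019-01-14 10:47:52.204669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N Y Y
2019-01-14 10:47:52.204673: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: Y Y N Y
2019-01-14 10:47:52.204678: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: Y Y Y N
"""
for device, row in parse_edge_matrix(sample).items():
    print(device, row)
```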
### 2X V100 (FP16)
* Command Line
```bash
export CUDA_VISIBLE_DEVICES=0,1
python tf_cnn_benchmarks.py --num_gpus=2 --batch_size=40 \
--model=resnet152 \
--data_format=NCHW \
--variable_update=replicated \
--nodistortions \
--use_fp16
```
* Output
```bash
...
TensorFlow: 1.12
Model: resnet152
Dataset: imagenet (synthetic)
Mode: BenchmarkMode.TRAIN
SingleSess: False
Batch size: 80 global
40.0 per device
Num batches: 100
Num epochs: 0.01
Devices: ['/gpu:0', '/gpu:1']
Data format: NCHW
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating training model
Initializing graph
W0114 10:50:57.990253 140520091227968 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2157: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-01-14 10:51:01.123257: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1
2019-01-14 10:51:01.124415: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-14 10:51:01.124431: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1
2019-01-14 10:51:01.124438: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y
2019-01-14 10:51:01.124443: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N
2019-01-14 10:51:01.126344: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30378 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:04:00.0, compute capability: 7.0)
2019-01-14 10:51:01.133181: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30378 MB memory) -> physical GPU (device: 1, name: Tesla V100-PCIE-32GB, pci bus id: 0000:06:00.0, compute capability: 7.0)
I0114 10:51:07.232126 140520091227968 tf_logging.py:115] Running local_init_op.
I0114 10:51:08.641624 140520091227968 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 451.3 +/- 0.0 (jitter = 0.0) 9.972
10 images/sec: 442.4 +/- 3.0 (jitter = 10.1) 9.875
20 images/sec: 437.3 +/- 2.6 (jitter = 11.4) 9.991
30 images/sec: 437.7 +/- 1.9 (jitter = 10.6) 9.940
40 images/sec: 435.9 +/- 1.7 (jitter = 11.5) 9.903
50 images/sec: 435.1 +/- 1.7 (jitter = 11.4) 9.852
60 images/sec: 435.0 +/- 1.5 (jitter = 11.6) 10.081
70 images/sec: 434.4 +/- 1.3 (jitter = 10.8) 9.750
80 images/sec: 433.5 +/- 1.3 (jitter = 11.2) 9.976
90 images/sec: 434.0 +/- 1.3 (jitter = 11.1) 10.096
100 images/sec: 433.9 +/- 1.3 (jitter = 12.0) 9.894
----------------------------------------------------------------
total images/sec: 433.70
----------------------------------------------------------------
```
### 1X V100 (FP16)
* Command Line
```bash
export CUDA_VISIBLE_DEVICES=0
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=40 \
--model=resnet152 \
--data_format=NCHW \
--variable_update=replicated \
--nodistortions \
--use_fp16
```
* Output
```bash
...
TensorFlow: 1.12
Model: resnet152
Dataset: imagenet (synthetic)
Mode: BenchmarkMode.TRAIN
SingleSess: False
Batch size: 40 global
40.0 per device
Num batches: 100
Num epochs: 0.00
Devices: ['/gpu:0']
Data format: NCHW
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating training model
Initializing graph
W0114 10:52:54.713240 139707288196928 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2157: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-01-14 10:52:56.445768: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-01-14 10:52:56.445845: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-14 10:52:56.445854: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-01-14 10:52:56.445861: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-01-14 10:52:56.446789: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30378 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:04:00.0, compute capability: 7.0)
I0114 10:52:59.471376 139707288196928 tf_logging.py:115] Running local_init_op.
I0114 10:52:59.798634 139707288196928 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 233.3 +/- 0.0 (jitter = 0.0) 9.861
10 images/sec: 242.1 +/- 1.7 (jitter = 3.4) 10.049
20 images/sec: 239.6 +/- 1.6 (jitter = 3.6) 10.145
30 images/sec: 239.3 +/- 1.1 (jitter = 2.7) 9.811
40 images/sec: 239.2 +/- 1.1 (jitter = 3.4) 10.123
50 images/sec: 239.0 +/- 0.9 (jitter = 2.7) 9.817
60 images/sec: 239.2 +/- 0.8 (jitter = 3.3) 10.192
70 images/sec: 239.8 +/- 0.8 (jitter = 3.3) 9.654
80 images/sec: 239.6 +/- 0.9 (jitter = 3.6) 10.112
90 images/sec: 239.9 +/- 0.9 (jitter = 3.7) 10.224
100 images/sec: 240.1 +/- 0.8 (jitter = 3.8) 9.839
----------------------------------------------------------------
total images/sec: 239.86
----------------------------------------------------------------
```
### 4X V100 (FP32)
* Command Line
```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
python tf_cnn_benchmarks.py --num_gpus=4 --batch_size=40 \
--model=resnet152 \
--data_format=NCHW \
--variable_update=replicated \
--nodistortions \
--nouse_fp16
```
* Output
```bash
...
Model: resnet152
Dataset: imagenet (synthetic)
Mode: BenchmarkMode.TRAIN
SingleSess: False
Batch size: 160 global
40.0 per device
Num batches: 100
Num epochs: 0.01
Devices: ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']
Data format: NCHW
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating training model
Initializing graph
W0114 10:55:41.717653 140648141563712 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2157: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-01-14 10:55:47.053235: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2019-01-14 10:55:47.061802: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-14 10:55:47.061839: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2019-01-14 10:55:47.061854: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y Y Y
2019-01-14 10:55:47.061866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N Y Y
2019-01-14 10:55:47.061883: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: Y Y N Y
2019-01-14 10:55:47.061898: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: Y Y Y N
2019-01-14 10:55:47.065978: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30378 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:04:00.0, compute capability: 7.0)
2019-01-14 10:55:47.071735: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30378 MB memory) -> physical GPU (device: 1, name: Tesla V100-PCIE-32GB, pci bus id: 0000:06:00.0, compute capability: 7.0)
2019-01-14 10:55:47.077161: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 14846 MB memory) -> physical GPU (device: 2, name: Tesla V100-PCIE-16GB, pci bus id: 0000:07:00.0, compute capability: 7.0)
2019-01-14 10:55:47.083051: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 30378 MB memory) -> physical GPU (device: 3, name: Tesla V100-PCIE-32GB, pci bus id: 0000:08:00.0, compute capability: 7.0)
I0114 10:55:57.681849 140648141563712 tf_logging.py:115] Running local_init_op.
I0114 10:56:00.601779 140648141563712 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 419.4 +/- 0.0 (jitter = 0.0) 9.840
10 images/sec: 420.1 +/- 1.5 (jitter = 5.6) 9.836
20 images/sec: 418.6 +/- 1.2 (jitter = 6.6) 9.980
30 images/sec: 417.4 +/- 1.0 (jitter = 6.2) 9.957
40 images/sec: 416.5 +/- 0.8 (jitter = 5.8) 9.959
50 images/sec: 417.1 +/- 0.7 (jitter = 5.6) 9.787
60 images/sec: 417.5 +/- 0.7 (jitter = 5.5) 9.930
70 images/sec: 417.6 +/- 0.6 (jitter = 5.7) 9.836
80 images/sec: 417.6 +/- 0.6 (jitter = 5.6) 9.950
90 images/sec: 417.9 +/- 0.5 (jitter = 5.3) 9.857
100 images/sec: 418.0 +/- 0.5 (jitter = 5.1) 9.797
----------------------------------------------------------------
total images/sec: 417.89
----------------------------------------------------------------
```
### 2X V100 (FP32)
* Command Line
```bash
export CUDA_VISIBLE_DEVICES=0,1
python tf_cnn_benchmarks.py --num_gpus=2 --batch_size=40 \
--model=resnet152 \
--data_format=NCHW \
--variable_update=replicated \
--nodistortions \
--nouse_fp16
```
* Output
```bash
...
TensorFlow: 1.12
Model: resnet152
Dataset: imagenet (synthetic)
Mode: BenchmarkMode.TRAIN
SingleSess: False
Batch size: 80 global
40.0 per device
Num batches: 100
Num epochs: 0.01
Devices: ['/gpu:0', '/gpu:1']
Data format: NCHW
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating training model
Initializing graph
W0114 10:59:46.353019 140055177267008 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2157: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-01-14 10:59:48.894146: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1
2019-01-14 10:59:48.895102: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-14 10:59:48.895117: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1
2019-01-14 10:59:48.895124: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y
2019-01-14 10:59:48.895129: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N
2019-01-14 10:59:48.897894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30378 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:04:00.0, compute capability: 7.0)
2019-01-14 10:59:48.902274: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30378 MB memory) -> physical GPU (device: 1, name: Tesla V100-PCIE-32GB, pci bus id: 0000:06:00.0, compute capability: 7.0)
I0114 10:59:54.778629 140055177267008 tf_logging.py:115] Running local_init_op.
I0114 10:59:56.147655 140055177267008 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 240.6 +/- 0.0 (jitter = 0.0) 9.819
10 images/sec: 238.5 +/- 0.7 (jitter = 1.6) 9.804
20 images/sec: 238.2 +/- 0.5 (jitter = 2.1) 10.034
30 images/sec: 238.4 +/- 0.3 (jitter = 2.0) 9.844
40 images/sec: 238.3 +/- 0.3 (jitter = 2.0) 10.095
50 images/sec: 237.9 +/- 0.3 (jitter = 2.5) 9.891
60 images/sec: 238.0 +/- 0.3 (jitter = 2.4) 9.858
70 images/sec: 237.8 +/- 0.3 (jitter = 2.6) 9.949
80 images/sec: 237.6 +/- 0.3 (jitter = 2.5) 9.899
90 images/sec: 237.5 +/- 0.2 (jitter = 2.7) 9.777
100 images/sec: 237.5 +/- 0.2 (jitter = 2.6) 9.791
----------------------------------------------------------------
total images/sec: 237.43
----------------------------------------------------------------
```
### 1X V100 (FP32)
* Command Line
```bash
export CUDA_VISIBLE_DEVICES=0
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=40 \
--model=resnet152 \
--data_format=NCHW \
--variable_update=replicated \
--nodistortions \
--nouse_fp16
```
* Output
```bash
...
TensorFlow: 1.12
Model: resnet152
Dataset: imagenet (synthetic)
Mode: BenchmarkMode.TRAIN
SingleSess: False
Batch size: 40 global
40.0 per device
Num batches: 100
Num epochs: 0.00
Devices: ['/gpu:0']
Data format: NCHW
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Generating training model
Initializing graph
W0114 11:01:48.743038 139865749051200 tf_logging.py:125] From /workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2157: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-01-14 11:01:50.330534: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-01-14 11:01:50.331216: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-14 11:01:50.331783: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-01-14 11:01:50.331805: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-01-14 11:01:50.332710: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30378 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:04:00.0, compute capability: 7.0)
I0114 11:01:52.607747 139865749051200 tf_logging.py:115] Running local_init_op.
I0114 11:01:52.804631 139865749051200 tf_logging.py:115] Done running local_init_op.
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 122.4 +/- 0.0 (jitter = 0.0) 9.924
10 images/sec: 123.3 +/- 0.6 (jitter = 2.2) 9.732
20 images/sec: 124.4 +/- 0.4 (jitter = 1.6) 10.058
30 images/sec: 124.6 +/- 0.3 (jitter = 1.0) 9.818
40 images/sec: 124.9 +/- 0.2 (jitter = 0.9) 10.044
50 images/sec: 125.1 +/- 0.2 (jitter = 1.0) 9.893
60 images/sec: 125.0 +/- 0.2 (jitter = 1.1) 9.798
70 images/sec: 125.1 +/- 0.2 (jitter = 1.1) 9.733
80 images/sec: 125.1 +/- 0.2 (jitter = 1.1) 9.947
90 images/sec: 125.1 +/- 0.1 (jitter = 1.1) 9.631
100 images/sec: 125.1 +/- 0.1 (jitter = 1.2) 9.861
----------------------------------------------------------------
total images/sec: 125.05
----------------------------------------------------------------
```
# Summary
#### GeForce RTX 2080 Ti
| Model | Batch Size | Data | Precision | #GPUs | #Images/sec | Speed-up | Efficiency |
|------------|:------------:|----------------------|:-----------:|:-------:|:-------------:|:---:|:---:|
| ResNet152 | 40 | ImageNet (Synthetic) | FP16+32 | 1 | 188.72 | 1x | - |
| ResNet152 | 40 | ImageNet (Synthetic) | FP16+32 | 2 | 330.38 | 1.75x | 88% |
| ResNet152 | 40 | ImageNet (Synthetic) | FP16+32 | 4 | 430.91 | 2.28x | 57% |

| Model | Batch Size | Data | Precision | #GPUs | #Images/sec | Speed-up | Efficiency |
|------------|:------------:|----------------------|:-----------:|:-------:|:-------------:|:---:|:---:|
| ResNet152 | 40 | ImageNet (Synthetic) | FP32 | 1 | 112.24 | 1x | - |
| ResNet152 | 40 | ImageNet (Synthetic) | FP32 | 2 | 204.94 | 1.83x | 91% |
| ResNet152 | 40 | ImageNet (Synthetic) | FP32 | 4 | 282.24 | 2.51x | 63% |
#### Tesla V100
| Model | Batch Size | Data | Precision | #GPUs | #Images/sec | Speed-up | Efficiency |
|------------|:------------:|----------------------|:-----------:|:-------:|:-------------:|:---:|:---:|
| ResNet152 | 40 | ImageNet (Synthetic) | FP16+32 | 1 | 239.86 | 1x | - |
| ResNet152 | 40 | ImageNet (Synthetic) | FP16+32 | 2 | 433.70 | 1.81x | 90% |
| ResNet152 | 40 | ImageNet (Synthetic) | FP16+32 | 4 | 710.49 | 2.96x | 74% |

| Model | Batch Size | Data | Precision | #GPUs | #Images/sec | Speed-up | Efficiency |
|------------|:------------:|----------------------|:-----------:|:-------:|:-------------:|:---:|:---:|
| ResNet152 | 40 | ImageNet (Synthetic) | FP32 | 1 | 125.05 | 1x | - |
| ResNet152 | 40 | ImageNet (Synthetic) | FP32 | 2 | 237.43 | 1.90x | 95% |
| ResNet152 | 40 | ImageNet (Synthetic) | FP32 | 4 | 417.89 | 3.34x | 84% |
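
The Speed-up and Efficiency columns follow from the throughput alone: speed-up is the N-GPU rate divided by the 1-GPU rate, and efficiency is speed-up divided by N. The short sketch below recomputes both from the `#Images/sec` values in the tables; the dictionary layout and the function name `scaling` are illustrative only.

```python
# Throughput (images/sec) copied from the summary tables, keyed by GPU count.
RESULTS = {
    ("RTX 2080 Ti", "FP16+32"): {1: 188.72, 2: 330.38, 4: 430.91},
    ("RTX 2080 Ti", "FP32"): {1: 112.24, 2: 204.94, 4: 282.24},
    ("V100", "FP16+32"): {1: 239.86, 2: 433.70, 4: 710.49},
    ("V100", "FP32"): {1: 125.05, 2: 237.43, 4: 417.89},
}


def scaling(rates):
    """Return {num_gpus: (speed_up, efficiency)} relative to the 1-GPU rate."""
    base = rates[1]
    return {n: (rate / base, rate / base / n) for n, rate in sorted(rates.items())}


for (gpu, precision), rates in RESULTS.items():
    for n, (speed_up, efficiency) in scaling(rates).items():
        print(f"{gpu} {precision} x{n}: {speed_up:.2f}x, {efficiency:.0%}")
```

Computing efficiency from the unrounded speed-up avoids compounding rounding error, which matters when a value sits close to a half-percent boundary.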