# One Swift Chop: RTX 3080 ResNet-50 Benchmark
The company just got its hands on an RTX 3080. What should we test first? Games, obviously, what else?
But everyone has already published gaming benchmarks, so let's test something different, like whether watching videos in 8K is more, uh, comfortable @@
Comfort achieved, back to business. You data scientists may all be too well-paid to stoop to doing AI training on this card, but surely you're curious about its performance!
So let's pit the RTX 3080 against the RTX 2080 Ti!
---
Test machine: Leadtek WS2030
* CPU: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
* GPU 0: RTX 2080 Ti (triple-fan version)
  * CUDA cores: 4352
  * Tensor cores: 544
  * memoryClockRate (GHz): 1.635
* GPU 1: RTX 3080 (triple-fan version)
  * CUDA cores: 8704
  * Tensor cores: 272
  * memoryClockRate (GHz): 1.71
Test environment:
* NVIDIA Docker version: 2.4.0
* NVIDIA driver version: 455.23.04
* NVIDIA NGC image: nvcr.io/nvidia/tensorflow:20.08-tf1-py3
* CUDA version: V11.0.221
* TensorFlow version: 1.15.3
---
With everything in place, it's time to fire off the [ResNet-50 V1.5](https://ngc.nvidia.com/catalog/resources/nvidia:resnet_50_v1_5_for_tensorflow) benchmark that everyone has already run to death. Without further ado, here is the command:
```
nvidia-docker run -i --shm-size=1g --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -v /SSD/imagenet12tf:/imagenet --rm -w /workspace/nvidia-examples/resnet50v1.5 nvcr.io/nvidia/tensorflow:20.08-tf1-py3 \
python ./main.py \
--mode=training_benchmark \
--use_tf_amp \
--warmup_steps 200 \
--batch_size 128 \
--data_dir=/imagenet \
--results_dir=results
```
The run above uses the [ILSVRC2012](http://image-net.org/challenges/LSVRC/2012/) dataset, pre-converted to TFRecords,
and trains with [automatic mixed precision](https://developer.nvidia.com/automatic-mixed-precision).
First, the RTX 2080 Ti numbers:
```
Per-GPU Batch Size 128
Current step: 0
Using step learning rate schedule
DLL 2020-09-24 10:13:17.791773 - (0, 200) total_ips : 662.0026535796197
DLL 2020-09-24 10:13:17.985160 - (0, 201) total_ips : 665.3145414236197
DLL 2020-09-24 10:13:18.178255 - (0, 202) total_ips : 664.3595086269943
DLL 2020-09-24 10:13:18.375696 - (0, 203) total_ips : 649.5157850928716
DLL 2020-09-24 10:13:18.569272 - (0, 204) total_ips : 662.4282190709215
DLL 2020-09-24 10:13:18.760637 - (0, 205) total_ips : 669.9898566351184
DLL 2020-09-24 10:13:18.955276 - (0, 206) total_ips : 658.7817730030861
DLL 2020-09-24 10:13:19.148727 - (0, 207) total_ips : 663.1318862795038
DLL 2020-09-24 10:13:19.342272 - (0, 208) total_ips : 662.4274017224786
DLL 2020-09-24 10:13:19.535058 - (0, 209) total_ips : 665.0615199752245
DLL 2020-09-24 10:13:19.729471 - (0, 210) total_ips : 659.4550289701798
DLL 2020-09-24 10:13:19.924658 - (0, 211) total_ips : 656.8529934984058
DLL 2020-09-24 10:13:20.119133 - (0, 212) total_ips : 659.5870901161005
DLL 2020-09-24 10:13:20.313669 - (0, 213) total_ips : 659.0534992861606
DLL 2020-09-24 10:13:20.509854 - (0, 214) total_ips : 653.9637809413252
DLL 2020-09-24 10:13:20.703981 - (0, 215) total_ips : 660.6812584758591
DLL 2020-09-24 10:13:20.897240 - (0, 216) total_ips : 663.5507040025312
DLL 2020-09-24 10:13:21.091686 - (0, 217) total_ips : 659.399951607812
......
......
DLL 2020-09-24 10:45:55.338228 - (0, 10007) total_ips : 661.7513287513189
DLL 2020-09-24 10:45:55.531867 - (0, 10008) total_ips : 662.6784224708173
DLL 2020-09-24 10:45:55.725281 - (0, 10009) total_ips : 663.5728480300002
DLL 2020-09-24 10:45:55.918063 - (0, 10010) total_ips : 665.6667840439619
DLL 2020-09-24 10:45:56.111078 - (0, 10011) total_ips : 664.8358587484706
Ending Model Training ...
DLL 2020-09-24 10:45:57.098866 - () train_throughput : 652.9993126906716
```
Next up, the RTX 3080:
```
Per-GPU Batch Size 128
Current step: 0
Using step learning rate schedule
DLL 2020-09-24 11:31:27.347913 - (0, 200) total_ips : 751.749479809903
DLL 2020-09-24 11:31:27.519114 - (0, 201) total_ips : 753.7019059064006
DLL 2020-09-24 11:31:27.698696 - (0, 202) total_ips : 715.1727843961598
DLL 2020-09-24 11:31:27.875753 - (0, 203) total_ips : 726.471169615649
DLL 2020-09-24 11:31:28.049781 - (0, 204) total_ips : 737.9638461971978
DLL 2020-09-24 11:31:28.221583 - (0, 205) total_ips : 746.2251139413635
DLL 2020-09-24 11:31:28.391311 - (0, 206) total_ips : 755.9698834794241
DLL 2020-09-24 11:31:28.560339 - (0, 207) total_ips : 758.6887384650171
DLL 2020-09-24 11:31:28.731989 - (0, 208) total_ips : 747.7217026527522
DLL 2020-09-24 11:31:28.901179 - (0, 209) total_ips : 758.4700819683711
DLL 2020-09-24 11:31:29.072321 - (0, 210) total_ips : 749.9453986701682
DLL 2020-09-24 11:31:29.241195 - (0, 211) total_ips : 760.2863895839795
DLL 2020-09-24 11:31:29.410445 - (0, 212) total_ips : 758.4540092590369
DLL 2020-09-24 11:31:29.580621 - (0, 213) total_ips : 754.3182274177005
DLL 2020-09-24 11:31:29.755719 - (0, 214) total_ips : 733.149767848364
DLL 2020-09-24 11:31:29.925434 - (0, 215) total_ips : 756.6368383296996
DLL 2020-09-24 11:31:30.095305 - (0, 216) total_ips : 755.6730090899622
DLL 2020-09-24 11:31:30.265314 - (0, 217) total_ips : 755.0629679843831
......
......
DLL 2020-09-24 12:00:02.964839 - (0, 10007) total_ips : 734.9261639128828
DLL 2020-09-24 12:00:03.140479 - (0, 10008) total_ips : 730.8840950241645
DLL 2020-09-24 12:00:03.317645 - (0, 10009) total_ips : 724.7216669636446
DLL 2020-09-24 12:00:03.497916 - (0, 10010) total_ips : 712.3506616369959
Ending Model Training ...
DLL 2020-09-24 12:00:04.676413 - () train_throughput : 737.3213028565264
```
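The final `train_throughput` line is the number that matters, but if you want to sanity-check it against the per-step `total_ips` samples yourself, a minimal stdlib-only parser does the job. This is just a sketch: `mean_ips` is a name I made up, and it assumes the DLLogger line format shown in the logs above.

```python
import re

# Matches the trailing "total_ips : <float>" field of a DLLogger line.
IPS_RE = re.compile(r"total_ips : ([0-9.]+)$")

def mean_ips(log_lines):
    """Average the per-step images/sec samples found in benchmark log lines."""
    samples = [float(m.group(1)) for line in log_lines
               if (m := IPS_RE.search(line))]
    return sum(samples) / len(samples) if samples else 0.0

sample = [
    "DLL 2020-09-24 11:31:27.347913 - (0, 200) total_ips : 751.749479809903",
    "DLL 2020-09-24 11:31:27.519114 - (0, 201) total_ips : 753.7019059064006",
]
print(round(mean_ips(sample), 1))  # -> 752.7
```

Feed it the full log and it should land close to the reported `train_throughput`, since the warm-up steps are already excluded from the printed range.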
Huh? Why are the numbers almost the same? Only about 10% more images per second?
It has over four thousand more CUDA cores, a slightly higher clock, and the fancy-sounding [TF32](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) to boot......
Did the boss's bad karma land us a dud? Going back over the hardware specs more carefully...
Wait! The Tensor core count got cut in half compared to the 2080 Ti? Over 270 fewer...
Just as every major framework starts supporting painless mixed-precision training, Jensen's blade falls and my Tensor cores are halved. It's basically NVIDIA telling me this is a gaming card (inner voice: it was always a gaming card...)
Well then, at least single-precision (FP32) training performance should be impressive, right?
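Before looking at the FP32 logs, here's a quick back-of-envelope check from the spec-sheet numbers above. Peak FP32 throughput is conventionally counted as 2 FLOPs (one fused multiply-add) per CUDA core per cycle; this ignores memory bandwidth and everything else, so it's an upper bound, not a prediction.

```python
def peak_fp32_tflops(cuda_cores, clock_ghz):
    # 2 FLOPs per core per clock (one fused multiply-add), in TFLOPS.
    return 2 * cuda_cores * clock_ghz / 1000.0

t_2080ti = peak_fp32_tflops(4352, 1.635)  # ~14.2 TFLOPS
t_3080 = peak_fp32_tflops(8704, 1.71)     # ~29.8 TFLOPS
print(f"paper ratio: {t_3080 / t_2080ti:.2f}x")  # -> paper ratio: 2.09x
```

On paper the 3080 should be roughly twice as fast in FP32, which makes the measured gap below all the more interesting.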
RTX 2080 Ti FP32 performance:
```
Per-GPU Batch Size 64
Current step: 0
Using step learning rate schedule
DLL 2020-09-24 11:52:42.534068 - (0, 200) total_ips : 166.09100758262463
DLL 2020-09-24 11:52:42.621722 - (0, 201) total_ips : 743.5761621690498
DLL 2020-09-24 11:52:42.876775 - (0, 202) total_ips : 251.44011569963058
DLL 2020-09-24 11:52:43.080435 - (0, 203) total_ips : 315.0022366429234
DLL 2020-09-24 11:52:43.310420 - (0, 204) total_ips : 279.01167245612174
DLL 2020-09-24 11:52:43.540366 - (0, 205) total_ips : 279.01312248658905
DLL 2020-09-24 11:52:43.770754 - (0, 206) total_ips : 278.5351401830369
DLL 2020-09-24 11:52:44.000982 - (0, 207) total_ips : 278.82011103551787
DLL 2020-09-24 11:52:44.231769 - (0, 208) total_ips : 278.18731223023076
DLL 2020-09-24 11:52:44.461415 - (0, 209) total_ips : 279.48401820779037
DLL 2020-09-24 11:52:44.693137 - (0, 210) total_ips : 276.96543754552715
DLL 2020-09-24 11:52:44.922950 - (0, 211) total_ips : 279.9784057106382
DLL 2020-09-24 11:52:45.152043 - (0, 212) total_ips : 280.26867993555925
DLL 2020-09-24 11:52:45.381787 - (0, 213) total_ips : 279.40082060548275
DLL 2020-09-24 11:52:45.613257 - (0, 214) total_ips : 277.1607536868146
DLL 2020-09-24 11:52:45.857313 - (0, 215) total_ips : 263.0840630730315
DLL 2020-09-24 11:52:46.085725 - (0, 216) total_ips : 280.96685887257576
DLL 2020-09-24 11:52:46.315955 - (0, 217) total_ips : 278.6053884962797
......
```
RTX 3080 FP32 performance:
```
Per-GPU Batch Size 64
Current step: 0
Using step learning rate schedule
DLL 2020-09-25 04:38:04.496081 - (0, 200) total_ips : 214.8337827120063
DLL 2020-09-25 04:38:04.565660 - (0, 201) total_ips : 933.743755282921
DLL 2020-09-25 04:38:04.747036 - (0, 202) total_ips : 353.74587164305376
DLL 2020-09-25 04:38:04.927488 - (0, 203) total_ips : 355.74343207311125
DLL 2020-09-25 04:38:05.108982 - (0, 204) total_ips : 353.46080645309956
DLL 2020-09-25 04:38:05.295723 - (0, 205) total_ips : 343.26257909085797
DLL 2020-09-25 04:38:05.477786 - (0, 206) total_ips : 352.44262206834435
DLL 2020-09-25 04:38:05.657313 - (0, 207) total_ips : 357.38072928523974
DLL 2020-09-25 04:38:05.838949 - (0, 208) total_ips : 353.50968204134097
DLL 2020-09-25 04:38:06.019750 - (0, 209) total_ips : 354.88746106536786
DLL 2020-09-25 04:38:06.203038 - (0, 210) total_ips : 349.7539485444913
DLL 2020-09-25 04:38:06.391324 - (0, 211) total_ips : 340.8453167893666
DLL 2020-09-25 04:38:06.571991 - (0, 212) total_ips : 355.20303179467635
DLL 2020-09-25 04:38:06.752804 - (0, 213) total_ips : 354.81662870912186
DLL 2020-09-25 04:38:06.934124 - (0, 214) total_ips : 353.97210793932106
DLL 2020-09-25 04:38:07.115773 - (0, 215) total_ips : 353.3840292569647
DLL 2020-09-25 04:38:07.296524 - (0, 216) total_ips : 355.56955274887576
DLL 2020-09-25 04:38:07.480445 - (0, 217) total_ips : 348.78643290840887
...
```
Hmm... a roughly 1.25x performance gap. I wonder how you data scientists would read and explain this?
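Putting the measured numbers side by side: the AMP figures are the `train_throughput` lines from the logs above, while the FP32 figures are eyeballed steady-state `total_ips` values, so treat those as approximate.

```python
amp_2080ti, amp_3080 = 652.999, 737.321  # train_throughput, images/sec
fp32_2080ti, fp32_3080 = 279.0, 354.0    # eyeballed steady-state total_ips

print(f"AMP speedup : {amp_3080 / amp_2080ti:.2f}x")    # -> AMP speedup : 1.13x
print(f"FP32 speedup: {fp32_3080 / fp32_2080ti:.2f}x")  # -> FP32 speedup: 1.27x
```

So the card with half the Tensor cores gains less from AMP than it does from plain FP32, which fits the halved-Tensor-core story.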
For reference, here is the GPU status during training.
RTX 2080 Ti:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.04 Driver Version: 455.23.04 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... On | 00000000:3B:00.0 Off | N/A |
| 59% 70C P2 264W / 260W | 8979MiB / 11019MiB | 95% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 7401 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 26982 C python 8969MiB |
+-----------------------------------------------------------------------------+
```
RTX 3080:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.04 Driver Version: 455.23.04 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3080 On | 00000000:5E:00.0 Off | N/A |
| 55% 71C P2 312W / 320W | 9291MiB / 10018MiB | 99% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3770 C python 9283MiB |
| 0 N/A N/A 7401 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
```
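The tables above are one-off `nvidia-smi` snapshots. If you'd rather log the whole run, the query interface is handier; these are standard `nvidia-smi` flags, and the output filename is just an example.

```shell
# Sample GPU utilization, power and memory every 5 seconds into a CSV.
nvidia-smi \
    --query-gpu=timestamp,name,utilization.gpu,power.draw,memory.used \
    --format=csv -l 5 > gpu_status.csv
```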
A few other puzzles came up during the runs. Why does the RTX 3080 take longer to start? I'll dig into that later with the [NVIDIA Visual Profiler](https://developer.nvidia.com/nvidia-visual-profiler). Does the NGC image's message **WARNING: Detected NVIDIA GeForce RTX 3080 GPU, which is not yet supported in this version of the container** hurt performance? And so on......
Also, does PCIe Gen4 actually make any difference?
All told: nearly 60W more power draw, nearly half the price, and roughly the same performance...
Oh well, I'll worry about it once the RTX 3090 arrives. That 24GB of memory does smell good; I'll throw in a BERT benchmark then!
Aspiring as I am to become a ~~connoisseur~~ data scientist, I'd better hurry back to using the RTX 3080 to appreciate other people's DeepFakes work on high-resolution video.
Oh, right! If you ask me how I could possibly use a gaming card with an NGC image for AI training...... I'd tell you this whole post is a fever dream generated by GPT-3!