Overflow & Underflow Testing of Float16
First, the setup: import TensorFlow (version 2.1) and assign the job to GPU 1.
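The original screenshot of this step is missing; a minimal sketch of the setup, assuming the GPU is selected via the `CUDA_VISIBLE_DEVICES` environment variable (which must be set before TensorFlow initializes):

```python
import os

# Make only GPU 1 visible to TensorFlow (device index taken from the post).
# This must be set before TensorFlow touches the GPU.
os.environ['CUDA_VISIBLE_DEVICES'] = '1'

import tensorflow as tf  # the post uses TensorFlow 2.1
from tensorflow import keras

print(tf.__version__)
```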
Following the TensorFlow 2 documentation, I limit my GPU so it only allocates as much memory as it actually uses (memory growth):
```python
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)
```
Alternatively, restrict TensorFlow to a specific GPU (or cap how much memory one is allowed to use):
```python
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only use the first GPU
    try:
        tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
    except RuntimeError as e:
        # Visible devices must be set before GPUs have been initialized
        print(e)
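To cap the memory amount rather than the visible devices, the TF 2.1 docs use a virtual device configuration. A sketch of that variant (the 1024 MB figure is just an example, not from the post):

```python
import tensorflow as tf

# Cap GPU 0 at 1024 MB via a virtual device (TF 2.1 experimental API).
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
```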
Next, set the policy so the global dtype becomes mixed precision, using `tf.keras.mixed_precision.experimental.Policy` and `tf.keras.mixed_precision.experimental.set_policy`.
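A minimal sketch of setting the global policy. The post uses the TF 2.1 `experimental` namespace; the `getattr` fallback below is my addition so the snippet also runs on newer TensorFlow, where the API was promoted out of `experimental`:

```python
import tensorflow as tf

# TF 2.1 exposes the API under mixed_precision.experimental;
# newer versions promote it to tf.keras.mixed_precision directly.
mp = getattr(tf.keras.mixed_precision, 'experimental', tf.keras.mixed_precision)
policy = mp.Policy('mixed_float16')
set_policy = getattr(mp, 'set_policy', None) or getattr(mp, 'set_global_policy')
set_policy(policy)

print('Compute dtype: %s' % policy.compute_dtype)    # float16
print('Variable dtype: %s' % policy.variable_dtype)  # float32
```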
At this point an INFO message appears telling you whether your GPU can actually do mixed-precision computation.
Now build the model. After tracing the graph, confirm whether the computation actually runs on the GPU or the CPU:
```python
inputs = keras.Input(shape=(784,), name='digits')
if tf.config.list_physical_devices('GPU'):
    print('The model will run with 4096 units on a GPU')
    num_units = 4096
else:
    # Use fewer units on CPUs so the model finishes in a reasonable amount of time
    print('The model will run with 64 units on a CPU')
    num_units = 64
```
Next, verify that under mixed precision the computations use float16 while the variables are stored in float32; the paper discussed earlier explains why this is done, so I won't repeat it here.
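The screenshot defining the dense layers is missing; the sketch below reconstructs the `dense1` and `x` that the check refers to, following the structure of the official mixed-precision guide (layer names and sizes here are assumptions):

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Set the global mixed-precision policy (TF 2.1 experimental API, with a
# fallback for newer TensorFlow where the API left the experimental namespace).
mp = getattr(tf.keras.mixed_precision, 'experimental', tf.keras.mixed_precision)
(getattr(mp, 'set_policy', None) or getattr(mp, 'set_global_policy'))(mp.Policy('mixed_float16'))

inputs = keras.Input(shape=(784,), name='digits')
num_units = 64  # CPU-friendly size; the post uses 4096 on GPU
dense1 = layers.Dense(num_units, activation='relu', name='dense_1')
x = dense1(inputs)
dense2 = layers.Dense(num_units, activation='relu', name='dense_2')
x = dense2(x)

# Computations run in float16, while the kernel variable stays float32
print('x.dtype:', str(x.dtype))
print('dense1.kernel.dtype:', str(dense1.kernel.dtype))
```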
```python
print('x.dtype: %s' % x.dtype.name)
# 'kernel' is dense1's variable
print('dense1.kernel.dtype: %s' % dense1.kernel.dtype.name)
```
Next, the last layer needs fixing: the final output must be float32, but since the global policy was set to mixed precision, the softmax layer has to be adjusted.
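The screenshot of the fix is missing; a sketch following the pattern from the official mixed-precision guide: keep the logits `Dense` layer under the policy, but give the softmax `Activation` an explicit float32 dtype (layer names here are assumptions):

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Set the global mixed-precision policy (TF 2.1 API with a newer-TF fallback).
mp = getattr(tf.keras.mixed_precision, 'experimental', tf.keras.mixed_precision)
(getattr(mp, 'set_policy', None) or getattr(mp, 'set_global_policy'))(mp.Policy('mixed_float16'))

inputs = keras.Input(shape=(784,), name='digits')
x = layers.Dense(64, activation='relu')(inputs)
x = layers.Dense(10, name='dense_logits')(x)  # logits stay float16 under the policy
# Overriding dtype keeps the softmax (and thus the loss) numerically stable in float32
outputs = layers.Activation('softmax', dtype='float32', name='predictions')(x)

print('Outputs dtype:', str(outputs.dtype))
```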
Here I run an experiment comparing mixed precision, fully float32, and fully float16. Since the experiment runs on GPU 1, the memory figures below are for GPU 1.
GPU 0 Memory Usage
Mixed Precision
Result
GPU Memory Usage
Float32
Global Policy
Result
GPU Memory Usage
Float16
Result
GPU Memory Usage
Float64
Global Policy
Here I slightly changed the label setup, casting the labels to float64.
GPU Memory Usage
Result
You can see that pure float16 is the fastest, but it simply fails to train. Mixed precision is about twice as fast as float32, with little difference in accuracy. (I didn't use the same random seed, didn't raise the epoch count, and only ran each configuration once, so the procedure has some minor flaws, but the goal was just a quick test of the differences in speed and GPU memory usage.)
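The overflow/underflow behaviour behind the failed pure-float16 run can be reproduced directly with NumPy:

```python
import numpy as np

# The largest finite float16 value is 65504; doubling it overflows to inf.
big = np.float16(65504)
with np.errstate(over='ignore'):
    doubled = big * np.float16(2)
print(np.isinf(doubled))  # True: overflow

# The smallest positive subnormal float16 is 2**-24 (~6e-8);
# much smaller values (e.g. tiny gradients) underflow to zero.
print(np.float16(1e-8) == 0)  # True: underflow
```

This is why mixed precision pairs float16 computation with float32 variables (and loss scaling): small gradient values that would underflow in float16 are rescaled into its representable range.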