{%hackmd @themes/dracula %}

## FP32 MASTER COPY OF WEIGHTS

==What goes wrong if we cast all of the parameters from FP32 to FP16?==

**Underflow**: about 5% of gradient values are smaller than $2^{-24}$, the smallest magnitude FP16 can represent; after being multiplied by the learning rate they vanish, so the update $W = W - \eta \cdot \frac{dl}{dw}$ changes nothing.

**Exponent alignment during addition**: when the update is added to the weight, the two binary representations must first be aligned to the weight's (larger) exponent. The much smaller $\eta \cdot \frac{dl}{dw}$ term is shifted right, and once the shift exceeds the FP16 mantissa width the term becomes 0. A numeric demo of both failure modes is given at the end of this note.

## LOSS SCALING

The gradient-value histogram (figure not reproduced here) shows that a large fraction of the gradients fall below $2^{-24}$. So before running backward, the loss is scaled up; this shifts the gradients into FP16's representable range, and they are unscaled again before the optimizer step (a hand-written version of this recipe is sketched at the end of this note).

## Example of autocast

```python=
import torch
import torch.optim as optim

# Creates model and optimizer in default (FP32) precision
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

for input, target in data:
    optimizer.zero_grad()

    # Enables autocasting for the forward pass (model + loss)
    with torch.autocast(device_type="cuda"):
        output = model(input)
        loss = loss_fn(output, target)

    # Exits the context manager before backward()
    loss.backward()
    optimizer.step()
```

## autocast library

On entry, `autocast.__enter__` saves the previous autocast state and lowers the precision used for the forward pass (abridged excerpt from the PyTorch source):

```python=
class autocast:
    def __enter__(self):
        # Save the previous state so __exit__ can restore it when nesting.
        self.prev_cache_enabled = torch.is_autocast_cache_enabled()
        if self.device == "cpu":
            self.prev = torch.is_autocast_cpu_enabled()
            self.prev_fastdtype = torch.get_autocast_cpu_dtype()
            # Install the requested enabled flag and reduced dtype.
            torch.set_autocast_cpu_enabled(self._enabled)
            torch.set_autocast_cpu_dtype(self.fast_dtype)  # type: ignore[arg-type]
            torch.autocast_increment_nesting()
```

The dtype used during the forward pass is read from a thread-local. Note that the CPU default is `bfloat16`; on CUDA the default autocast dtype is `float16`:

```cpp=
thread_local at::ScalarType autocast_cpu_dtype = at::kBFloat16;

at::ScalarType get_autocast_cpu_dtype() {
  return autocast_cpu_dtype;
}
```

## Example of GradScaler

```python=
import torch
from torch.cuda.amp import GradScaler

# model, optimizer, data, loss_fn as in the autocast example above
scaler = GradScaler()

for epoch in epochs:
    for i, (input, target) in enumerate(data):
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            output = model(input)
            loss = loss_fn(output, target)
            # Normalize so gradients accumulated over
            # iters_to_accumulate steps average correctly.
            loss = loss / iters_to_accumulate

        # Accumulates scaled gradients.
        scaler.scale(loss).backward()

        if (i + 1) % iters_to_accumulate == 0:
            # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
```
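To make the underflow and swamping failure modes from the first section concrete, here is a small demo. The constants are illustrative, not from the original note: $10^{-8} < 2^{-24}$ so it flushes to zero, and at $2048 = 2^{11}$ the spacing between adjacent FP16 values is 2, so an update of 1.0 is lost during exponent alignment.

```python=
import torch

# Underflow: magnitudes below 2^-24 are not representable in FP16.
print(torch.tensor(1e-8, dtype=torch.float16))  # tensor(0., dtype=torch.float16)

# Swamping: at 2048 = 2^11 the FP16 spacing is 2, so adding 1.0
# rounds back to 2048 and the update disappears.
w = torch.tensor(2048.0, dtype=torch.float16)
print(w + 1.0)  # tensor(2048., dtype=torch.float16)

# The FP32 master copy keeps the update: this is why weights are
# stored in FP32 and only cast down to FP16 for the forward pass.
w32 = torch.tensor(2048.0, dtype=torch.float32)
print(w32 + 1.0)  # tensor(2049.)
```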
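The loss-scaling recipe from the LOSS SCALING section can also be written out by hand. Below is a minimal sketch of *static* loss scaling, assuming a fixed scale factor `S = 2 ** 14` (an illustrative choice) and reusing `model`, `optimizer`, `data`, and `loss_fn` from the autocast example; `GradScaler` replaces this constant with a dynamically adjusted scale and skips steps whose gradients contain infs or NaNs.

```python=
import torch

S = 2 ** 14  # fixed loss scale; chosen for illustration only

for input, target in data:
    optimizer.zero_grad()

    with torch.autocast(device_type="cuda"):
        output = model(input)
        loss = loss_fn(output, target)

    # Scale the loss so small gradients survive the FP16 backward pass.
    (loss * S).backward()

    # Unscale the gradients (held in FP32) before the optimizer uses them.
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p.grad.div_(S)

    optimizer.step()
```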
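The source excerpts in the autocast library section show that autocast is just thread-local state; what that state does is choose a dtype *per operation*, not for the whole region. A small check of this, assuming PyTorch ≥ 1.10 and run on CPU so it needs no GPU (per the C++ snippet above, the CPU default dtype is `bfloat16`):

```python=
import torch

a = torch.randn(8, 8)  # created in the default FP32
b = torch.randn(8, 8)

with torch.autocast(device_type="cpu"):
    c = a @ b  # matmul is on autocast's lower-precision op list
    d = a + b  # pointwise add is not, so it stays in FP32

print(c.dtype)  # torch.bfloat16
print(d.dtype)  # torch.float32
```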
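The `unscale_` comment in the GradScaler example refers to the pattern below: gradients produced by `scaler.scale(loss).backward()` are still multiplied by the scale factor, so they must be unscaled before any operation that inspects their magnitude, such as gradient clipping. A sketch continuing from that example, with an illustrative `max_norm` of 1.0:

```python=
scaler.scale(loss).backward()

# Bring gradients back to their true (unscaled) values in place.
scaler.unscale_(optimizer)

# Magnitude-sensitive operations now see the real gradient norms.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# step() detects that unscale_ was already called and does not
# unscale a second time.
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
```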