# PyTorch Error Notes
## Composed Weight (a weight pieced together from a few variables)
### Problem Description
One day, on a whim, I wanted to compose a conv2d weight from just a few variables: for example, two variables a and b combined into the convolution weight w:
> \[\[a,0,b\],
> \[a,0,b\],
> \[a,0,b\]\]
The convolution is done with torch.nn.functional.conv2d; the code is as follows:
```python=3.7
import torch
import torch.optim as optim
import torch.nn.functional as F
def concatAB(a, b):
    # Assemble the 1x1x3x3 conv weight [[a, 0, b], [a, 0, b], [a, 0, b]] from the two scalars.
    zero = torch.zeros_like(a)           # torch.stack needs tensors, not the plain int 0
    c = torch.stack((a, zero, b), -1)    # one row: [a, 0, b]
    return torch.stack((c, c, c), 0).unsqueeze(0).unsqueeze(0)
a = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(-1.2, requires_grad=True)
optimizer = optim.Adam([a, b], lr=0.001)
w = concatAB(a, b)  # w is built from a and b only once
inputs = torch.randn(1, 1, 5, 5)
out = torch.sum(F.conv2d(inputs, w, padding=1)**2)
out.backward() # fine
optimizer.step()
optimizer.zero_grad()
out = torch.sum(F.conv2d(inputs, w, padding=1)**2)
out.backward() #RuntimeError
'''
RuntimeError: Trying to backward through the graph a second time,
but the saved intermediate results have already been freed.
Specify retain_graph=True when calling backward the first time.
'''
```
### Debugging
The RuntimeError says that the previous backward already freed the intermediate results of the graph, and that if I really need to call backward twice I should write out.backward(retain_graph=True). The thing is, from earlier experiments, this problem never appears when the weight is a leaf tensor, i.e. a tensor not produced by any tensor operation (e.g. cat, stack, add, mul, etc.).
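A minimal sketch of that earlier observation (the sizes and number of iterations here are arbitrary): a weight created directly with requires_grad=True has nothing upstream of it in the graph, so every forward pass builds a fresh graph and repeated backward calls work without retain_graph=True.
```python
import torch
import torch.optim as optim
import torch.nn.functional as F

# The weight is a leaf tensor: nothing upstream of w in the autograd graph.
w = torch.randn(1, 1, 3, 3, requires_grad=True)
optimizer = optim.Adam([w], lr=0.001)
inputs = torch.randn(1, 1, 5, 5)

for _ in range(3):
    out = torch.sum(F.conv2d(inputs, w, padding=1)**2)
    out.backward()       # fine on every iteration
    optimizer.step()
    optimizer.zero_grad()
```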
So I did the following checks:
1. Printed w and found that w was never updated.
2. So I inserted w = concatAB(a, b) before the second forward pass to refresh the convolution weight (a loop-style sketch follows this list).
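Concretely, the fix is just to rebuild w inside every step. A minimal sketch reusing a, b, inputs, optimizer and concatAB from the block above:
```python
# Rebuild the composed weight every step so each forward pass
# gets a fresh a, b -> w -> out graph.
for _ in range(3):
    w = concatAB(a, b)
    out = torch.sum(F.conv2d(inputs, w, padding=1)**2)
    out.backward()       # no RuntimeError
    optimizer.step()
    optimizer.zero_grad()
```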
That actually solved the problem, but I was curious what error would appear if w were left stale, so I took the original code and changed out.backward() to out.backward(retain_graph=True):
```python=3.7
out = torch.sum(F.conv2d(inputs, w, padding=1)**2)
out.backward(retain_graph=True) # fine
optimizer.step()
optimizer.zero_grad()
out = torch.sum(F.conv2d(inputs, w, padding=1)**2)
out.backward() #RuntimeError
'''
RuntimeError: one of the variables needed for gradient computation has been
modified by an inplace operation: [torch.FloatTensor []] is at version 11;
expected version 10 instead. Hint: enable anomaly detection to find the operation
that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
'''
```
Here version 11 is the current update count of the leaf tensors a and b (optimizer.step() updates them in place), while version 10 is their count at the time w was built from them. PyTorch counts how many times each leaf tensor has been updated precisely to stop users from accidentally reusing something from the previous step. Accordingly, if the optimizer is removed so that a and b are never updated, there is no error.
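This counter can be peeked at through the _version attribute (an internal attribute, so treat it as an inspection aid only); a tiny sketch, with add_ standing in for the in-place update that optimizer.step() performs:
```python
import torch

a = torch.tensor(1.0, requires_grad=True)
print(a._version)    # 0

with torch.no_grad():
    a.add_(0.1)      # an in-place update, like the one optimizer.step() applies
print(a._version)    # 1 -- the number the autograd version check compares
```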
### Conclusion
1. If you also train with composed weights like this, remember to rebuild the weight at every step.
2. If the training needs several backward passes, e.g. GANs, use a separate optimizer per model so the weight updates of different models don't clash and trigger this error.
3. Alternatively, cut the gradient of the generator's fake images with tensor.detach() before feeding them to the discriminator (see the sketch after this list).
4. Needing backward(retain_graph=True) is usually not normal and hints that the training design has a problem. If you learned from a reliable source that some special training trick genuinely requires it, fine; but if you only reach for retain_graph=True because a RuntimeError suggested it, don't.
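A minimal sketch of points 2 and 3 together (the tiny Linear generator/discriminator, batch size and BCE losses are placeholders of mine, not a real model): separate optimizers plus .detach() keep each backward inside its own graph, so no retain_graph=True is needed.
```python
import torch
import torch.nn as nn
import torch.optim as optim

G = nn.Linear(8, 8)                                  # stand-in generator
D = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid())     # stand-in discriminator
opt_G = optim.Adam(G.parameters(), lr=1e-3)
opt_D = optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for _ in range(3):
    real = torch.randn(4, 8)
    fake = G(torch.randn(4, 8))

    # Discriminator step: detach() cuts the graph running back into G,
    # so this backward never touches G's part of the graph.
    d_loss = bce(D(real), torch.ones(4, 1)) + bce(D(fake.detach()), torch.zeros(4, 1))
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # Generator step: a fresh forward through D; backward flows into G
    # through fake's still-unconsumed graph, no retain_graph needed.
    g_loss = bce(D(fake), torch.ones(4, 1))
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
```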