---
tags: cs231
---

# Lecture 3: Loss Functions and Optimization

## SVM QA in class

### 1. What happens to loss if car scores change a bit?

If we jiggle the scores for this car image a little bit, the loss will not change, because the car score is already quite a bit larger than the others. We’ll still get zero loss.

### 2. What is the min/max possible loss for SVM?

#### Answer: max is infinity, min is zero

![](https://i.imgur.com/TkXcDd1.png)

### 3. At initialization W is small so all s ≈ 0. What is the loss when you’re using multi class SVM?

- Answer: the number of classes minus 1, i.e. $C - 1$
- Original formula: ![](https://i.imgur.com/LwO6fkB.png)
- At initialization $S_j \approx S_{y_i}$, so every term is about 1; summing over all $C$ classes would give a loss of $C$
- The sum excludes the correct class itself, which leaves $C - 1$

### 4. What if the sum was over all classes? (including $j = y_i$)

- Answer: the loss increases by 1
- The original SVM loss sums over the incorrect classes only
- For the correct class ($j = y_i$) we have $S_j = S_{y_i}$, so its term is $\max(0, S_j - S_{y_i} + 1) = 1$
- Including the correct class therefore adds 1 to the loss

### 5. What if we used mean instead of sum?

- Answer: it makes no difference
- Taking the mean only rescales the loss by a constant. We only care that the correct score is larger than the incorrect scores, not about the absolute value of the loss

### 6. What if we used ![](https://i.imgur.com/Ds231g0.png)? Is this the same as the previous one?

- Original formula: ![](https://i.imgur.com/LwO6fkB.png)
- Answer: no, it is a different loss. The original penalizes margin violations linearly, while the squared version is nonlinear

---

## Regularization

### 1. Why do we need regularization?

Q: Suppose that we found a W such that L = 0. Is this W unique?

A: No! 2W also has L = 0.

How do we choose between W and 2W?

### 2. Regularization

- Prefer simpler models

\begin{equation}
\begin{split}
L(W)&=\dfrac{1}{N}\sum_{i=1}^{N}L_{i}(f(x_{i},W),y_{i})+\lambda R(W)\\
&=Data\ loss+Regularization
\end{split}
\end{equation}

### 3. Examples

- L2 regularization: $R(W)=\sum_{k}\sum_{l}W_{k,l}^{2}$
- L1 regularization: $R(W)=\sum_{k}\sum_{l}|W_{k,l}|$
- Elastic net: $R(W)=\sum_{k}\sum_{l}\beta W_{k,l}^{2}+|W_{k,l}|$

=> Comparison

- L1 drives the weights of unimportant features all the way to zero (a sparse result), while L2 does not. For the reason, see the answer to Group 4, Q1.
- L2 is convenient to compute and has a unique solution; L1 does not necessarily.
- When to use which: L2 is a good default, but if you suspect that only a few features are actually useful, you should prefer L1 or Elastic Net since they tend to reduce the useless features’ weights down to zero as we have discussed.
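  In general, Elastic Net is preferred over L1 since L1 may behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated.

## Softmax QA in class

### Softmax Loss

![](https://i.imgur.com/uu79fKD.png=5*5), where ![](https://i.imgur.com/lq7xs04.png)

1. Softmax exponentiates the scores so that every value is > 0
2. It then normalizes so that all values sum to 1, as probabilities must

### 1. What is the min/max possible loss for softmax?

:::warning
min: 0
When the predicted probability of the correct class is 1, the loss is $-\log(1) = 0$.

max: infinity
The smaller the predicted probability of the correct class, the larger $-\log(x)$; as the probability approaches 0, the loss approaches infinity.
![](https://i.imgur.com/dGpmvIU.png)
:::

### 2. At initialization all s will be approximately equal; what is the loss?

:::warning
Let: $s_{y_i} \approx s_j \approx X$

$-\log\left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\right) \approx -\log\left(\frac{e^{X}}{Ce^{X}}\right) = -\log\left(\frac{1}{C}\right) = \log(C)$
:::

## SVM vs. Softmax

### Suppose I take a datapoint and change its score slightly. What happens to the loss in both cases?

:::warning
- SVM: as long as every margin already exceeds the threshold, the loss does not change and W is essentially not optimized any further
- Softmax: keeps optimizing, pushing the probability of the correct class toward 1 (score → +∞) and the probabilities of the wrong classes toward 0 (score → −∞)
:::

---

## Optimization: follow the slope

### There are two ways to compute the gradient: numerically and analytically.

### 1. Numerical gradient: approximate, slow, easy to write

- $\dfrac{df(x)}{dx}\approx\dfrac{f(x+\epsilon)-f(x-\epsilon)}{2\epsilon}$

### 2. Analytic gradient: exact, fast, error-prone

- If $f(x)=x^2$ then $f'(x)=2x$
- error-prone: differentiating by hand invites calculation mistakes

### gradient check: use analytic gradient, but check implementation with numerical gradient

In practice we use the analytic gradient and verify it against the numerical gradient. Example:

$$
\begin{aligned}
&f(x)=5x_{1}^2+2x_{2}^2+7x_{1}x_{2}+10x_{1}+x_{2}+1 \\
&\nabla f(x)=
\begin{bmatrix}
10x_{1}+7x_{2}+10\\
4x_{2}+7x_{1}+1
\end{bmatrix}
=
\begin{bmatrix}
0\\
0
\end{bmatrix}\\
&\begin{bmatrix}
x_{1}\\
x_{2}
\end{bmatrix}
=
\begin{bmatrix}
\dfrac{11}{3}\\
\dfrac{-20}{3}
\end{bmatrix}
\end{aligned}
$$

At $\begin{bmatrix} x_{1}\\ x_{2} \end{bmatrix} = \begin{bmatrix} \dfrac{11}{3}\\ \dfrac{-20}{3} \end{bmatrix}$ we get $f(x)=16$, but at $\begin{bmatrix} x_{1}\\ x_{2} \end{bmatrix} = \begin{bmatrix} -20\\ 20 \end{bmatrix}$ we get $f(x)=-179$. The point where the gradient vanishes is therefore not a minimum: the Hessian $\begin{bmatrix} 10 & 7\\ 7 & 4 \end{bmatrix}$ has determinant $10\cdot 4-7^2=-9<0$, so it is a saddle point.

---

## Image Features vs. ConvNets

![](https://i.imgur.com/cPMzgty.png)

- Earlier approach: features (color histogram, ...) are extracted from the image. During training the features stay fixed and only the model is updated, until training is run again
- CNN approach: the pipeline is largely the same; the only difference is that the features are learned from the data by the model itself

---

## Questions

| Group | <center> Question </center> |
|:--------:| -------- |
| Group 1 | 1. In the SVM loss formula, if we add a value smaller than 1 instead of 1 to $S_j-S_{y_i}$, could that affect the resulting W? <br> 2. Is regularization always used when building a model? <br> 3. How does regularization affect the training and validation loss? |
| Group 2 | <center>Presenting group</center> |
| Group 3 | 1. In gradient descent, is there a way to adjust the learning rate automatically? <br> 2. Why can the loss still increase when doing gradient descent? |
| Group 4 | 1. Why does L1 tend to drive $W$ to 0 while L2 tends to shrink the coefficients? <br> 2. Can we drop the log around the scores? |

## Answers

### Group 1

#### Q1 In the SVM loss formula, if we add a value smaller than 1 instead of 1 to $S_j-S_{y_i}$, could that affect the resulting W?

- The larger the threshold, the more confident the classifier has to be :thumbsup:
- Whatever value is added will affect the resulting $W$; only the size of the effect differs
- With a small threshold, the margin condition ($S_{y_i}$ exceeding $S_j$ by at least the threshold) is met very easily

#### Q2 Is regularization always used when building a model?

Regularization is usually added in order to avoid overfitting.

#### Q3 How does regularization affect the training and validation loss?

The training loss goes up, and the validation loss typically goes down.

### Group 3

#### Q1 In gradient descent, is there a way to adjust the learning rate automatically?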
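- Yes. Adagrad keeps adjusting the learning rate during training, a technique known as learning rate decay. Neural network training usually starts with a large learning rate and then moves to smaller ones: a large learning rate reaches the neighborhood of the optimum faster and can jump out of local extrema, but a small learning rate is needed later on to settle on the extremum.
- $W \leftarrow W - \eta \dfrac{1}{\sqrt{n+\epsilon}}\dfrac{\partial L}{\partial W}$, where $n = \sum_{r=1}^t (\dfrac{\partial L}{\partial W})^2$
- $n$ is the sum of the squares of all gradients seen so far (steps 1 through $t$), and it is used to scale the learning rate. $\epsilon$ is a smoothing term that keeps the denominator from being 0; a typical value is 1e-8.
- Early on $n$ is small, which amplifies the learning rate; later $n$ is large, which restrains it.

#### Q2 Why can the loss still increase when doing gradient descent?

- If the learning rate (the gradient step) is too large, an update can overshoot a local minimum, and the loss can go up as a result.

### Group 4

#### Q1 Why does L1 tend to drive $W$ to 0 while L2 tends to shrink the coefficients?

- L1 is not differentiable at $z = 0$, so the subgradient is used there ![](https://i.imgur.com/32snTuR.png) [Proof using the subgradient that L1 tends to drive $W$ to 0](https://zhuanlan.zhihu.com/p/30535220).
- L2 is continuously differentiable, so no subgradient is needed and nothing forces the weights to exactly 0.

#### Q2 Can we drop the log around the scores?

- softmax ![](https://i.imgur.com/6NUj4FC.jpg=10*10)
- logsoftmax ![](https://i.imgur.com/IfoCraF.jpg=10*10)

=> The key point is backpropagation: compared with softmax, the gradient of logsoftmax has one fewer term, so it avoids the very poor error propagation that softmax shows when one of its components is extremely large or extremely small.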
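
To make the answer to Q2 above a little more concrete, here is a minimal sketch in plain Python. It is not from the lecture, and the function names (`softmax`, `log_softmax`, `softmax_loss`, `loss_gradient`) are hypothetical. It assumes the usual cross-entropy loss $-\log p_{y_i}$ and checks two things: equal scores give a loss of $\log(C)$, and for a badly wrong prediction the softmax probability underflows to 0 while the log-softmax loss and its gradient stay finite.

```python
import math

def softmax(scores):
    """Softmax probabilities, shifted by the max score to avoid overflow."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(scores):
    """Log-probabilities computed directly with the log-sum-exp trick."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

def softmax_loss(scores, correct):
    """Cross-entropy loss -log p_correct, computed through log_softmax."""
    return -log_softmax(scores)[correct]

def loss_gradient(scores, correct):
    """Gradient of the loss w.r.t. the scores: p_j - 1[j == correct]."""
    probs = softmax(scores)
    return [p - (1.0 if j == correct else 0.0) for j, p in enumerate(probs)]

# Sanity check from the notes: equal scores give a loss of log(C).
C = 10
init_loss = softmax_loss([0.0] * C, correct=0)

# A badly wrong prediction: the correct class (index 0) scores far below class 1.
bad_scores = [0.0, 1000.0]
p_correct = softmax(bad_scores)[0]         # underflows to 0.0, so log(p) would fail
stable_loss = softmax_loss(bad_scores, 0)  # still finite
bad_grad = loss_gradient(bad_scores, 0)    # still a usable error signal
```

Taking `math.log(p_correct)` here would raise a `ValueError` because the probability has underflowed to zero, whereas computing the log-probability directly keeps both the loss and the gradient well defined.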