cs231n Lecture2-LossFunc&Optimization

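---
tags: cs231n, computer vision, deep learning
---

# cs231n Lecture2-LossFunc&Optimization

## Overview

* To know how good the classifier is
  ![](https://i.imgur.com/iqPsqsM.png)
* The Loss Function:
  ![](https://i.imgur.com/XXvKP3k.png)
* This lecture introduces two loss functions for linear classifiers
* **Recall - Linear Classifier:**
  ![](https://i.imgur.com/XA372Or.png)
* **Multi-class SVM:**
    * loss function:
      ![](https://i.imgur.com/bSLLrCg.png)
      **p.s. For a linear classifier the score is just Wx (= f).**
      **The SVM loss sums max(0, other class's score - correct class's score + margin) over every other class; the example code uses a margin of 1.**
    * Example Code:
      ```
      import numpy as np

      def L_i_vectorized(x, y, W):
          # x: one input vector, y: index of the correct class, W: weight matrix
          scores = W.dot(x)                                # class scores f = Wx
          margins = np.maximum(0, scores - scores[y] + 1)  # hinge with margin 1
          margins[y] = 0                                   # skip the correct class
          loss_i = np.sum(margins)
          return loss_i
      ```
    * Add regularization:
      ![](https://i.imgur.com/R7WActO.png)
      ![](https://i.imgur.com/0MjznZ1.png)
* **Softmax Classifier:**
    * **The score is still Wx. Exponentiate and normalize the scores into probabilities, then look only at the probability of the correct class, P(Y = y_i | X = x_i).**
    * loss function:
      ![](https://i.imgur.com/W4NOXYL.png)
    * softmax example:
      ![](https://i.imgur.com/31xFRYn.png)
* **Comparison**: Suppose we have an image of a car, and the car score is already higher than every other class's score. If we change the scores for this image slightly, does the loss change?
    * In SVM case: SVM only cares about the margin between each other class's score and the correct class's score. After a small change the car score is still the largest, so each margin term most likely stays below 0 and max(0, margin) clips it to 0. The SVM loss barely changes, or does not change at all.
    * In Softmax case: The softmax loss normalizes over all of the scores, so any change to the scores changes the loss.
    * Implication: Because its loss stays sensitive to the scores, softmax can keep improving by fine-tuning the weights, whereas SVM stops caring about a data point once its correct-class score beats the other scores by the margin.
* **Workflow summary**:
  ![](https://i.imgur.com/bZ2PmyY.png)
* **How to choose the best weights?**
    * Strategy 1 - Random search -> **BAD**
      ```
      import numpy as np

      # assumes x_train, y_train and a loss function L(X, y, W) are already defined
      bestloss = float('inf')
      bestweight = None
      for num in range(1000):
          W = np.random.randn(10, 3073) * 0.0001  # small random 10 x 3073 weight matrix
          loss = L(x_train, y_train, W)
          if loss < bestloss:  # keep the best weights seen so far
              bestloss = loss
              bestweight = W
          print('in attempt %d, loss = %f, bestloss = %f' % (num, loss, bestloss))
      ```
    * Strategy 2 - Follow the slope
        * Picture walking downhill: to reach the lowest point, look at the slope (gradient) to choose a direction
          ![](https://i.imgur.com/zNfZ8O0.png)
        * Recall from calculus...
          ![](https://i.imgur.com/gNGVmv7.png)
          p.s. here f(x) is the loss function
        * Now picture a multi-dimensional vector...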
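          The gradient is the vector of partial derivatives of the loss function along each dimension
        * Numerical gradient - terribly slow
          ![](https://i.imgur.com/UjKAOwr.png)
        * Analytic gradient - fast, error-prone
          The loss is just a function of W, so differentiate the loss function with respect to W directly
          ![](https://i.imgur.com/qGI6SYU.png)
        * When to use analytic vs. numerical
          Because the analytic gradient is much faster, we normally use it to compute the gradient. The numerical gradient is still useful as a check that the analytic gradient is correct.
        * Once we know the gradient - Gradient Descent
          ```
          # pseudocode: evaluate_grad, loss_func, data, weights and step_size are placeholders
          while True:
              weights_grad = evaluate_grad(loss_func, data, weights)
              weights += -step_size * weights_grad  # step against the gradient
          ```
          p.s. Why do we subtract when updating W? (see the figure below)
          ![](https://i.imgur.com/PS5BJro.png)
        * SGD (Stochastic Gradient Descent)
          ![](https://i.imgur.com/UZLjOAX.png)
            - When N is very large, the sum over all examples is expensive to compute
            - So we use a minibatch and compute the gradient on only part of the data
          ```
          # pseudocode: sample_training_data and evaluate_grad are placeholders
          while True:
              data_batch = sample_training_data(data, 256)  # sample 256 examples
              weights_grad = evaluate_grad(loss_func, data_batch, weights)
              weights += -step_size * weights_grad
          ```
* Image Feature
    * Why do we need image features?
      Given what we know about linear classifiers, feeding the raw image straight into the classifier is clearly not a good approach.
      ex: The data on the left is hard to separate with a line, but after a feature transform (here, a conversion to polar coordinates) a linear classifier easily separates the red and blue points.
      ![](https://i.imgur.com/l28J3KX.png)
      So for a linear classifier, extracting features first can work far better than feeding in the raw pixels.
    * Feature extraction methods (before DNNs)
        * Color Histogram
          ![](https://i.imgur.com/pV6S9h0.png)
        * Histogram of Oriented Gradients (HoG)
          ![](https://i.imgur.com/4FHvpH9.png)
        * Bag of words
          ![](https://i.imgur.com/YSdtCfl.png)
    * Comparison with CNNs
      ![](https://i.imgur.com/10NXI6P.png)
        - Traditional pipelines first compute a hand-designed image feature representation, then train on those features
        - A CNN extracts high-dimensional deep features with its convolutional layers. We can treat it as a black box and train directly on the raw images
        - In short, two steps become one: we no longer have to decide how to express the feature representation or pre-process images with feature extraction, because the convolutional network handles it.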
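
The note on analytic vs. numerical gradients can be made concrete with a small gradient check. This is only a sketch under stated assumptions: `svm_loss_i` mirrors `L_i_vectorized` from the notes, while `svm_grad_analytic`, `svm_grad_numerical`, the step `h`, and the random test data are names and values made up here for illustration. The numerical version uses centered differences, which is one common choice.

```python
import numpy as np

def svm_loss_i(x, y, W):
    """Multi-class SVM (hinge) loss for one example, margin 1."""
    scores = W.dot(x)
    margins = np.maximum(0, scores - scores[y] + 1)
    margins[y] = 0
    return np.sum(margins)

def svm_grad_analytic(x, y, W):
    """Analytic gradient of svm_loss_i with respect to W."""
    scores = W.dot(x)
    margins = scores - scores[y] + 1
    active = (margins > 0).astype(float)  # 1 for classes that violate the margin
    active[y] = 0                         # the correct class never counts as a violation
    dW = np.outer(active, x)              # each violating class row gets +x
    dW[y] = -np.sum(active) * x           # the correct class row gets -x per violation
    return dW

def svm_grad_numerical(x, y, W, h=1e-5):
    """Centered finite differences, perturbing one entry of W at a time (slow)."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + h
        loss_plus = svm_loss_i(x, y, W)
        W[idx] = old - h
        loss_minus = svm_loss_i(x, y, W)
        W[idx] = old                      # restore the original weight
        grad[idx] = (loss_plus - loss_minus) / (2 * h)
    return grad

# Hypothetical toy problem: 3 classes, 4 input dimensions.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
x = rng.standard_normal(4)
y = 1

max_diff = np.max(np.abs(svm_grad_analytic(x, y, W) - svm_grad_numerical(x, y, W)))
print("max abs difference between analytic and numerical gradient:", max_diff)
```

The numerical gradient needs two loss evaluations per entry of W, which is why it is only practical as a check on small problems. A tiny difference between the two results suggests the analytic formula is right.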