---
tags: optimization,gradient descent
---
gradient descent
===
## Basic Mathematical Principles
Gradient descent is currently the most widely used method for optimizing parameters in machine learning. It searches iteratively in the direction of the optimum, updating the parameters at every step as it moves toward the best solution. In machine learning we first define a loss function (also called the objective function), and we look for parameter values that drive this function toward its minimum, which in turn yields the best model. We now describe this from a mathematical point of view.

Let $f(\theta)$ be the objective function, which is differentiable, and let $\theta$ be a finite set of parameters. Given a particular point $\theta_0$, we would like to find the direction $d$ s.t. $f(\theta_0+d)$ decreases, i.e.
$$f(\theta_0+d)<f(\theta_0)$$
In order to find the steepest direction, we can approximate the loss function via a first-order Taylor expansion as follows:
$$f(\theta_0+d) \approx f(\theta_0)+ \bigtriangledown f(\theta_0)^Td$$
Finding the direction $d$ that decreases the loss function the most leads to the following optimization problem:
$$\min_{d:||d||=1} \bigtriangledown f(\theta_0)^Td$$
First consider the steepest-ascent direction $d^* \in \arg\max_{||d||=1} \bigtriangledown f(\theta_0)^Td$. From the Cauchy-Schwarz inequality we know that $\bigtriangledown f(\theta_0)^Td \le ||\bigtriangledown f(\theta_0)||\,||d||$, with equality when $d=\lambda \bigtriangledown f(\theta_0),\lambda \in \mathbb{R}$. Since $||d||=1$, this implies:
$$d^* = \frac{\bigtriangledown f(\theta_0)}{||\bigtriangledown f(\theta_0)||}$$
Since we aim to minimize the function, we seek $\hat{d} \in \arg\min_{||d||=1} \bigtriangledown f(\theta_0)^Td$, which is simply $-d^*$, so the iteration of steepest descent in the $l_2$ norm is:
$$\theta^{(t+1)} = \theta^{(t)} - \eta^{(t)} \bigtriangledown f(\theta^{(t)}) \tag{9}$$
where $\eta^{(t)}$ is some step size (think of it as some term multiplied by $||\bigtriangledown f(\theta^{(t)})||^{-1}$). Notice that this is the gradient descent algorithm.

Equation (9) says that we compute the partial derivatives $\bigtriangledown f(\theta^{(t)})$, which give the slope of the tangent at the current parameters, and then move in the negative gradient direction, with $\eta$ controlling the size of each step. As the iterations proceed we gradually approach the optimum, and once there is no obvious improvement we regard the current solution as an approximate optimum. A schematic is shown below:

![](https://github.com/WangJengYun/ML-DL-notes/blob/master/Machine%20Learning/images/Background/Grandient%20Descent.png?raw=true)

## Multi-parameter for Gradient Descent
Now suppose we have two parameters and use gradient descent to find the optimum. The mathematics is as follows:

Suppose $f(\theta_1,\theta_2)$ is the objective function, which is differentiable, and $\{\theta_1,\theta_2\}$ are the parameters. Given a particular point $\{\theta_1^{*},\theta_2^*\}$, we would like to find the direction $\{d_1,d_2\}$ such that $f(\theta_1,\theta_2)$ is minimized, so we first write down the Taylor expansion of the loss function:
$$f(\theta_1,\theta_2)\approx f(\theta_1^*,\theta_2^*)+(\theta_1-\theta_1^*)\frac{\partial f(\theta_1^*,\theta_2^*)}{\partial \theta_1}+(\theta_2-\theta_2^*)\frac{\partial f(\theta_1^*,\theta_2^*)}{\partial \theta_2}+...$$
In the following image, $(a,b)$ is the point $(\theta_1^*,\theta_2^*)$. We want to find $\theta_1$ and $\theta_2$ inside the red circle, i.e. satisfying $(\theta_1-\theta_1^*)^2+(\theta_2-\theta_2^*)^2 \le d^2$, that minimize $f(\theta_1,\theta_2)$.

![](https://github.com/WangJengYun/ML-DL-notes/blob/master/Deep%20Learning/image/optimization/gradient%20descent/Multi-parameter%20for%20Gradient%20Descent_1.PNG?raw=true)

Within the red circle, the approximation above is minimized when the inner product of $(\theta_1-\theta_1^*,\theta_2-\theta_2^*)$ with the gradient $(\frac{\partial f(\theta_1^*,\theta_2^*)}{\partial \theta_1},\frac{\partial f(\theta_1^*,\theta_2^*)}{\partial \theta_2})$ is as small as possible, i.e. when the update points in the direction opposite to the gradient:
$$
\begin{bmatrix} \theta_1-\theta_1^* \\ \theta_2-\theta_2^* \end{bmatrix}
=-\eta
\begin{bmatrix} \frac{\partial f(\theta_1^*,\theta_2^*)}{\partial \theta_1} \\ \frac{\partial f(\theta_1^*,\theta_2^*)}{\partial \theta_2} \end{bmatrix}
\Rightarrow
\begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix}
=
\begin{bmatrix} \theta_1^* \\ \theta_2^* \end{bmatrix}
-\eta
\begin{bmatrix} \frac{\partial f(\theta_1^*,\theta_2^*)}{\partial \theta_1} \\ \frac{\partial f(\theta_1^*,\theta_2^*)}{\partial \theta_2} \end{bmatrix}
$$
Note that the first-order Taylor approximation, and hence this update, is only accurate if the red circle (i.e. the learning rate) is small enough.
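To make update rule (9) concrete, below is a minimal sketch of gradient descent on a simple two-parameter quadratic objective, matching the two-parameter case above. The objective `f`, its gradient `grad_f`, the learning rate `eta`, and the stopping tolerance are illustrative assumptions rather than anything specified in this note.

```python
import numpy as np

def f(theta):
    """Illustrative objective: a simple convex quadratic bowl (assumed, not from the note)."""
    return (theta[0] - 3.0) ** 2 + 2.0 * (theta[1] + 1.0) ** 2

def grad_f(theta):
    """Analytic gradient of the objective above."""
    return np.array([2.0 * (theta[0] - 3.0), 4.0 * (theta[1] + 1.0)])

def gradient_descent(theta0, eta=0.1, max_iter=1000, tol=1e-8):
    """Iterate theta <- theta - eta * grad f(theta), i.e. equation (9)."""
    theta = np.asarray(theta0, dtype=float)
    for t in range(max_iter):
        g = grad_f(theta)
        if np.linalg.norm(g) < tol:   # stop once there is no obvious improvement
            break
        theta = theta - eta * g       # move in the negative gradient direction
    return theta

print(gradient_descent([0.0, 0.0]))   # approaches the minimizer (3, -1)
```

Running this drives `theta` toward the minimizer $(3,-1)$; the loop stops once the gradient norm is negligible, mirroring the "no obvious improvement" stopping idea described above.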
## Stochastic Gradient Descent (SGD)
Stochastic gradient descent performs a parameter update for each training example $x^{(i)}$ and $y^{(i)}$:
$$\theta^{t+1} = \theta^{t}-\eta^{t}\bigtriangledown f(\theta^{t};x^{(i)},y^{(i)})$$
SGD uses only a single randomly drawn sample for each parameter update; because the sample is drawn at random, the method is called stochastic gradient descent. The figure below compares ordinary gradient descent with SGD: SGD reaches the neighborhood of the optimum faster than ordinary gradient descent, but if the learning rate is too large the parameters update in a zig-zag pattern, which results in an inefficient path.

![](https://github.com/WangJengYun/ML-DL-notes/blob/master/Deep%20Learning/image/optimization/gradient%20descent/Stochastic%20Gradient%20Descent.PNG?raw=true)

## Other Considerations
### Feature Scaling
During parameter optimization, the scale of the features also affects how the parameters converge. As the figure below shows, when the features are on very different scales, many more iterations are needed before convergence; after the features are normalized, the number of iterations drops considerably. Features should therefore be scaled before running the optimization.

![](https://github.com/WangJengYun/ML-DL-notes/blob/master/Deep%20Learning/image/optimization/gradient%20descent/Feature%20Scaling.PNG?raw=true)
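As an illustration of the per-sample SGD update and of why feature scaling matters, here is a minimal sketch of SGD for linear regression on standardized features. The synthetic data, the `standardize` helper, the learning rate, and the number of epochs are illustrative assumptions, not something taken from this note.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data with very different feature scales (illustrative).
X = np.column_stack([rng.uniform(0, 1, 200), rng.uniform(0, 1000, 200)])
y = 3.0 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(0, 0.1, 200)

def standardize(X):
    """Feature scaling: zero mean, unit variance per column."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def sgd_linear_regression(X, y, eta=0.01, epochs=50):
    """SGD on squared error: one update per randomly drawn training example."""
    n, d = X.shape
    theta = np.zeros(d)
    bias = 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):          # visit the samples in random order
            error = X[i] @ theta + bias - y[i]
            theta -= eta * error * X[i]       # theta <- theta - eta * grad of loss on sample i
            bias -= eta * error
    return theta, bias

theta, bias = sgd_linear_regression(standardize(X), y)
print(theta, bias)   # coefficients on the standardized features
```

With the raw, unscaled `X`, the same learning rate would likely make the updates unstable or much slower to converge, which is the effect the feature-scaling figure above describes.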