[Note] CS229 Lecture 2 : Linear Regression and Gradient Descent

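# Part I Supervised Learning

---

Previous note: https://hackmd.io/@ChiuChiuCircle/cs229_1

---

Online lecture: https://youtube.com/playlist?list=PLoROMvodv4rMiGQp3WXShtMGgzqpfVfbU&si=vuAUi_ytVYB1Jc8V
Main notes: https://cs229.stanford.edu/main_notes.pdf
Other resources: https://github.com/maxim5/cs229-2018-autumn/tree/main

---

## Cost Function
Continuing with this dataset:
![IMG_0740](https://hackmd.io/_uploads/H1rrk_Jop.jpg =70%x)
Remember this equation?
$$h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2$$
Here $x$ holds the features and $\theta$ holds the weights. If we set $x_0=1$ so that $\theta_0$ acts as the intercept, we can write it as:
$$h(x)=\sum^d_{i=0}\theta_ix_i=\theta^Tx$$
>$d:$ dimension (number of features); in this dataset $d=2$
>$\theta^T:$ the **transpose** of $\theta$
>
>:::spoiler What is a transpose? (animation)
>![Matrix_transpose](https://hackmd.io/_uploads/r1yXOOJop.gif)
>Flip the matrix over its main diagonal, so rows become columns.
>:::

The last equality just writes the sum of the $\theta_ix_i$ terms as a product of two vectors:
$$h(x)=\sum^d_{i=0}\theta_ix_i=\theta^Tx=\begin{bmatrix}\theta_0\\\theta_1\\\vdots\\\theta_d\end{bmatrix}^T\begin{bmatrix}x_0\\x_1\\\vdots\\x_d\end{bmatrix}$$
To optimize $h_\theta(x)$, we define another function:
$$J(\theta)=\frac{1}{2}\sum^m_{i=1}(h_\theta(x^{(i)})-y^{(i)})^2$$
>$m:$ number of training examples

This is the **cost function**. $h_\theta(x^{(i)})$ is the prediction we get after plugging in the features $x^{(i)}$, so $h_\theta(x^{(i)})-y^{(i)}$ is the residual (the difference between the predicted value and the actual value). We then square it and multiply by $\frac{1}{2}$. Notice that this is just the familiar **least squares** formula. The factor $\frac{1}{2}$ cancels the 2 that appears when we differentiate the square; a statistical explanation will come later.

---

## LMS algorithm
Our goal now is to minimize $J(\theta)$, which is what the **least mean squares (LMS)** approach does. First we **initialize $\theta$ randomly**, then repeatedly adjust $\theta$ so that $h_\theta(x^{(i)})$ gets as close as possible to $y^{(i)}$, which makes $J(\theta)$ smaller and smaller. Concretely, there is an algorithm called **gradient descent**, which adjusts $\theta$ as follows:
$$\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)$$
>$\partial:$ partial derivative symbol, read as "der" or "del"
>$\alpha:$ learning rate
>$:=\;:$ "set ... to ...", i.e. assignment in a program; note that it is not an equals sign $(=)$

All the $\theta_j$ $(j=0,1,...,d)$ are updated simultaneously. $\alpha$ is a floating-point number that controls how large each update to $\theta$ is. **The larger $\alpha$ is, the larger each update.** But if $\alpha$ is too large, $J(\theta)$ can overshoot and never settle at the optimum (like running so fast that you cannot stop at the exact spot). If $\alpha$ is too small, learning is slow and inefficient.
>:::spoiler Learning rate animation (to be added)
>
>:::

To see how that term on the right works, let's do some math first (with a single training example, to keep $J(\theta)$ simple):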
$$\begin{aligned}
\frac{\partial}{\partial\theta_j}J(\theta) &=\frac{\partial}{\partial\theta_j}\frac{1}{2}(h_\theta(x)-y)^2\\
&=2\cdot\frac{1}{2}(h_\theta(x)-y)\cdot\frac{\partial}{\partial\theta_j}(h_\theta(x)-y)\\
&=(h_\theta(x)-y)\cdot\frac{\partial}{\partial\theta_j}\left(\sum^d_{i=0}\theta_ix_i-y\right)\\
&=(h_\theta(x)-y)x_j
\end{aligned}$$
So for a single training example $(x^{(i)},y^{(i)})$, the update for $\theta$ becomes:
$$\theta_j:=\theta_j+\alpha(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)}$$
This is called the **LMS update rule**, also known as the **Widrow-Hoff learning rule**. Intuitively, **the further the prediction is from the actual value, the larger the change to $\theta$.**
For a training set with $m$ examples, we sum over all of them and repeat the following update until $\theta$ converges:
$$\theta:=\theta+\alpha\sum^m_{i=1}(y^{(i)}-h_\theta(x^{(i)}))x^{(i)}$$
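
To make the update rule concrete, here is a minimal Python sketch of the summed update above, repeated for a fixed number of iterations. It is only an illustration under a few assumptions that are not in the lecture note: the function names (`predict`, `cost`, `batch_gradient_descent`), the toy data, the values of `alpha` and `n_iters`, and starting from zeros rather than a random $\theta$ are all made up for this example.

```python
def predict(theta, x):
    """h_theta(x) = theta^T x, where x already includes x_0 = 1."""
    return sum(t * xi for t, xi in zip(theta, x))


def cost(theta, xs, ys):
    """J(theta) = 1/2 * sum of squared residuals over the training set."""
    return 0.5 * sum((predict(theta, x) - y) ** 2 for x, y in zip(xs, ys))


def batch_gradient_descent(xs, ys, alpha=0.01, n_iters=5000):
    """Apply the summed LMS update n_iters times and return theta."""
    # Assumption: start from zeros to keep the example deterministic.
    theta = [0.0] * len(xs[0])
    for _ in range(n_iters):
        # Compute every (y - h_theta(x)) with the current theta first,
        # so that all theta_j are updated simultaneously.
        errors = [y - predict(theta, x) for x, y in zip(xs, ys)]
        theta = [
            t + alpha * sum(e * x[j] for e, x in zip(errors, xs))
            for j, t in enumerate(theta)
        ]
    return theta


# Hypothetical toy data generated from y = 1 + 2 * x_1 (not the note's dataset).
# The first entry of each x is the intercept term x_0 = 1.
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [1.0, 3.0, 5.0, 7.0]

theta = batch_gradient_descent(xs, ys)
print(theta, cost(theta, xs, ys))
```

On this toy data, `theta` should approach `[1.0, 2.0]` and the cost should approach 0. Raising `alpha` well above the value used here makes the updates overshoot and diverge, which matches the warning about an overly large learning rate.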