As an example, take a data set of a company's salaries versus years of experience. The goal is to use the existing data to predict the salary for a given number of years of experience.

## simple linear regression

To predict anything, start by plotting the distribution of the existing data.

```python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

data_url = "https://raw.githubusercontent.com/GrandmaCan/ML/main/Resgression/Salary_Data.csv"
data = pd.read_csv(data_url)   # fetch the data
print(data)                    # print the data

x_axis = data["YearsExperience"]
y_axis = data["Salary"]

def plt_graph(w, b):
    plt.scatter(x_axis, y_axis, marker="x", c="red")   # scatter plot of the raw data
    plt.title("years-salary")
    plt.xlabel("years")
    plt.ylabel("salary")
    plt.show()
```

**output:**

![image](https://hackmd.io/_uploads/BkpHGZNzR.png)
![image](https://hackmd.io/_uploads/H1e7eW4fA.png)

Before making predictions, we need to fit a regression line to the existing data that represents its overall trend.

![image](https://hackmd.io/_uploads/rJiG7bNM0.png)

Assume the regression line has the form $y=w*x+b$. The first task is to find suitable values of $w$ and $b$ that determine this line.

### cost function

To decide which regression line fits best, we measure how far each data point deviates from the line. The line with the smallest total error over all data points is the best regression line.

![image](https://hackmd.io/_uploads/By3T4bNzR.png)

The **cost function** measures the error between the predicted and actual values. ==To keep negative differences from cancelling out, the squared difference is used.==

**definition:**
$f_{cost}=(y-y_{pred})^2$
$y$ = actual value
$y_{pred}$ = predicted value

```python
def calculate_cost(x_axis, y_axis, w, b):
    y_pred = x_axis * w + b          # pandas broadcasts the operation to every element, so this is also a Series
    cost = (y_axis - y_pred)**2      # cost is a Series of squared errors
    return cost.sum() / len(x_axis)  # sum all the costs and take the mean
```

### gradient descent

If we fix $w$ and plug different values of $b$ into `calculate_cost`, we get the following result:

```python
cost_list_b = []
for b in range(-100, 100):
    cost_list_b.append(calculate_cost(x_axis, y_axis, 10, b))  # w fixed at 10

plt.scatter(range(-100, 100), cost_list_b)  # scatter draws the individual points
plt.title("cost with different b")
plt.xlabel("b")
plt.ylabel("cost")
plt.show()
```

![image](https://hackmd.io/_uploads/H1jpbkYGR.png)

Likewise, fixing $b$ and varying $w$ gives:

```python
cost_list_w = []
for w in range(-100, 100):
    cost_list_w.append(calculate_cost(x_axis, y_axis, w, 10))  # b fixed at 10

plt.plot(range(-100, 100), cost_list_w)  # plot draws a connected line
plt.title("cost with different w")
plt.xlabel("w")
plt.ylabel("cost")
plt.show()
```

![image](https://hackmd.io/_uploads/ryaGfytG0.png)

---

Both plots show a local minimum. Extending this to varying $w$ and $b$ at the same time gives a 3-D surface, and the global minimum of the cost function is the point we are looking for.

**Illustration:**
![image](https://hackmd.io/_uploads/S1xy7kYMC.png)

==The method for finding this minimum is called gradient descent.== We start from a randomly chosen point on the (cost)-(w or b) curve and compute the slope at that point.

**cost function:**
$f_{cost}=(y-wx-b)^2$
$\frac{\partial f}{\partial w}=-2x(y-wx-b)$
$\frac{\partial f}{\partial b}=-2(y-wx-b)$

![image](https://hackmd.io/_uploads/SJNsvkFG0.png)

We also define a ==learning rate $r$==. Each time the slope is computed, ==the point moves horizontally toward lower cost, and the distance moved is the slope times the learning rate $r$. At the local minimum the slope is 0, so the point stops moving.==

**update rule:**
The slope is subtracted so the point always moves downhill (when the slope is negative the parameter increases, and vice versa).
w update: $w \leftarrow w-\frac{\partial f}{\partial w}*r$
b update: $b \leftarrow b-\frac{\partial f}{\partial b}*r$

---

Extending to three dimensions, every pair (w$_i$, b$_i$) maps to a cost$_i$:

```python
cost = np.zeros((201, 201))  # array holding every cost value
i = 0
for w in range(-100, 101):
    j = 0
    for b in range(-100, 101):
        cost[i][j] = calculate_cost(x_axis, y_axis, w, b)
        j += 1
    i += 1
```

Next, build the 3-D chart:

```python
ax = plt.axes(projection="3d")  # create a 3-D plot
ax.set_title("w-b-cost")
ax.set_xlabel("w")
ax.set_ylabel("b")
ax.set_zlabel("cost")
ax.xaxis.set_pane_color("white")
ax.yaxis.set_pane_color("white")
ax.zaxis.set_pane_color("white")

ws = np.arange(-100, 101)
bs = np.arange(-100, 101)
b_grid, w_grid = np.meshgrid(bs, ws)  # meshgrid builds the coordinate matrices
ax.plot_surface(w_grid, b_grid, cost)

w_index, b_index = np.where(cost == np.min(cost))  # indices where cost equals its minimum
print(w_index, b_index)
print(ws[w_index], bs[b_index])  # the w, b values at the minimum cost
```

:::warning
**extra:**
np.meshgrid() converts two 1-D arrays into coordinate matrices.
![image](https://hackmd.io/_uploads/ryEu1eKf0.png)
:::
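To make the note above concrete, here is a minimal sketch of what `np.meshgrid` returns for two tiny arrays (the values here are arbitrary illustrations, not the salary data):

```python
import numpy as np

ws = np.array([0, 1, 2])
bs = np.array([10, 20])

# same call pattern as above: pair every b with every w
b_grid, w_grid = np.meshgrid(bs, ws)

print(w_grid)  # [[0 0], [1 1], [2 2]]        -> the w value of each grid cell
print(b_grid)  # [[10 20], [10 20], [10 20]]  -> the b value of each grid cell
# (w_grid[i][j], b_grid[i][j]) is the (w, b) coordinate of grid cell (i, j),
# which is exactly the shape plot_surface expects.
```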
**The resulting cost surface:**
![image](https://hackmd.io/_uploads/SJa21xFzC.png)

The method that computes the gradient splits it into the slope along the w direction and the slope along the b direction:

```python
def calculate_gradient(x, y, w, b):
    w_grad = (-2 * x * (y - w * x - b)).sum() / len(x)  # slope along the w direction
    b_grad = (-2 * (y - w * x - b)).sum() / len(x)      # slope along the b direction
    return w_grad, b_grad
```

:::warning
**tip:**
The factor of 2 from differentiation can be dropped, since it is absorbed into the learning rate.
:::

Define the gradient descent step:

```python
def gradient_descent(w, b, learning_rate):
    w_grad, b_grad = calculate_gradient(x_axis, y_axis, w, b)
    w = w - w_grad * learning_rate  # update w
    b = b - b_grad * learning_rate  # update b
    return w, b
```

Run a for loop for 20,001 iterations and print w, b, and the cost every 500 iterations:

```python
w = 0
b = 0
for i in range(20001):
    w, b = gradient_descent(w, b, 0.001)  # learning rate set to 0.001
    if i % 500 == 0:
        print(f"w={w:.2f},b={b:.2f},cost={calculate_cost(x_axis,y_axis,w,b):.2f},count={i}")
```

```
Output:
w=0.87,b=0.15,cost=5286.08,count=0
w=12.12,b=8.07,cost=140.90,count=500
w=11.42,b=12.74,cost=96.02,count=1000
w=10.88,b=16.32,cost=69.70,count=1500
w=10.47,b=19.06,cost=54.28,count=2000
w=10.15,b=21.16,cost=45.23,count=2500
w=9.91,b=22.76,cost=39.93,count=3000
w=9.73,b=23.99,cost=36.82,count=3500
w=9.59,b=24.93,cost=34.99,count=4000
w=9.48,b=25.66,cost=33.92,count=4500
w=9.39,b=26.21,cost=33.30,count=5000
w=9.33,b=26.63,cost=32.93,count=5500
w=9.28,b=26.95,cost=32.72,count=6000
w=9.25,b=27.20,cost=32.59,count=6500
w=9.22,b=27.39,cost=32.51,count=7000
w=9.20,b=27.54,cost=32.47,count=7500
w=9.18,b=27.65,cost=32.45,count=8000
w=9.17,b=27.73,cost=32.43,count=8500
w=9.16,b=27.80,cost=32.42,count=9000  # converged
w=9.15,b=27.85,cost=32.42,count=9500
w=9.14,b=27.89,cost=32.41,count=10000
w=9.14,b=27.92,cost=32.41,count=10500
w=9.13,b=27.94,cost=32.41,count=11000
w=9.13,b=27.95,cost=32.41,count=11500
w=9.13,b=27.97,cost=32.41,count=12000
w=9.13,b=27.98,cost=32.41,count=12500
w=9.13,b=27.99,cost=32.41,count=13000
w=9.13,b=27.99,cost=32.41,count=13500
w=9.13,b=28.00,cost=32.41,count=14000
w=9.13,b=28.00,cost=32.41,count=14500
w=9.13,b=28.00,cost=32.41,count=15000
w=9.12,b=28.00,cost=32.41,count=15500
w=9.12,b=28.01,cost=32.41,count=16000
w=9.12,b=28.01,cost=32.41,count=16500
w=9.12,b=28.01,cost=32.41,count=17000
w=9.12,b=28.01,cost=32.41,count=17500
w=9.12,b=28.01,cost=32.41,count=18000
w=9.12,b=28.01,cost=32.41,count=18500
w=9.12,b=28.01,cost=32.41,count=19000
w=9.12,b=28.01,cost=32.41,count=19500
w=9.12,b=28.01,cost=32.41,count=20000
```

After roughly 9,000 iterations the cost has essentially converged to 32.41, and the parameters settle at w = 9.12, b = 28.01.

![image](https://hackmd.io/_uploads/B1b3LXFf0.png)

**Movement trajectory on the 3-D graph:**

Starting from w = 0, b = 0:
![image](https://hackmd.io/_uploads/H17wqQtfR.png)

Starting from w = -50, b = 0:
![image](https://hackmd.io/_uploads/rJ0c57tz0.png)

:::danger
**The learning rate must not be too large or too small**
If it is too large, the cost oscillates back and forth along the horizontal axis and never converges to the local minimum.
If it is too small, each step is tiny, so convergence is very slow and training may stop before reaching the true local minimum.
![image](https://hackmd.io/_uploads/HyCDwmtM0.png)
:::

Back to the original problem: with w and b determined, the regression line is
$y=w*x+b=9.12x+28.01$
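With the line determined, predicting the salary for a specific number of years is just a matter of substituting x. A minimal sketch (the 5.0- and 1.5-year queries are arbitrary illustrative values, not from the original data):

```python
w, b = 9.12, 28.01  # parameters found by gradient descent above

def predict_salary(years):
    """Predict the salary for a given number of years of experience."""
    return w * years + b

print(predict_salary(5.0))  # 9.12 * 5.0 + 28.01 = 73.61
print(predict_salary(1.5))  # 9.12 * 1.5 + 28.01 = 41.69
```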
## multiple linear regression

When several features have to be considered together, multiple linear regression is used.

$y=w_1x_1+w_2x_2+w_3x_3....+b$

Raw data:

![image](https://hackmd.io/_uploads/Sk2iez9GC.png)

Besides the original YearsExperience, there are now EducationLevel and City features. ==These two features are not numeric, so the data has to be preprocessed first (label encoding / one-hot encoding).==

### label encoding
* If a feature's categories have an order that numbers can express, label encoding is enough.
* Each category of the feature is mapped to a number according to its rank.

```python
data["EducationLevel"] = data["EducationLevel"].map({"高中以下": 0, "大學": 1, "碩士以上": 2})
```

![image](https://hackmd.io/_uploads/ByMgEf9f0.png)

### one hot encoding
* If a feature's categories have no meaningful order, one-hot encoding is needed.
* Each category of the feature becomes its own 0/1 column, adding new columns (raising the dimension).

```python
from sklearn.preprocessing import OneHotEncoder
```

```python
oneHotEncoder = OneHotEncoder()    # create an OneHotEncoder object
oneHotEncoder.fit(data[["City"]])  # note: a 2-D array is required
transformed_city = oneHotEncoder.transform(data[["City"]]).toarray()  # convert the sparse matrix to an ordinary matrix
data[["cityA", "cityB", "cityC"]] = transformed_city  # split the original City into cityA, cityB, cityC
data = data.drop(["City"], axis=1)  # drop the original City column
data
```

![image](https://hackmd.io/_uploads/S1c1oG5GR.png)

Sometimes one of the one-hot columns can be dropped, because it can be inferred from the others (cityC is known from cityA and cityB). Note, however, that this is not appropriate in every situation.

![image](https://hackmd.io/_uploads/HkDx3DqG0.png)

---

```python
from sklearn.model_selection import train_test_split
```

The train_test_split method splits the data into a training set and a test set:

```
train_test_split(x, y, test_size...)
x can hold several features, usually a 2-D array; each row of x corresponds to one y value (a 1-D array)
test_size sets the proportion used as the test set
```

```python
x = data[["YearsExperience", "EducationLevel", "cityA", "cityB"]]  # 2-D array!!
y = data["Salary"]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)  # 0.2 of the data becomes the test set; the split is random every time
x_train = x_train.to_numpy()  # convert to numpy arrays for easier computation
y_train = y_train.to_numpy()
x_train
```

![image](https://hackmd.io/_uploads/Bk95Zu9GR.png)

To compute the $y$ of $y=w_1x_1+w_2x_2+w_3x_3....+b$:

```python
w = np.array([1, 2, 3, 4])  # arbitrary w (numpy array)
b = 1                       # arbitrary b
(x_train * w).sum(axis=1)   # element-wise multiply, then sum across each row: w1x1 + w2x2 + w3x3 ...
y_pred = (x_train * w).sum(axis=1) + b
```

![image](https://hackmd.io/_uploads/BkGW7ucM0.png)
![image](https://hackmd.io/_uploads/B1NcmO5zR.png)

---

In multiple linear regression we likewise need a cost function to evaluate the regression line, and gradient descent to find the best combination of weights over the features (YearsExperience, EducationLevel, city...).

:::warning
**cost function:**
**definition:**
$f_{cost}=(y-y_{pred})^2$
$y$ = actual value
$y_{pred}$ = predicted value
:::

Here, however, $y_{pred}=w_1x_1+w_2x_2+w_3x_3...+b=(\sum_{i=1}^{n}w_ix_i)+b$

$f_{cost}=[y-(w_1x_1+w_2x_2+w_3x_3...+b)]^2$

Finding the minimum of the cost function again uses gradient descent. Instead of the two slopes cost-w and cost-b from before, we now need five slopes: one for each of the four weights and one for b.

:::info
In short, there are five graphs — cost-w1, cost-w2, cost-w3, cost-w4, and cost-b — and we look for suitable [w1, w2, w3, w4] and b.
:::

$f_{cost}=[y-(w_1x_1+w_2x_2+w_3x_3...+b)]^2$
$\frac{\partial f}{\partial w_1}=2x_1(y_{pred}-y)$
$\frac{\partial f}{\partial w_2}=2x_2(y_{pred}-y)$
$\frac{\partial f}{\partial w_3}=2x_3(y_{pred}-y)$
$\frac{\partial f}{\partial w_4}=2x_4(y_{pred}-y)$
$\frac{\partial f}{\partial b}=2(y_{pred}-y)$

```python
def calculate_cost(x, y, w, b):
    y_pred = (x * w).sum(axis=1) + b
    cost = ((y - y_pred)**2).mean()
    return cost
```

```python
def calculate_gradient(x, y, w, b):
    y_pred = (x * w).sum(axis=1) + b
    b_gradient = (y_pred - y)
    w_gradient = np.zeros(x.shape[1])
    for i in range(x.shape[1]):
        w_gradient[i] = (x[:, i] * (y_pred - y)).mean()  # slope along each w_i direction
    return w_gradient, b_gradient.mean()
```

```python
def gradient_descent(w, b, learning_rate):
    cost = calculate_cost(x_train, y_train, w, b)
    w_grad, b_grad = calculate_gradient(x_train, y_train, w, b)
    w = w - w_grad * learning_rate  # update w
    b = b - b_grad * learning_rate  # update b
    if i % 100 == 0:  # i is the counter of the training loop below
        print(f"w={w},b={b:.3f},cost={cost:.3f},i={i}")
    return w, b
```

```python
w = np.array([1, 2, 3, 4])
b = 0
learning_rate = 0.01
np.set_printoptions(precision=3)

for i in range(10000):
    w, b = gradient_descent(w, b, learning_rate)

w_final = w
b_final = b
y_pred = (w_final * x_test).sum(axis=1) + b_final
frame = pd.DataFrame({  # compare predictions with the test set
    "pred": y_pred,
    "test": y_test
})
```

```
w=[3.19 2.511 3.062 4.12 ],b=0.378,cost=1628.915,i=0
w=[ 4.442 11.28 3.656 4.031],b=5.312,cost=127.206,i=100
w=[ 3.18 15.711 4.638 3.746],b=8.367,cost=62.053,i=200
w=[ 2.412 18.161 5.658 3.488],b=10.483,cost=37.056,i=300
w=[ 1.933 19.546 6.55 3.269],b=11.957,cost=26.503,i=400
w=[ 1.626 20.349 7.269 3.09 ],b=12.988,cost=21.714,i=500
w=[ 1.426 20.827 7.825 2.945],b=13.711,cost=19.435,i=600
w=[ 1.293 21.12 8.243 2.831],b=14.219,cost=18.319,i=700
w=[ 1.204 21.305 8.552 2.742],b=14.576,cost=17.764,i=800
w=[ 1.143 21.424 8.779 2.672],b=14.827,cost=17.485,i=900
w=[ 1.101 21.503 8.944 2.617],b=15.004,cost=17.344,i=1000
w=[ 1.072 21.556 9.063 2.575],b=15.129,cost=17.273,i=1100
w=[ 1.052 21.593 9.149 2.543],b=15.217,cost=17.236,i=1200
w=[ 1.038 21.618 9.211 2.518],b=15.278,cost=17.218,i=1300
w=[ 1.028 21.636 9.255 2.498],b=15.322,cost=17.209,i=1400
w=[ 1.021 21.648 9.287 2.483],b=15.353,cost=17.204,i=1500
w=[ 1.016 21.657 9.31 2.472],b=15.374,cost=17.201,i=1600
w=[ 1.013 21.664 9.326 2.463],b=15.389,cost=17.200,i=1700
w=[ 1.011 21.668 9.338 2.456],b=15.400,cost=17.199,i=1800  # converged
w=[ 1.009 21.672 9.346 2.451],b=15.408,cost=17.199,i=1900
```

Result:

![image](https://hackmd.io/_uploads/rk3_hj3zR.png)

---
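As a sanity check on the hand-rolled gradient descent above, the same model can be fit with scikit-learn's built-in estimator. A minimal sketch, assuming the `x_train`, `y_train`, and `x_test` produced by the split above (the numbers will vary slightly between runs because the split is random):

```python
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(x_train, y_train)        # closed-form least-squares fit on the training set

print(reg.coef_)                 # one weight per feature; comparable to w_final
print(reg.intercept_)            # comparable to b_final
y_pred_sk = reg.predict(x_test)  # predictions on the test set
```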
### feature scaling

When several features are considered at once, $y=w_1x_1+w_2x_2+w_3x_3....+b$, a large difference between the ranges of $x_1$, $x_2$, $x_3$ hurts the search for the minimum of the cost function.

**Reference figure:**
![image](https://hackmd.io/_uploads/H1RR2eaMC.png)
(Image source: Andrew Ng's machine learning course)

When the range of $x$ is large, the useful range of $w$ is small; conversely, when the range of $x$ is small, the range of $w$ is large. Plotting the cost over the $w_i$ therefore gives elongated elliptical contours, and the path toward the minimum can bounce back and forth as shown above, lowering the efficiency of gradient descent.

The fix is to bring all features onto a common numeric scale. There are two common methods.

**(1) Normalization:**
$x_{norm}=\frac{x-x_{min}}{x_{max}-x_{min}}$

**(2) Standardization:**
$x_{std}=\frac{x-\mu_x}{\sigma_x}$

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(x_train)                  # compute mean and standard deviation from the training set
x_train = scaler.transform(x_train)  # apply the same transformation to both sets
x_test = scaler.transform(x_test)
```
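To see exactly what `StandardScaler` does, the same transformation can be written out by hand. A minimal sketch, assuming `x_train` and `x_test` are the unscaled arrays from the split above; note that the mean and standard deviation come from the training set only and are reused on the test set:

```python
import numpy as np

mu = x_train.mean(axis=0)    # per-feature mean, computed from the training set only
sigma = x_train.std(axis=0)  # per-feature standard deviation (population std, like StandardScaler)

x_train_std = (x_train - mu) / sigma  # x_std = (x - mu) / sigma
x_test_std = (x_test - mu) / sigma    # the test set reuses the training statistics
```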
**Before standardization:**
```
w=[3.19 2.511 3.062 4.12 ],b=0.378,cost=1628.915,i=0
w=[ 4.442 11.28 3.656 4.031],b=5.312,cost=127.206,i=100
w=[ 3.18 15.711 4.638 3.746],b=8.367,cost=62.053,i=200
w=[ 2.412 18.161 5.658 3.488],b=10.483,cost=37.056,i=300
w=[ 1.933 19.546 6.55 3.269],b=11.957,cost=26.503,i=400
w=[ 1.626 20.349 7.269 3.09 ],b=12.988,cost=21.714,i=500
w=[ 1.426 20.827 7.825 2.945],b=13.711,cost=19.435,i=600
w=[ 1.293 21.12 8.243 2.831],b=14.219,cost=18.319,i=700
w=[ 1.204 21.305 8.552 2.742],b=14.576,cost=17.764,i=800
w=[ 1.143 21.424 8.779 2.672],b=14.827,cost=17.485,i=900
w=[ 1.101 21.503 8.944 2.617],b=15.004,cost=17.344,i=1000
w=[ 1.072 21.556 9.063 2.575],b=15.129,cost=17.273,i=1100
w=[ 1.052 21.593 9.149 2.543],b=15.217,cost=17.236,i=1200
w=[ 1.038 21.618 9.211 2.518],b=15.278,cost=17.218,i=1300
w=[ 1.028 21.636 9.255 2.498],b=15.322,cost=17.209,i=1400
w=[ 1.021 21.648 9.287 2.483],b=15.353,cost=17.204,i=1500
w=[ 1.016 21.657 9.31 2.472],b=15.374,cost=17.201,i=1600
w=[ 1.013 21.664 9.326 2.463],b=15.389,cost=17.200,i=1700
w=[ 1.011 21.668 9.338 2.456],b=15.400,cost=17.199,i=1800  # converged
w=[ 1.009 21.672 9.346 2.451],b=15.408,cost=17.199,i=1900
```
![image](https://hackmd.io/_uploads/r1n9qMTG0.png)

**After standardization:**
```
w=[1.07 2.152 2.889 3.957],b=0.477,cost=2532.122,i=0
w=[ 3.5 9.282 -1.515 0.377],b=30.390,cost=335.767,i=100
w=[ 3.386 10.851 -2.025 -1.278],b=41.339,cost=59.739,i=200
w=[ 3.149 11.331 -2.12 -1.898],b=45.347,cost=22.935,i=300
w=[ 3. 11.506 -2.157 -2.12 ],b=46.814,cost=17.979,i=400
w=[ 2.92 11.577 -2.177 -2.199],b=47.351,cost=17.306,i=500
w=[ 2.88 11.606 -2.188 -2.227],b=47.547,cost=17.214,i=600
w=[ 2.859 11.619 -2.194 -2.238],b=47.619,cost=17.201,i=700
w=[ 2.849 11.624 -2.197 -2.242],b=47.646,cost=17.199,i=800  # converged
w=[ 2.844 11.627 -2.198 -2.243],b=47.655,cost=17.199,i=900
w=[ 2.842 11.628 -2.199 -2.244],b=47.659,cost=17.199,i=1000
w=[ 2.841 11.629 -2.199 -2.244],b=47.660,cost=17.199,i=1100
w=[ 2.84 11.629 -2.199 -2.244],b=47.660,cost=17.199,i=1200
w=[ 2.84 11.629 -2.199 -2.244],b=47.661,cost=17.199,i=1300
```
![image](https://hackmd.io/_uploads/HyyboMaf0.png)

:::info
Feature scaling speeds up the convergence of gradient descent: here the cost reaches 17.199 after about 800 iterations instead of about 1,800.
:::

## reference
[【QA】代價函數到底是什麼? 跟損失函數不一樣嗎?](https://www.cupoy.com/qa/club/ai_tw/0000016D6BA22D97000000016375706F795F72656C656173654B5741535354434C5542/0000017D0854F9B9000000016375706F795F72656C656173655155455354)
[理解numpy中的meshgrid()方法](https://wangyeming.github.io/2018/11/12/numpy-meshgrid/)
[Gradient Descent in Machine Learning: A Basic Introduction](https://builtin.com/data-science/gradient-descent)