Logistic Regression : 巨星和矮星的分類

# 0. 前言 <img src="https://upload.wikimedia.org/wikipedia/commons/1/10/Blue_giant_star_artistic_recreation-bpk.jpg" height =ˇ200; style="display:block; margin-left:0; width:100%" /> (一顆漂亮的藍色巨星，[圖片來源](<https://upload.wikimedia.org/wikipedia/commons/1/10/Blue_giant_star_artistic_recreation-bpk.jpg> "https://upload.wikimedia.org/wikipedia/commons/1/10/Blue_giant_star_artistic_recreation-bpk.jpg")) Logistic Regression，一個廣泛被用於許多領域(金融、社會學等...)的模型，主要用於分類問題。今天就跟著筆者豪豪，讓我們以[Kaggle 上巨星和矮星的資料](<https://www.kaggle.com/vinesmsuic/star-categorization-giants-and-dwarfs> "https://www.kaggle.com/vinesmsuic/star-categorization-giants-and-dwarfs")，來看看這個模型吧。當巨星(Giant star)和矮星(Dwarf star) 有著相同的表面溫度時，巨星的亮度會比矮行星大。因此我們應該可以根據亮度(Luminosity)和表面溫度(surface temperture)來進行巨星和矮星的分類。在天文學中，我們可以藉由色指數(B-V color index)來衡量恆星的表面溫度，但恆星亮度會隨著距離我們遠近而受到影響，因此有什麼東西可以用來較客觀地量化亮度呢？答案是絕對星等(Absolute magnitude)，因為絕對星等相當於把天體放在相同的距離，去對亮度做比較。因此有了這兩項衡量單位，我們就可以試著使用Logistic regression，去對巨星和矮星做分類囉。簡而言之，底下是我們的目的 :arrow_down: :::info **使用 Logistic regression，猜測 B-V color index 和 Absolute Magnitude 的 Decision Boundary，進而分類巨星和矮星。** &ensp; 資料來源：https://www.kaggle.com/vinesmsuic/star-categorization-giants-and-dwarfs P.S. Decision Boundary 詳見後面介紹 ::: 廢話不多說，就讓我們馬上開始吧~~~ # 1. 資料分析流程底下是我們的資料分析流程 :::info Step 1. 讀取原始資料 ::: :::warning Step 2. 使用Python畫出資料的散佈圖，猜測 Decision Boundary ::: :::danger Step 3. 對資料進行 Logistic regression，分類巨星和矮星 ::: :::success Step 4. 藉由 Test Data 評估準確率。 ::: ## 1.1 資料讀取首先我們先下載資料，來源裡的「Star3642_balanced.csv」，打開後可以發現裡面有 3642顆恆星的features，欄位如下(欄位描述詳見資料連結網址): | 欄位名稱|Vmag | Plx |e_Plx|B-V|SpType|Amag|TargetClass| | - | - | - |-|-|-|-|-| | 描述| 視星等|地球和恆星的距離| 地球和恆星的距離的標準誤差 |色指數|恆星光譜|絕對星等|是否為巨星，0 代表矮星，1代表巨星| 再來使用下方的 Python Code 進行讀取並觀察資料; ```python import numpy as np import pandas as pd import matplotlib.pyplot as plt # 檔案存放路徑 trainingSet_Path = 'Data\Star3642_balanced.csv' # 使用 panda.read_csv 方法讀取檔案路徑內的資料 star_Df = pd.read_csv(trainingSet_Path) # 觀察資料 print('資料總筆數:', len(star_Df)) print('取得資料欄位名稱:', star_Df.columns) print('前五筆資料如下:') print(star_Df.head(5)) ``` 觀察結果如下: ![](https://hackmd.io/_uploads/BJJ3-5_i2.png) # 2. 畫出資料散佈圖並猜測 Decision Boundary ## 2.1 畫出資料散佈圖接著我們可以將散佈圖畫出來進行觀察 ```python import numpy as np import pandas as pd import matplotlib.pyplot as plt # 訓練集檔案存放路徑 trainingSet_Path = 'Data\Star3642_balanced.csv' # 使用 panda.read_csv 方法讀取檔案路徑內的資料 star_Df = pd.read_csv(trainingSet_Path) # 觀察資料 print('資料總筆數:', len(star_Df)) print('取得資料欄位名稱:', star_Df.columns) print('前五筆資料如下:') print(star_Df.head(5)) # 使用 matplotlib 套件繪製訓練集資料 for i in range(len(star_Df)): # 色指數, bv-color-index bv = star_Df.loc[i,"B-V"] # 絕對星等, absolute magnitude abMag = star_Df.loc[i,"Amag"] # 是否是巨星，1 代表是巨星，反之 0 代表是矮星 isGiant = star_Df.loc[i,"TargetClass"] # 是巨星的資料標計為紅色，反之為矮星，標記為藍色 color='' if(isGiant == 1): color = 'ro' else: color = 'bo' # 將資料畫在畫布上 plt.plot(bv, abMag, color, markersize = 2) # 標記 X 軸和 Y軸名稱 plt.xlabel('B_V') plt.ylabel('AbMag') # 圖示方框 plt.plot(1.318,16.678352,'bo',markersize=2,label='Drawf') plt.plot(-0.015,15.324928,'ro',markersize=2,label='Giant') plt.legend(borderpad=1) # 將畫布 Show 出來 plt.show() ``` 畫出來的結果如下: ![](https://i.imgur.com/sex4JC0.png) ## 2.2 What is Decision Boundary? 什麼是 Decision Boundary 呢? 簡單來說，它是一個可以用來進行分類的超曲面(hyperSurface)。這樣講實際上還是蠻抽象的，但如果以下圖來看，就可以看出橢圓形狀的邊界徹底將裡面的圖案分成兩類，例如橢圓邊界內部的就歸類為藍色的 :large_blue_circle: ，邊界外部歸類為紅色的:heavy_multiplication_x:。像這樣的邊界，就可以被稱作 Decision Boundary。 ![](https://hackmd.io/_uploads/HkeXvCdon.png) ## 2.3 Decision Boundary 猜測觀察了散佈圖之後，我們發現可以簡單地拿下方 綠色虛線 去當作 Decision Boundary，將綠色虛線上方的點分類為巨星，下方的點分類為矮星，則應該可以將多數的訓練點成功分類。 ![](https://i.imgur.com/keduQLi.png) 因此我們應該可以假設此綠色虛線(Decision Boundary)如下，並以此條直線去做 logistic regression。 :::info $$\mathbf{\theta_0+\theta_1(bv)+\theta_2(abMag)=0},\ \text{for some}\ \theta_0,\ \theta_1\ \text{and} \ \theta_2\in\mathbb{R}.$$ ::: 求出 $\theta_0,\ \theta_1,$ 和 $\theta_2$ 的估計值後，就可以依此 decision Boundary 配合下面的函數進行分類囉。 $L(bv, abMag) = \begin{cases} 1, & \quad \text{if } h(bv,abMag)\geq0\\ 0, &\quad otherwise \end{cases}, where \ h(bv,abMag)=\theta_0+\theta_1(bv)+\theta_2(abMag).$ # 3. Logistic Regression ## 3.1 Logistic function Logistic Regression 主要會用到的 function 為 $$f(z) = \displaystyle\frac{1}{1+e^{-z}},\ z \in\mathbb{R}.$$ ![](https://hackmd.io/_uploads/B1XZWt0in.png) 從此函數的圖形可以觀察到下列的事實 :::success $f(0)=0.5,\ f(z)>0.5\ \text{when}\ z >0,\ f(z)\leq0.5\ \text{when}\ z \leq 0.$ ::: 若此時我們很直覺地將函數 $f(z)$ 的值視為機率，機率大於 0.5 的分成一類(標註為 1)，機率小於0.5的分成一類(標註為 0 )並令$$z(bv,abMag)=\mathbf{\theta_0+\theta_1(bv)+\theta_2(abMag)},$$則會發現若 $z(bv,abMag)>0$ 時，會將其分類為 $1$，反之則分類為 $0$，如下圖， ![](https://hackmd.io/_uploads/SyK93t0i3.png) 我們可以發現若找出來的綠色斜線如上圖為例，則有多數的藍色的點因為在斜線上方，所以被歸類為巨星，跟我們想要的分類結果不太吻合，因此我們知道我們必須找到適當的參數 $\theta_0,\theta_1,\theta_2$，才可以較"精準"的分類巨星和矮星。 ## 3.2 Hypothesis function and Cost function (TODO: Python 程式) Logistic Regression 和 Hypothesis function $h_{\theta}$ 和 Loss function $L(\theta)$ 分別為 $$h_{\theta}(x)=\frac{1}{1+e^{-\theta\cdot x}}\ ,\theta=(\theta_0,\theta_1,\theta_2)\ \text{and}\ x=(1,bv,abMag),$$ $$L(\theta)=-\frac{1}{N}\sum_{i=1}^{N}[ y_i\log(h_{\theta}(x_i))+(1-y_i)log(1-h_{\theta}(x_i)) ]，$$ 其中 $\theta=(\theta_0,\theta_1,\theta_2),\ x=(1,bv,abMag),\ N$ 為資料總筆數，$x_i=(1,bv_{(i)},abMag_{(i)}),\ bv_{(i)} 和\ abMag_{(i)}$ 分別為第 $i$ 筆資料的色指數, 絕對星等。 $y_i=1$ 如果第 $i$ 筆資料為 Giant，反之則為 0 。為了找到適當的參數 $\theta_0,\theta_1$ 和 $\theta_2$，我們必須求得 $$\displaystyle\arg\min_{\theta}J(\theta).$$ 而因為 Loss function $L$ 是 convex function，所以一定存在極值 $\theta$ 使得 $J(\theta)$ 最小，此時我們可以藉由 Stochastic Gradient Descent 來幫助我們求得 $\theta=(\theta_0,\theta_1,\theta_2)$ 的近似值。令 $l=[y\log(h_{\theta}(x))+(1-y)log(1-h_{\theta}(x))]$，則 $$\begin{align*} \displaystyle\frac{\partial l }{\partial \theta_0} &= y\frac{\frac{\partial h}{\partial \theta_0}}{h_\theta}+(1-y)\frac{\frac{\partial h}{\partial \theta_0}}{h_\theta-1} \\&= \frac{\partial h}{\partial \theta_0}[\frac{y}{h_{\theta}}+\frac{1-y}{h_{\theta}-1}] \\ &=\frac{\partial h}{\partial \theta_0}[\frac{h_{\theta}-y}{h_{\theta}(h_{\theta}-1)}]， \end{align*} $$ 我們可以發現 $$\begin{align*} \displaystyle\frac{\partial h}{\partial \theta_0} &= \frac{-\cdot e^{-\theta\cdot x}}{(1+e^{-\theta\cdot x})^2} \\ &=-\cdot e^{-\theta\cdot x}(h_{\theta})^2， \end{align*}$$ 將此式代入上式，可以得到 $$\begin{align*} \displaystyle\frac{\partial l }{\partial \theta_0} &=\frac{\partial h}{\partial \theta_0}[\frac{h_{\theta}-y}{h_{\theta}(h_{\theta}-1)}]\\ &=(h_{\theta}-y)[\frac{-h_{\theta}e^{-\theta\cdot x}}{h_{\theta}-1}] \\ &=(h_{\theta}-y)[\frac{-h_{\theta}e^{-\theta\cdot x}}{\frac{e^{-\theta\cdot x}}{1+e^{-\theta\cdot x}}}] \\ &=(h_\theta-y)[\frac{-h_{\theta}e^{-\theta\cdot x}}{h_{\theta}e^{-\theta\cdot x}}]\\ &=y-h_{\theta}， \end{align*} $$，同理，我們也會有 $$\displaystyle\frac{\partial l }{\partial \theta_1} =(y-h_{\theta})bv $$ 和 $$\displaystyle\frac{\partial l }{\partial \theta_2} =(y-h_{\theta})abMag.$$ 選定起始點 $\theta_0$ 後，便可藉由 Stochastic Gradient Descent 的遞迴關係 :::info $$\displaystyle\theta_{i+1}=\theta_i-\eta\frac{\partial l}{\partial \theta_i},i=0,1,2.$$ ::: 我們便可藉由調整 learning rate $\eta$ 求得係數 $\theta_0,\theta_1,\theta_2$ 的近似解，底下為我們求得係數 $\theta_0,\theta_1,\theta_2$ 的近似解的程式; ```python import math import random as rd import numpy as np import pandas as pd import matplotlib.pyplot as plt import datetime def stochastic_Gradient_Descent(filePath, initial_point, learningRate, iteratedNumber): # 讀取資料 star_Df = pd.read_csv(filePath) # 資料總筆數 dataSize = len(star_Df) # theta_0, theta_1, theta_2 初始值 theta_0 = initial_point[0] theta_1 = initial_point[1] theta_2 = initial_point[2] partial_L_wrt_theta_0 = 0 partial_L_wrt_theta_1 = 0 partial_L_wrt_theta_2 = 0 start_time = datetime.datetime.now() print("StartTime:", start_time) # 進行迭代 for i in range(iteratedNumber): # 從 n 筆資料隨機選取一筆資料 randomIndex = rd.randint(0,dataSize-1) bv = star_Df.loc[randomIndex,"B-V"] abMag = star_Df.loc[randomIndex,"Amag"] isGiant = star_Df.loc[randomIndex,"TargetClass"] z = theta_0 + theta_1*bv + theta_2*abMag a = 1/(1+math.exp((-1)*z)) - isGiant #Loss Function 的偏微分， wrt 為 with respect to 縮寫，L代表 Loss function partial_L_wrt_theta_0 = a partial_L_wrt_theta_1 = a * bv partial_L_wrt_theta_2 = a * abMag theta_0 -= (learningRate) * partial_L_wrt_theta_0 theta_1 -= (learningRate) * partial_L_wrt_theta_1 theta_2 -= (learningRate) * partial_L_wrt_theta_2 print('theta_0:',theta_0) print('theta_1:',theta_1) print('theta_2:',theta_2) current_time = datetime.datetime.now() print("EndTime:", current_time) print("Time Cost", current_time-start_time) cost = 0 try: for i in range(dataSize): # 從 n 筆資料隨機選取一筆資料 randomIndex = rd.randint(0,dataSize-1) bv = star_Df.loc[randomIndex,"B-V"] abMag = star_Df.loc[randomIndex,"Amag"] isGiant = star_Df.loc[randomIndex,"TargetClass"] z = theta_0+ theta_1 * bv + theta_2* abMag h =1/(1+math.exp(-z)) cost -= isGiant *(math.log(h)) + (1-isGiant)*(math.log(1-h)) print('loss:',cost/dataSize) except: print('Something Wrong while Calculating loss') return (theta_0, theta_1, theta_2) ``` ## 3.3 調整參數求得 $\theta$ 近似值藉由選定起始點 $\theta =(\theta_0,\theta_1,\theta_2) = (0,0,0),$ learning rate = 0.0001, 迭代 100000 次，配合下方的 Code，可以求得下面的圖和訓練結果 ```python # 訓練集檔案存放路徑 trainingSet_Path = 'Data\Star3642_balanced.csv' theta = stochastic_Gradient_Descent(trainingSet_Path, (0,0,0), 0.0001, 100000) # 訓練集檔案存放路徑 trainingSet_Path = 'Data\Star3642_balanced.csv' # 使用 panda.read_csv 方法讀取檔案路徑內的資料 star_Df = pd.read_csv(trainingSet_Path) plt.figure(figsize=(7,4)) # 使用 matplotlib 套件繪製訓練集資料 for i in range(len(star_Df)): # 色指數, bv-color-index bv = star_Df.loc[i,"B-V"] # 絕對星等, absolute magnitude abMag = star_Df.loc[i,"Amag"] # 是否是巨星, 1 代表是巨星，反之 0 代表是矮星 isGiant = star_Df.loc[i,"TargetClass"] # 是巨星的資料標計為紅色，反之為矮星，標記為藍色 color='' if(isGiant == 1): color = 'ro' else: color = 'bo' # 將資料畫在畫布上 plt.plot(bv, abMag, color, markersize = 2) t = np.arange(-0.5, 3., 0.001) theta0 = theta[0] theta1 = theta[1] theta2 = theta[2] a = -(theta1/theta2) b= -(theta0/theta2) print("The Decision Boundary is ABMag = ",a, "* B_V+",b,".") plt.plot(t, a*t+b ,'go',markersize=0.5) plt.xlabel('B_V') plt.ylabel('AbMag') plt.show() ``` 迭代結果: ![](https://hackmd.io/_uploads/Hk5EWet2n.png) ![](https://hackmd.io/_uploads/SkqKZxF23.png) 根據圖表，可以觀察到綠色的線分割了大多數的紅點和藍點。 # 4. 評估選取的Decision Boundary 準確率最後我們可以利用資料來源內的「Star39552_balanced.csv」裡面的 39552筆資料，去估算此次分類的準確率。 ```python= import numpy as np import pandas as pd import matplotlib.pyplot as plt count = 0 #parameters of hypothesis function theta_0 = -0.3396695976738397 theta_1 = -1.2728283043292683 theta_2 = 0.09319430690186153 # 預測的函數 def predict(bv, abMag): predictClass = 0 if( (theta_0 + theta_1*bv+ theta_2*abMag) >= 0 ): predictClass = 1 return predictClass # 測試集檔案存放路徑 testingSet_Path = 'Data\Star39552_balanced.csv' # 使用 panda.read_csv 方法讀取檔案路徑內的資料 star_Df = pd.read_csv(testingSet_Path) dataSize = len(star_Df) # 使用 matplotlib 套件繪製訓練集資料 for i in range(dataSize): # 色指數, bv-color-index bv = star_Df.loc[i,"B-V"] # 絕對星等, absolute magnitude abMag = star_Df.loc[i,"Amag"] # 是否是巨星，1 代表是巨星，反之 0 代表是矮星 isGiant = star_Df.loc[i,"TargetClass"] #判別分類是否正確 if(predict(bv, abMag) == isGiant): count += 1 print("The accuracy rate is",(100*count)/dataSize,"%.") ``` 使用上方的 Python code，可以得到**準確率約為 86.453%。**，看起來還算是個不錯的數字。 ![](https://hackmd.io/_uploads/HJGWmxK32.png) 最後這39552筆的資料畫出來的圖為以下結果: ![](https://i.imgur.com/v2gOlbH.png) # 5. 使用 Scikit Learn 進行 Logistic Regression 除了上述的程式外，我們也可以用其他套件簡單地進行 Logistic Regression，底下以 Scikit Learn 為範例； ```python import numpy as np import pandas as pd import matplotlib.pyplot as plt from sklearn.linear_model import LogisticRegression from sklearn.model_selection import train_test_split from sklearn import metrics # 訓練集檔案存放路徑 trainingSet_Path = 'Data\Star3642_balanced.csv' # 使用 panda.read_csv 方法讀取訓練集內的資料 star_Df_ForTraining = pd.read_csv(trainingSet_Path) # 取得訓練集 X_train = star_Df_ForTraining[['B-V','Amag']].to_numpy() y_train = star_Df_ForTraining[['TargetClass']].to_numpy().ravel() # 測試集檔案存放路徑 testingSet_Path = 'Data\Star39552_balanced.csv' # 使用 panda.read_csv 方法讀取測試集內的資料 star_Df_ForTesting = pd.read_csv(testingSet_Path) # 取得測試集 X_test = star_Df_ForTesting[['B-V','Amag']].to_numpy() y_test = star_Df_ForTesting[['TargetClass']].to_numpy().ravel() # Logistic Regression fit 測試集 logistic_regression = LogisticRegression(random_state = 100) logistic_regression.fit(X_train, y_train) #使用測試集預測結果 y_pred = logistic_regression.predict(X_test) # 取得 Confusion Matrix cnf_matrix = metrics.confusion_matrix(y_test, y_pred) # 印出 Confusion Matrix print("The Confusion Matrix is:") print(cnf_matrix) # 取得 Accuracy testSet_Size = len(star_Df_ForTesting) print("Accuracy:", (cnf_matrix[0][0]+cnf_matrix[1][1])/testSet_Size) ``` 底下為使用 Scikit Learn 套件的結果，可以發現 Accuracy 也差不多是同樣的數字。 ![](https://hackmd.io/_uploads/HJjxQWY33.png) # 6. Summary 仔細觀察一下，就能發現該手法能簡單的對 N 類問題進行分類囉，例如將資料分成"A類" 和"非A類"，再從 "非A類" 中分出 "B類" 和 "非B類"，接著再從"非B類" 中分出 "C類" 和 "非C類"，以此類推，就能分出 "N類"了ＸＤ以上就是 Logistic Regression 最簡單基本的範例與用法 :)) # 參考資料來源： 1.Giant star : https://en.wikipedia.org/wiki/Giant_star 2.Dwarf star : https://en.wikipedia.org/wiki/Dwarf_star 3.Absolute magnitude : https://en.wikipedia.org/wiki/Absolute_magnitude 4.Luminosity: https://en.wikipedia.org/wiki/Luminosity 5.B_V color index: https://en.wikipedia.org/wiki/Color_index 6.數據來源:https://www.kaggle.com/vinesmsuic/star-categorization-giants-and-dwarfs 7.sklearn Logistic Regression:https://www.datacamp.com/tutorial/understanding-logistic-regression-python