###### tags: `Study` `College`

# Machine Learning

## Inspecting the Data in a File

```
dataset.head()
dataset.info()
dataset.isna().sum()
```

**Command + I shows the documentation for the selected object.**

---

## Data Preprocessing

### Introduction

Before working with any dataset, first check it for missing values. If any are missing, fill them in with one of the imputation methods covered earlier. Only then move on to the next stage, such as data analysis or linear regression.

![](https://i.imgur.com/UuMzFZW.png)

---

First, a quick introduction to Pandas. Put simply, it brings Excel's table concept into Python: most of what you do in Excel can be done with Pandas functions, such as summing columns, grouping, pivot tables, subtotals, and drawing line charts or pie charts. Many books cover NumPy before Pandas because Pandas stores its numeric data in NumPy arrays, and NumPy's data structures make Pandas computations faster and more memory-efficient.

We need pandas and numpy as our tools, so import them.

### Importing the libraries

```python=
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```

- pandas: data handling
- matplotlib.pyplot: plotting

---

### Importing the dataset

```python=
dataset = pd.read_csv("./name.csv")
x = dataset.iloc[:, :-1].values   # every column except the last: the features
y = dataset.iloc[:, 3].values     # column index 3: the target
```

---

### Dealing With Missing Data

```python=
from sklearn.impute import SimpleImputer
# Replace NaN entries in columns 1 and 2 with the column mean
imputer = SimpleImputer(missing_values = np.nan, strategy='mean', fill_value = None)
imputer = imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])
```

- sklearn.impute: the scikit-learn module for filling in missing values
- SimpleImputer: the class that handles missing values

---

### Categorical Data

```python=
from sklearn.preprocessing import LabelEncoder , OneHotEncoder
from sklearn.compose import ColumnTransformer
labelencoder_x = LabelEncoder()
x[:,0] = labelencoder_x.fit_transform(x[:,0])
ct = ColumnTransformer([("Country", OneHotEncoder(), [0])] , remainder='passthrough')
x = ct.fit_transform(x)
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
```

- Categorical data: data that labels which category each record belongs to.
- LabelEncoder: turns a categorical column (e.g. country) into the plain integers 0, 1, 2. Dummy encoding can then turn those into 001, 010, 100, a representation with no implied order or magnitude.
- OneHotEncoder: the class in sklearn.preprocessing that performs this dummy encoding, i.e. it creates the dummy variables.

---

### Splitting the Dataset into the Training Set and Test Set

```python=
from sklearn.model_selection import train_test_split
x_train , x_test , y_train , y_test = train_test_split( x , y , test_size = 0.2 , random_state = 0 )
```

---

### Feature Scaling

```python=
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
```

Use the StandardScaler class from sklearn.preprocessing, which provides standardization. Create a StandardScaler() object, then call its fit_transform() method on x_train to fit the scaler and transform the data. Because sc_x has already been fitted, line 4 does not need fit_transform() again; transform() alone applies the training-set mean and standard deviation to x_test.

:::spoiler Example code
```python=
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv('Data.csv')
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy='mean', fill_value = None)
imputer = imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])

from sklearn.preprocessing import LabelEncoder , OneHotEncoder
from sklearn.compose import ColumnTransformer
labelencoder_x = LabelEncoder()
x[:,0] = labelencoder_x.fit_transform(x[:,0])
ct = ColumnTransformer([("Country", OneHotEncoder(), [0])] , remainder='passthrough')
X = ct.fit_transform(x)
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

from sklearn.model_selection import train_test_split
x_train , x_test , y_train , y_test = train_test_split( X , y , test_size = 0.2 , random_state = 0 )

from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
```
:::

###### y = ax + b, where y is the dependent variable and x is the independent variable

---

## Simple Linear Regression

![](https://i.imgur.com/pM5kLAx.png)

**A model of the linear relationship between one dependent variable and one independent variable is called a simple linear regression model.** Regression analysis divides the variables under study into a **dependent variable** and **independent variables**, builds a function that models the dependent variable (Y) in terms of the independent variable (X), and then estimates the parameters of that function from sample data. It has two main purposes: to explain what the data did in the past, and to **use the independent variable (X) to predict future values of the dependent variable (Y)**.

### Splitting into training and test sets

```
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0 )
```

---

### Training the model

```
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train , y_train)
y_pred = regressor.predict(x_test)  # predict the results for the test set
```

regressor.fit(independent variable, dependent variable): once this runs, the model is trained.

y_pred = regressor.predict(independent variable X of the test set)

---

```
plt.scatter(x_train, y_train, color='red')
plt.plot(x_train, regressor.predict(x_train), color='blue')
plt.title('Salary VS Experience (training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
```

plt.scatter(x values, y values, point color): **scatter draws points**

plt.plot(x values, y values, line color): **plot draws a line**

Here regressor.predict(x_train) plays the role of f(x) = Y, so it supplies the y values.

### Computing the residuals

A residual is the actual value minus the predicted value:

residuals = y_test - y_pred

y_test is the reference (the known, actual output), and y_pred is the model's output for the test set.

### Drawing a residual plot

```
import seaborn as sns
# residplot fits its own linear regression of y on x and plots the residuals
# of that fit, so pass the observed targets rather than precomputed residuals.
sns.residplot(x=x_test.flatten(), y=y_test.flatten(), lowess=True, color="g")
plt.xlabel("x")
plt.ylabel("Residuals")
plt.title("Residual Plot")
plt.show()
```

Compute summary statistics of the residuals, such as the mean, standard deviation, and median:

```
residuals_mean = np.mean(residuals)
residuals_std = np.std(residuals)
residuals_median = np.median(residuals)
```

Check whether the residuals are normally distributed. The null hypothesis of the Shapiro-Wilk test is that the data are normal, so p < 0.05 rejects normality, while p ≥ 0.05 means the residuals are consistent with a normal distribution.

```
from scipy.stats import shapiro
_, p_value = shapiro(residuals)
print("Shapiro-Wilk normality test p-value:", p_value)
```

---

## Multiple Linear Regression

![](https://i.imgur.com/N6S0obr.png)

![](https://i.imgur.com/bUaedJo.png)

![](https://i.imgur.com/WUYPMmT.png)

### Dummy encoding

![](https://i.imgur.com/MQQXCiN.png)

This column is categorical rather than continuous, so it cannot be used directly as x4. It has to be label-encoded first.

![](https://i.imgur.com/OHfkwbR.png)

After encoding, the two resulting columns are perfectly linearly dependent (one is always 1 minus the other), which violates the assumptions made earlier. The fix is to leave one dummy variable out, which avoids the dummy variable trap.

![](https://i.imgur.com/sKf0Bhy.png)

![](https://i.imgur.com/RQzIae1.png)

![](https://i.imgur.com/B1CP09s.png)

- Backward elimination: start with every variable in the model, then remove them one at a time.
- Stepwise regression
- Score comparison: fit every possible combination of variables and keep whichever scores best.

---

Build the multiple linear regression model. There are several independent variables, so this is multiple rather than simple linear regression.

```python=
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test)
```

The values in y_pred are determined by x (x_test): the model predicts from the test inputs, and the predictions are then judged against the actual data.

---

Eliminating columns:

```
import statsmodels.api as sm
# Prepend a column of ones so that OLS fits an intercept
x_train = np.append(arr=np.ones((len(x_train), 1)).astype(int), values=x_train, axis=1)
x_opt = x_train[:, [0,1,2,3,4,5]]
x_opt = np.array(x_opt, dtype=float)
regressor_OLS = sm.OLS(endog=y_train, exog=x_opt).fit()
regressor_OLS.summary()
```

In the np.append call, the number of ones must equal the number of rows in x_train, i.e. the number of training samples (1070 in this dataset).

Backward elimination, round 1.

```
x_opt = x_train[:, [0,1,3,4,5]]
x_opt = np.array(x_opt, dtype=float)
regressor_OLS = sm.OLS(endog=y_train, exog=x_opt).fit()
regressor_OLS.summary()
```

Backward elimination, round 2.

```
x_opt = x_train[:, [0,1,3,4]]
x_opt = np.array(x_opt, dtype=float)
regressor_OLS = sm.OLS(endog=y_train, exog=x_opt).fit()
regressor_OLS.summary()
```

Backward elimination, round 3.

---

## Polynomial Regression

The graph of a polynomial regression is a curve (a parabola when the degree is 2), because the model is a polynomial in a single variable.

![](https://i.imgur.com/KRPAgR4.png)

### Why Linear?

![](https://i.imgur.com/4SBYQb0.png)

In **b₀ + b₁x₁ + b₂x₁²** ... the terms are still combined only by multiplying by a coefficient and adding, so the model is a linear combination in its coefficients. If a coefficient appeared in a division, the relationship would no longer be linear.

```
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
```

---

```
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=4)
X_poly = poly_reg.fit_transform(X)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y)
```

degree is the highest power used. For polynomial regression it must be at least 2; otherwise the model is just simple linear regression.

X_poly = poly_reg.fit_transform(X) processes the independent variable and outputs the polynomial feature columns. The output includes a column of ones (x₀ = 1), which is what b₀ is multiplied by.

---

```python=
plt.scatter(X, y, color = 'red')
plt.plot(X, lin_reg_2.predict(poly_reg.transform(X)), color = 'blue')
plt.title('Truth or Bluff (Polynomial Regression)')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.show()
```

This plots the predictions of the fitted model.

---

```python=
# Visualising the Polynomial Regression results on a finer grid
X_grid = np.arange(X.min(), X.max(), 0.1)
X_grid = X_grid.reshape(len(X_grid), 1)
plt.scatter(X, y, color = 'red')
plt.plot(X_grid, lin_reg_2.predict(poly_reg.transform(X_grid)), color = 'blue')
plt.title('Truth or Bluff (Polynomial Regression)')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.show()

new_x = 6.5
new_x = np.array(new_x).reshape(-1, 1)
lin_reg.predict(new_x)
lin_reg_2.predict(poly_reg.transform(new_x))
```

This version makes the curve smoother by shortening the step between plotted points from 1 to 0.1. The points joined by the line are closer together, so the curve looks better.
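
The feature expansion that PolynomialFeatures performs, including the column of ones that multiplies b₀, is easier to see on a tiny input. The following is a minimal sketch; `X_demo`, `poly_demo`, and `X_demo_poly` are hypothetical names used only for illustration, and degree 2 is chosen to keep the output small.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical toy input: two samples with a single feature each.
X_demo = np.array([[2.0], [3.0]])

poly_demo = PolynomialFeatures(degree=2)
X_demo_poly = poly_demo.fit_transform(X_demo)

# Columns are x^0 (the constant 1 that multiplies b0), x^1 and x^2.
print(X_demo_poly)
# [[1. 2. 4.]
#  [1. 3. 9.]]
```

Once the transformer has been fitted, calling `transform` on new inputs is enough. Refitting with `fit_transform` gives the same result here but is unnecessary.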
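
new_x must be reshaped into a 2-D array so that it matches the structure the model expects.

- lin_reg.predict(new_x): the prediction from simple linear regression
- lin_reg_2.predict(poly_reg.fit_transform(new_x)): the prediction from polynomial regression

---

## R Squared

![](https://i.imgur.com/gOGwXNd.png)

yᵢ is the target (actual) value and ŷᵢ is the model's output. Their difference is the residual: the smaller it is, the more accurate the model, because the output is close to the target.

![](https://i.imgur.com/wxqPmz0.png)

R² = 1 − SSres / SStot, where SSres = Σ(yᵢ − ŷᵢ)² is the residual sum of squares and SStot = Σ(yᵢ − ȳ)² is the total sum of squares. ȳ is the mean of the actual values. For a given dataset the denominator, SStot, does not change.

---

### Interpreting R squared

![](https://i.imgur.com/DEwYnBL.png)

By definition, the larger SSres / SStot is, the smaller 1 minus that ratio becomes. A small R² therefore means a less accurate model, because large residuals mean the predictions are far from the actual values. Conversely, the larger R² is, the better the model.

For a least-squares model with an intercept, evaluated on the data it was fitted to, R² lies between 0 and 1: 0 ≦ R² ≦ 1. On held-out data it can be negative.

---

## Adjusted R Squared

![](https://i.imgur.com/bwE9HQn.png)

A smaller SSres is better, and R² never decreases when another independent variable is added, because the fit can always set the new variable's coefficient to 0 and do at least as well as before. Plain R² therefore cannot tell whether an extra variable actually helps, which is what adjusted R² corrects for.

![](https://i.imgur.com/sh5pNLn.png)

---

## Logistic Regression

Logistic regression handles classification problems. The goal is to find a straight line (a linear decision boundary) that separates the classes.

---

```python=
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(x_train , y_train)
y_pred = classifier.predict(x_test)
```

---

```python=
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
```

The confusion matrix compares the predictions with the true labels of the test data to show how closely they agree.

![](https://i.imgur.com/xjHT1rG.png)

The columns are labeled 0 and 1 because there are only two outcomes, no or yes; which is which depends on how the labels were defined.

The diagonal from top left to bottom right holds the counts of correct predictions. In the figure above, 57 of the "did not purchase" cases and 17 of the "purchased" cases are correct.

The diagonal from bottom left to top right holds the counts of errors: 5 of the "did not purchase" cases are wrong, and 1 of the "purchased" cases is wrong.

---

```python=
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
# Dense grid covering the feature space, used to colour the decision regions
x1, x2 = np.meshgrid(np.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color = ListedColormap(('orange', 'blue'))(i), label = j)
plt.title('Logistic Regression (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
```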
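
To make the confusion-matrix reading in the logistic regression section concrete, the sketch below turns the four counts quoted there into an accuracy figure. It assumes scikit-learn's convention (rows are true labels, columns are predicted labels). Which off-diagonal cell holds the 5 and which holds the 1 is an assumption, and it does not change the accuracy. `cm_demo` is a hypothetical name.

```python
import numpy as np

# Counts quoted in the text. The placement of the 5 and the 1 in the
# off-diagonal cells is an assumption made for illustration.
cm_demo = np.array([[57, 1],
                    [5, 17]])

correct = np.trace(cm_demo)   # main diagonal: 57 + 17 = 74 correct predictions
total = cm_demo.sum()         # all 80 test samples
accuracy = correct / total
print(accuracy)  # 0.925
```

The same number is what `sklearn.metrics.accuracy_score(y_test, y_pred)` would report for predictions that produce this matrix.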

![](https://i.imgur.com/8ahptMr.png)

Result for the training data. The red region is class 0 and the green region is class 1.

![](https://i.imgur.com/bvIoQ0Z.png)

Result for the test data.

:::spoiler Example code
```python=
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

dataset = pd.read_csv("Social_Network_Ads.csv")
x = dataset.iloc[:, 2:4].values
y = dataset.iloc[:, 4].values

from sklearn.model_selection import train_test_split
x_train , x_test , y_train , y_test = train_test_split(x , y , test_size = 0.2 , random_state = 0 )

from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(x_train , y_train)
y_pred = classifier.predict(x_test)

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = np.meshgrid(np.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color = ListedColormap(('orange', 'blue'))(i), label = j)
plt.title('Logistic Regression (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

x_set, y_set = x_test, y_test
x1, x2 = np.meshgrid(np.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color = ListedColormap(('orange', 'blue'))(i), label = j)
plt.title('Logistic Regression (Testing set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
```
:::

---

## SVM Support Vector Machine

### Definition

- An SVM is a supervised learning model, with associated learning algorithms, that analyzes data for classification and regression analysis.
- The central idea of SVM is to take a dataset in which several classes are mixed together and, based on some features, find the best hyperplane that separates the classes. The best hyperplane is the one whose distance to the boundary of both classes is as large as possible. The samples closest to the boundary give the SVM the most information for classification, and they are called support vectors.

:::spoiler Reference code
```python=
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dataset = pd.read_csv("Social_Network_Ads.csv")
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

from sklearn.model_selection import train_test_split
x_train , x_test , y_train , y_test = train_test_split(x , y , test_size = 0.25 , random_state = 0 )

from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)

from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = np.meshgrid(np.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color = ListedColormap(('orange', 'blue'))(i), label = j)
plt.title('SVM (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

x_set, y_set = x_test, y_test
x1, x2 = np.meshgrid(np.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color = ListedColormap(('orange', 'blue'))(i), label = j)
plt.title('SVM (Testing set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
```
:::

---

## SVM Image Recognition Support Vector Machine

```python=
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dataset = pd.read_csv("mnist_train.csv")
x = dataset.iloc[:, 1:785].values
y = dataset.iloc[:, 0].values
```

- Range of x: from pixel 0 (column 1) to the last pixel, pixel 783 (column 784). This can also be written `[:, 1:]`.
- y: the digit, i.e. the answer (label).

---

```python=
show_img = np.reshape(x[1, 0:784], (28, 28))
plt.matshow(show_img, cmap = plt.get_cmap('gray'))
plt.show()

x[x>0] = 1

from sklearn.model_selection import train_test_split
x_train , x_test , y_train , y_test = train_test_split(x , y , test_size = 0.5 , random_state = 0 )
```

Take every column of the row at index 1, reshape those values into a 28∗28 pixel grid, and display the grid with a grayscale colormap so that only shades of gray are shown.
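
The reshape and the `x[x>0]=1` masking used in the block above can be checked on a small stand-in array. This is a minimal sketch: `row`, `img_demo`, and `binary` are hypothetical names, and the three non-zero pixel values are made up for illustration rather than taken from mnist_train.csv.

```python
import numpy as np

# Hypothetical stand-in for one image row: 784 pixel intensities in 0..255.
row = np.zeros(784, dtype=np.int64)
row[[100, 101, 128]] = [30, 200, 255]

# Row-major reshape: flat pixel k lands at (k // 28, k % 28).
img_demo = row.reshape(28, 28)
print(img_demo[100 // 28, 100 % 28])   # 30

# Same masking as x[x>0]=1: every non-zero intensity becomes 1.
binary = row.copy()
binary[binary > 0] = 1
print(binary.sum())                    # 3

# Flattening the 28x28 grid gives back the original 784-value row.
print(np.array_equal(img_demo.reshape(784), row))   # True
```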
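
Each image was originally flattened into a vector of 784 values, which is why it has to be reshaped back. The `x[x>0]=1` line sets every value greater than 0 (1 to 255) to 1, which guarantees that every pixel carrying a value becomes pure white (1).

---

```python=
from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)
```

rbf: the radial basis function kernel, also called the Gaussian kernel. It is not a polynomial function; it depends on the squared distance between two samples, K(x, x′) = exp(−γ‖x − x′‖²).

---

```python=
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
```

Check the prediction results.

---

### Reading a handwritten digit

```python=
import matplotlib.image as mping
img = mping.imread('4.png')
img = img[:, :, 2]   # keep one colour channel so the array is 2-D
plt.matshow(img, cmap = plt.get_cmap('gray'))
plt.show()
# If the model was trained on binarized pixels, binarize test_img the same way first
test_img = img.reshape((1, 784))
img_class = classifier.predict(test_img)
```

- mping.imread: reads the image file.
- img[:,:,2]: this is indexing rather than a reshape. The image is read in as a 3-D RGB array, so selecting a single channel (index 2) reduces it to the 2-D array that is needed.
- reshape((1,784)): converts the 28∗28 image into the format used for training above.

---

## Naive Bayes

### Why Naive?

Because of the independence assumption: the features are assumed to be independent of one another given the class, with no correlation between them.

```python=
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)
```

The results can then be inspected with confusion_matrix, as before.

---

## Decision Tree

![](https://hackmd.io/_uploads/HJLHJjirn.png)

![](https://hackmd.io/_uploads/ryQikoiB2.png)

---

## Vocabulary

- scatter (分散；散布；離散): to spread out; in a scatter plot it refers to the individual points
- regression (回歸；病情消退): regression; in medicine, the remission of an illness
- regressor (回歸量；回歸分析): a variable or model used in regression analysis
- plot (圖表): chart, graph
- kernel (核心；內核): core, kernel
- categorical data (分類數據): data that labels categories
- feature scaling (特徵縮放): rescaling features to a comparable range
- data preprocessing template (數據預處理模板): a template, i.e. a worked example, for data preprocessing
- residual (殘差；殘餘的): residual; remaining
- degree (維度；數量；程度；水準): degree (the power of a polynomial); amount; extent; level
- naive Bayes (簡單貝氏分類器): the naive Bayes classifier
- naive (簡單；單純): simple
- Bayes (貝氏): Bayes

---

https://hackmd.io/@wMAlIWIwRW6zmAGjk95JWA/rJrrfqXm2#%E9%82%8F%E8%BC%AF%E5%9B%9E%E6%AD%B8

https://hackmd.io/@wMAlIWIwRW6zmAGjk95JWA/B1tYmnUQh#%E9%82%8F%E8%BC%AF%E5%9B%9E%E6%AD%B8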