Machine Learning

tags: data analysis, machine learning

Introduction to Machine Learning

Categorized by Input

Supervised Learning

  • The input data comes with answers (labels)

Unsupervised Learning

  • The input data has no answers (labels)
  • Cluster unlabelled data
    • Let the computer group similar data together and keep dissimilar data far apart

Semi-supervised Learning

  • Part of the data is labeled and part is not

Reinforcement Learning

  • Give it data without answers
  • Correct answers receive positive feedback; wrong answers receive negative feedback

Categorized by Output


Regression

  • The prediction target is a continuous value
  • e.g.:
    • Stock price prediction
    • The exact final score of a pro baseball game
  • Predicting an exact value is harder

Classification


  • The prediction target is a category
  • e.g.:
    • Predicting whether a stock goes up or down
    • Which team wins a pro baseball game
  • No exact value is needed, so it is easier

Machine learning workflow


Data input

Feature engineering

Missing values
  • Numeric
    • Drop the whole record
    • Fill in the mean or the median
  • Categorical
    • Use the mode
    • Create an "other" category
Outliers
  • Remove outliers
Split data
  • Training data
    • Data used to train the model
    • Usually 80%
    • Usually larger than the testing data
  • Testing data
    • Data used to test the model
    • Usually 20%
    • Usually smaller than the training data
Normalization


  • When feature scales differ by orders of magnitude, the learned parameters also end up on wildly different scales
  • Squash every column into [0, 1] or [-1, 1]
  • Common methods (see the sketch below)
    • min-max
      • Range [0, 1]
      • $\frac{x_i - x_{min}}{x_{max} - x_{min}}$
    • Z-score standardization
      • Roughly [-1, 1] (zero mean, unit variance; not strictly bounded)
      • $\frac{x_i - \mu}{\sigma}$
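A minimal numpy sketch of the two formulas above (in practice sklearn's MinMaxScaler and StandardScaler do the same job):

import numpy as np

def min_max_scale(x):
    # Rescale each column to [0, 1]: (x_i - x_min) / (x_max - x_min)
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

def z_score_standardize(x):
    # Center each column at 0 with unit variance: (x_i - mu) / sigma
    return (x - x.mean(axis=0)) / x.std(axis=0)

# Toy data: the second column is 200x the first
data = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
print(min_max_scale(data))        # every value now falls in [0, 1]
print(z_score_standardize(data))  # every column now centered at 0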

Select model


Validate trained model - regression
  • Used to evaluate how well the model performs
MSE
  • $MSE = \frac{1}{n}\sum_{i=1}^{n}(f_i - y_i)^2$
  • $f_i$ : predicted value
  • $y_i$ : actual answer
  • $n$ : number of samples
  • The smaller the value, the smaller the error
R squared (coefficient of determination)
  • (concept still unclear)
  • $R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - f_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$
  • $y_i$ : actual answer
  • $f_i$ : predicted value
  • $\bar{y}$ : mean of the answers
  • Compares the spread of the predictions against the spread of the answers
  • The closer to 1 the better; smaller values mean worse performance (see the sketch below)
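Both metrics follow directly from the formulas above; a small numpy sketch with toy values:

import numpy as np

def mse(y_true, y_pred):
    # Mean squared error: smaller means smaller average error
    return np.mean((y_pred - y_true) ** 2)

def r_squared(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot; the closer to 1, the better the fit
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

y_true = np.array([3.0, 5.0, 7.0])   # toy answers
y_pred = np.array([2.5, 5.0, 7.5])   # toy predictions
print(mse(y_true, y_pred))        # 0.1666...
print(r_squared(y_true, y_pred))  # 0.9375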
Validate trained model - classification
$accuracy = \frac{\#\ of\ correct\ predictions}{\#\ of\ data}$


$accuracy = \frac{TP + TN}{TP + FP + FN + TN}$


  • True Positive (TP): reality is "yes" and the model says "yes".
  • True Negative (TN): reality is "no" and the model says "no".
  • False Positive (FP): reality is "no" but the model says "yes".
  • False Negative (FN): reality is "yes" but the model says "no".
  • $Precision = \frac{TP}{TP + FP}$
  • $Recall = \frac{TP}{TP + FN}$
F1-score
  • $F1 = \frac{2}{\frac{1}{recall} + \frac{1}{precision}} = \frac{2 \cdot precision \cdot recall}{precision + recall}$
  • The closer to 1 the better; the closer to 0 the worse
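The four counts translate directly into the metrics; a sketch with made-up counts:

def classification_metrics(tp, fp, fn, tn):
    # Precision: of everything the model called "yes", how much really was
    precision = tp / (tp + fp)
    # Recall: of everything that really was "yes", how much the model caught
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Hypothetical confusion counts
print(classification_metrics(tp=40, fp=10, fn=20, tn=30))
# precision 0.8, recall ~0.667, f1 ~0.727, accuracy 0.7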
ROC
  • Sweep over different threshold values to draw the ROC curve
  • AUC (area under curve)
    • AUC = 0.5 (no discrimination)
    • 0.7 ≤ AUC ≤ 0.8 (acceptable discrimination)
    • 0.8 ≤ AUC ≤ 0.9 (excellent discrimination)
    • 0.9 ≤ AUC ≤ 1.0 (outstanding discrimination)
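A minimal sklearn example of sweeping thresholds and computing the AUC (the labels and scores here are made up):

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted probabilities of the positive class

# One (FPR, TPR) point per threshold traces out the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr, tpr, thresholds)
print('AUC : {}'.format(roc_auc_score(y_true, y_score)))  # 0.75 for this toy data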

Confusion matrix


  • Generalizes from binary to multi-class classification
  • The diagonal entries are the correct predictions
Bias-Variance Tradeoff


  • Variance : how stable the predictions are
  • Bias : how far the predictions center from the target
  • A model that is too simple (e.g. too few variables) lands in the high-bias region
  • A model that is too complex lands in the high-variance region and becomes unstable
    • overfitting
No free lunch theorem

  • No single algorithm outperforms all others on every problem
  • Different problems call for different solutions

Output

Application domains

  • Autonomous driving
  • Payments
    • Face-recognition payment
  • Medical applications
    • Tumor detection
  • Robo-advisors
  • Drones
    • Automatic obstacle avoidance
  • Recommendation and marketing
    • The YouTube recommendation algorithm
  • Smart factories
  • Voice assistants
    • Siri

Regression (Supervised Learning)

Linear Regression

  • Fit a trend line to the existing data and use it to predict unknown data
  • Linear solution

Case

Suppose house prices depend only on the age of the house.

Many trend lines can be drawn through the data.

Use a cost function to find the best-fitting trend line (the one with the smallest error):

$Cost = \sum_{i=1}^{m}(y_i - \hat{y}_i)^2$

Solve with calculus.

Multivariate Linear Regression Models

  • With many variables, a closed-form solution becomes unwieldy
  • Gradient descent is usually used instead
Gradient Descent

  • Can only find a local minimum
  • A learning rate that is too small takes very long; too large and it misses the optimum. Commonly between 0.1 and 1 (see the sketch below)
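A minimal sketch of gradient descent fitting y = wx + b by minimizing the squared-error cost from earlier; the data and learning rate are illustrative:

import numpy as np

def gradient_descent(x, y, lr=0.1, epochs=1000):
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        y_pred = w * x + b
        # Gradients of the MSE cost with respect to w and b
        dw = (2 / n) * np.sum((y_pred - y) * x)
        db = (2 / n) * np.sum(y_pred - y)
        # Step against the gradient, scaled by the learning rate
        w -= lr * dw
        b -= lr * db
    return w, b

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3 * x + 1                      # true line: w = 3, b = 1
print(gradient_descent(x, y))      # converges toward (3, 1)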
Gradient Descent V.S. Normal Equation

  • Gradient Descent
    • The learning rate has to be just right
    • Needs many iterations
    • Works better when n is large (lots of data)
  • Normal Equation
    • No learning rate to choose
    • Requires a matrix inverse, which is cumbersome
    • Slow when n is large (because of the matrix inverse)
Overfitting
  • Leftmost (underfitting)
    • Cannot capture the underlying trend
  • Middle (ideal)
  • Rightmost (overfitting)
    • Fits the training data very well but falls apart on anything new
    • Like a student who only memorizes past exams
    • The more variables, the easier it is to overfit

Code

from sklearn import preprocessing, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

# Data input
df = pd.read_csv('./dataset/housing.csv', header=None, delim_whitespace=True)

# Separate the answer
y = df[13]
x = df.drop(13, axis=1)

# Split data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)

# Normalization
scaler = preprocessing.StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

# Model select
model = linear_model.LinearRegression()
model.fit(x_train, y_train)

# Predict
y_pred = model.predict(x_test)
print('Coefficient : {}'.format(model.coef_))
print('Mean squared error : {}'.format(mean_squared_error(y_test, y_pred)))
print('Variance score : {}'.format(r2_score(y_test, y_pred)))
Coefficient : [-1.02554944  0.96553896  0.16729159  0.58076865 -2.02969655  2.57536082
  0.17046044 -2.84987298  2.50778431 -1.85852862 -2.05829633  0.82864609
 -3.86813384]
Mean squared error : 34.403276579602064
Variance score : 0.6739078917414478

Polynomial Regression

  • Nonlinear solution
  • Lifts low-dimensional data into higher dimensions

nth-degree Polynomial Regression

  • Treat the terms of an nth-degree polynomial as multiple features
  • Other models

Graph

Cross term

  • Two variables are related through arithmetic operations (usually multiplication)
  • From two original features, products and powers generate many new features
import numpy as np
import pandas as pd
from sklearn import linear_model, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures

# Load the file
df = pd.read_csv('./dataset/winequality-red.csv')

# Separate the answer and the data
y = df['quality']
x = df.drop('quality', axis=1)

# Generate degree-2 features
poly = PolynomialFeatures(degree=2).fit(x)
x = poly.transform(x)

# Split data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)

# Normalization
scaler = preprocessing.StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

# Select model
model = linear_model.LinearRegression()
model.fit(x_train, y_train)

# Predict
y_pred = model.predict(x_test)

# Inspect the coefficients
print('The coefficient : {}\n'.format(model.coef_))
print('Mean squared error : {}'.format(mean_squared_error(y_test, y_pred)))
print('Variance score : {}'.format(r2_score(y_test, y_pred)))
The coefficient : [-2.20439093e-11 -3.32036477e+01 -3.46762494e+01 -1.98569253e+01
 -1.40652656e+01 -6.85273381e+01 -7.74114997e+01  1.00727869e+02
 -2.50880009e+01 -7.93478035e+01  4.63945174e+01 -6.70038809e+00
 -1.07912073e+00 -4.28103408e-01 -1.80118094e-01 -5.27741073e-01
 -9.30638006e-01 -6.23191815e-01  5.34638897e-01  3.69394013e+01
 -1.63849368e+00  6.25798078e-01  6.86240624e-02 -1.07543978e-01
  3.25708843e-02 -6.49414077e-02  2.15989372e-01 -5.66509424e-02
  3.43575308e-01  3.51401508e+01 -1.05497188e+00 -2.55960941e-01
  8.98178254e-01 -3.09937110e-02  9.34787505e-02  8.98840857e-02
  1.48808583e-01 -3.20881226e-02  2.21970228e+01 -3.13608092e+00
 -3.12173913e-01  1.11046257e+00 -1.59627758e-01 -2.03532456e-01
  3.43111110e-02 -6.51049711e-04  1.79109779e+01 -2.69409190e+00
  1.29726359e-02 -3.65247752e-01  6.44755908e-02 -5.20131174e-02
 -1.18159029e-01  6.92374227e+01 -5.56852862e-01  1.09112468e-01
  5.56498217e-01 -1.63840021e-01 -3.70231215e-02  7.79292790e+01
 -5.47020463e-01 -4.70828794e-01  1.29740195e+00  8.93119391e-02
 -1.01689650e+02  1.28353602e+00  4.09945512e-01 -1.61154603e+00
  2.10910693e+01  8.08256509e+01 -4.72908479e+01  4.26542501e+00
 -2.92688018e+00  1.05074947e+00  2.71306226e+00 -3.52447248e-01
  7.49136902e-02 -1.16178316e-01]

Mean squared error : 0.4308785588449687
Variance score : 0.29206509289757054

Logistic Regression

  • X : features
  • $\theta$ : parameters

Sigmoid function

  • Output lies in (0, 1)
  • The model's output is often interpreted as a probability
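A couple of numpy lines show the squashing behavior:

import numpy as np

def sigmoid(z):
    # Maps any real number into (0, 1), so the output reads as a probability
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))   # 0.5
print(sigmoid(5))   # ~0.993
print(sigmoid(-5))  # ~0.007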

Cross-entropy (Cost Function)

  • Give the computer two probability distributions and measure how similar they are
  • The more alike the two distributions, the smaller the value
    $C(\theta_0, \theta_1, ..., \theta_n) = -\frac{1}{m}\sum_{i=1}^{m}\left[y_i \log h_\theta(x_i) + (1 - y_i)\log(1 - h_\theta(x_i))\right]$

Information (information content)

  • $p_i$ : probability that an event occurs
  • Define Information : $\log(\frac{1}{p_i})$ (log base 10 here)
  • The more likely the event, the lower its information content
    • The sun rising in the east has probability 100%
    • Information = $\log(\frac{1}{1}) = 0$
  • The less likely the event, the higher its information content
    • Rain tomorrow has probability 50%
    • Information = $\log(\frac{1}{1/2}) \approx 0.3$

Entropy V.S. Cross-entropy

Entropy
  • The expected value of information content
  • Higher when the outcome is more uncertain, lower when it is more certain
Cross-entropy
  • Takes two probability distributions as input
  • Measures how similar the two distributions are

Example

  • Cross entropy is at least as large as entropy (equal only when the two distributions match); see the check below
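A small numpy check of both quantities (log base 2 here, whereas the Information examples above use base 10):

import numpy as np

def entropy(p):
    # Expected information content of distribution p
    p = np.asarray(p)
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    # Information content of q's code averaged under p;
    # equals entropy(p) only when q matches p, larger otherwise
    p, q = np.asarray(p), np.asarray(q)
    return -np.sum(p * np.log2(q))

p = [0.5, 0.5]
q = [0.9, 0.1]
print(entropy(p))           # 1.0
print(cross_entropy(p, q))  # ~1.737, >= entropy(p)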

Cost Function (cross-entropy) (still unclear)

  • h and 1 - h form a probability distribution
  • y and 1 - y are each either one or zero

Learning

  • Gradient descent

Code

import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

pima = pd.read_csv('./dataset/pima-indians-diabetes.csv')

# x = pima[['pregnant', 'insulin', 'bmi', 'age']]
y = pima['label']
x = pima.drop(['label'], axis=1)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)

scaler = preprocessing.StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

model = LogisticRegression()
model.fit(x_train, y_train)

print(model.coef_)
print(model.intercept_)

y_pred = model.predict(x_test)
print(y_pred)

accuracy = model.score(x_test, y_test)
print(accuracy)
[[ 0.33498141  1.03029784 -0.30681406 -0.02318156 -0.07418398  0.67294487
   0.2001099   0.20383277]]
[-0.90342628]
[0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 1 0
 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 1 0 1 0 1 0 1 0 1 0 0 0 0 0 1 1 1 1 1 0
 1 0 1 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0
 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0 0 1 1 0 0 0
 0 0 0 1 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 1 1 1 1 1 0 0 1 0 1 0 0 0 0 1 0 1 0
 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0
 0 1 0 0 0 0 0 1 0]
0.7835497835497836

Classification (Supervised Learning)

K-Nearest Neighbor

  • Can do both classification and regression

  • Usually used for classification

  • Non-parametric

  • Only the value of k has to be chosen

  • k : how many nearest neighbors to consult

  • k = 1

    • The single nearest point is a square, so the sample is a square
  • k = 3

    • The three nearest are one square and two triangles; triangles are the majority, so it is a triangle
  • k = 7

    • The seven nearest are four squares and three triangles; squares are the majority, so it is a square

Step

  1. Look at data
  • Have the original data properly labeled
  2. Calculate distances
  • Compute the distance from every data point to the query point
  3. Find neighbors
  • Take the k nearest data points
  4. Vote from labels
  • Vote among the neighbors
  • The most common class wins

How to Define Distance

  • Manhattan distance (L1 distance)

  • Euclidean distance (L2 distance)
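Both distances in a few lines of numpy:

import numpy as np

def manhattan_distance(a, b):
    # L1: sum of absolute coordinate differences
    return np.sum(np.abs(a - b))

def euclidean_distance(a, b):
    # L2: straight-line distance
    return np.sqrt(np.sum((a - b) ** 2))

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])
print(manhattan_distance(a, b))  # 7.0
print(euclidean_distance(a, b))  # 5.0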

How to choose K

  • Usually an odd number
    • A vote can always be decided; no ties
  • K is small
    • Easily swayed by bad data points
  • K is large
    • Tends to just predict the majority class
    • May always predict the same class

Curse of dimensionality

  • Raise the dimension with polynomial terms
  • Data is easier to separate in higher dimensions
  • Too many dimensions lead to overfitting

Code

import pandas as pd
import numpy as np
from sklearn import preprocessing, neighbors
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv('./dataset/seeds_dataset.csv', header=None)

y = df[7] - 1
x = df.drop(7, axis=1)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

scaler = preprocessing.StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

model = neighbors.KNeighborsClassifier(n_neighbors=5)
model.fit(x_train, y_train)

y_pred = model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
num_correct_samples = accuracy_score(y_test, y_pred, normalize=False)

print('number of correct samples : {}'.format(num_correct_samples))
print('accuracy : {}'.format(accuracy))
number of correct samples : 39
accuracy : 0.9285714285714286

Decision Tree

  • Turns the decision process into a tree structure

How to split on each node?

How to define a good split?

  • A split with a decisive effect, e.g. if A then always Yes, otherwise No, makes A a good node
  • gain = impurity before the split - impurity after the split
  • The larger the gain, the better

CART

  • Classification and Regression Trees (CART)
  • Binary tree
  • Uses Gini as the split-quality measure
Gini Impurity
  • Formula : $Gini = 1 - \sum_{k} p_k^2$
  • The more evenly matched the classes, the larger the Gini; the more one-sided, the smaller
  • Smaller Gini is better (see the sketch below)
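A sketch of the Gini computation under the formula above:

import numpy as np

def gini_impurity(labels):
    # Gini = 1 - sum(p_k^2): 0 for a pure node, maximal for an even split
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

print(gini_impurity([1, 1, 1, 1]))  # 0.0  (pure)
print(gini_impurity([1, 0, 1, 0]))  # 0.5  (even split, worst case for two classes)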

Example_1
  • After splitting, compute the expected Gini
  • A
    • 0.5 - 0.4852 = 0.015
  • B
    • 0.5 - 0.37 = 0.13
  • B is better
Example_2
  • For numeric features
    • Use the mean as the split boundary (here, 80)

ID3

  • Iterative Dichotomiser 3 (ID3)
  • Nodes can have multiple branches
Entropy
  • Formula : $Entropy = -\sum_{k} p_k \log p_k$
  • The more evenly matched the classes, the larger the entropy; the more one-sided, the smaller
  • Smaller entropy is better

Example_1
  • After splitting, compute the expected entropy
  • A
    • 0.301 - 0.294 = 0.007
  • B
    • 0.301 - 0.242 = 0.069
  • B is better
Example_2
  • Before splitting
  • After splitting
  • Try splitting on every column and keep the best one
  • The Overcast cases are all yes, so no further decision node is needed there
  • Repeat the previous steps to find the remaining nodes
  • Done
Advantages
  • Can be written as if-else rules, so it is highly interpretable
Step

Pruning
  • Chasing perfect classification can overfit and grow the tree very deep
  • Pre-pruning
    • Set conditions that stop growth once they are met
    • e.g. stop growing when a node holds fewer than a given number of samples
  • Post-pruning
    • Grow the full tree, then keep only the part you want

Code (CART)

import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv('./dataset/abalone.csv', header=None)

# Encode the categorical column and binarize the target at its mean
Mean = np.mean(df[8])
df[0] = pd.Categorical(df[0]).codes
df[8] = df[8].apply(lambda x: 0 if x > Mean else 1)

y = df[8]
x = df.drop(8, axis=1)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

scaler = preprocessing.StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

model = DecisionTreeClassifier()
model.fit(x_train, y_train)

y_pred = model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
num_correct_samples = accuracy_score(y_test, y_pred, normalize=False)
con_matrix = confusion_matrix(y_test, y_pred)

print('number of correct sample : {}'.format(num_correct_samples))
print('accuracy : {}'.format(accuracy))
print('con_matrix : {}'.format(con_matrix))
number of correct sample : 595
accuracy : 0.7117224880382775
con_matrix : [[318 118]
 [123 277]]

Naive Bayes

  • A probabilistic classifier
  • Built on conditional probability
  • Well suited to large amounts of data
  • Cost grows only linearly with the number of features
  • Commonly used for text classification

How the Naive Bayes Classifier Works

  • Whichever class has the higher probability wins
  • Written as a mathematical expression
  • After simplification
    • Whichever class has the larger probability is the prediction

Example

  • Want to know whether a sentence is about sports
  • Want to know whether "a very close game" is about sports
  • After the derivation
How to calculate each term?
  • $p(game|Sports) = \frac{2}{11}$
  • 11 : the Sports texts contain eleven words in total
  • 2 : "game" appears twice in the Sports texts
  • Try to avoid zero counts, because one zero makes the whole probability zero
Laplace smoothing
  • Avoids zero counts, which would zero out the conditional probability and render it meaningless
  • Add the number of distinct words in the whole corpus to the denominator
  • Add one to the numerator
  • With Laplace smoothing applied, the method is called Multinomial Naive Bayes
  • Start computing (see the sketch below)
  • Plug into the conditional probability expression
  • sports > not sports, so predict the sports class
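A sketch of the smoothed word probability; the vocabulary size of 14 is an assumed example figure, not taken from the note:

def word_prob(word_count, class_word_total, vocab_size):
    # Laplace smoothing: add 1 to the count, add the vocabulary size to the denominator
    return (word_count + 1) / (class_word_total + vocab_size)

# 'game' appears 2 times among the 11 Sports words; assume 14 distinct words overall
print(word_prob(2, 11, 14))  # (2 + 1) / (11 + 14) = 0.12
print(word_prob(0, 11, 14))  # an unseen word no longer gets probability 0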

Different probability assumption

  • Assume every feature follows a Gaussian (normal) distribution
Example
  • Use body measurements to predict male or female
  • Compute the mean and standard deviation
  • The bottom value is larger than the top one, so predict female

Code

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix

df = pd.read_csv("./dataset/titanic/train.csv")

# Encode categorical columns and drop unused ones
df['Sex'] = pd.Categorical(df['Sex']).codes
df['Embarked'] = pd.Categorical(df['Embarked']).codes
df = df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
df = df.dropna(axis=0, how='any')

y = df['Survived']
x = df.drop('Survived', axis=1)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

model = GaussianNB()
model.fit(x_train, y_train)

y_pred = model.predict(x_test)
print('Number of mislabeled points out of total {} points : {}, performance {:05.2f}%'
      .format(x_test.shape[0],
              (y_test != y_pred).sum(),
              100 * (y_test == y_pred).sum() / x_test.shape[0]))

accuracy = accuracy_score(y_test, y_pred)
num_correct_sample = accuracy_score(y_test, y_pred, normalize=False)
print('number of correct sample : {}'.format(num_correct_sample))
print('accuracy : {}'.format(accuracy))
Number of mislabeled points out of total 215 points : 55, performance 74.42%
number of correct sample : 160
accuracy : 0.7441860465116279

Random Forests

  • Ensemble learning
    • Combine multiple classifiers
  • Build many decision trees
  • The class with the most votes wins

Example

  • Randomly sample N rows to get many sub-datasets
  • For each sub-dataset, randomly pick k features
  • Build a decision tree (CART) on those k features
  • Doing this for every sub-dataset yields many decision trees
  • The class predicted by the majority of these n trees is the answer

Code

import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv('./dataset/seeds_dataset.csv', header=None)

y = df[7]
x = df.drop(7, axis=1).copy()

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

scaler = preprocessing.StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

model = RandomForestClassifier(max_depth=6, n_estimators=15)
model.fit(x_train, y_train)

y_pred = model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
num_correct_sample = accuracy_score(y_test, y_pred, normalize=False)

print('number_correct_sample : {}'.format(num_correct_sample))
print('accuracy : {}'.format(accuracy))
number_correct_sample : 39
accuracy : 0.9285714285714286

Support Vector Machine

  • Find a hyperplane that maximizes the margin between the data groups
  • H3 is best: it is farthest from the closest points of the two groups
  • support vector
    • The points of each class closest to the other class
  • support hyperplane
    • The hyperplane passing through a class's closest points
  • margin
    • The distance between the two support hyperplanes
  • $x_i$ : data features
  • $y_i$ : class label {+1, -1}



  • The norm is awkward in the denominator, so move it to the numerator and turn the maximization into a minimization
  • Hard margin
    • No point is allowed to cross the boundary at all
    • Finds a hyperplane fully separating the two classes (when one exists)
  • Soft margin
    • Points are allowed to cross the boundary
    • $\epsilon_i$ : how far a point crosses the boundary
    • C : penalty strength
      • Larger C tolerates crossing less
      • Smaller C tolerates crossing more
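The formulas referenced above were lost images; under the usual soft-margin formulation they read:

$\min_{w, b, \epsilon} \ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\epsilon_i \quad s.t. \quad y_i(w^\top x_i + b) \ge 1 - \epsilon_i, \ \epsilon_i \ge 0$

Forcing every $\epsilon_i = 0$ (no crossing allowed) recovers the hard-margin case.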

Linear v.s. nonlinear problems

SVM Kernel Trick
  • Use a function to lift the data to a higher dimension
  • In the higher dimension, separation becomes possible
Example
  • Two dimensions become three
  • The alternative (dual) optimization problem is easier to solve; its solution is then mapped back to the original problem
Common Kernel in SVM

Multi-class in SVM

  • With k classes
    • Method 1 : one-against-rest
      • Each SVM decides whether a sample belongs to its class
      • Produces k binary SVMs for classification
    • Method 2 : one-against-one

Code

import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

pima = pd.read_csv('./dataset/pima-indians-diabetes.csv')

df = pima[['pregnant', 'insulin', 'bmi', 'age', 'label']]
x = df[['pregnant', 'insulin', 'bmi', 'age']]
y = df['label']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)

scaler = preprocessing.StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

model = SVC(kernel='rbf')
model.fit(x_train, y_train)

y_pred = model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
num_correct_sample = accuracy_score(y_test, y_pred, normalize=False)

print('num_correct_sample : {}'.format(num_correct_sample))
print('accuracy : {}'.format(accuracy))
num_correct_sample : 160
accuracy : 0.6926406926406926

Clustering (Unsupervised Learning)

K-means

  • k : split the data into k clusters (unsupervised data has no labels, so the grouping is ours)
    1. Pick k data points at random as the k initial clusters
    2. Assign each remaining point to the nearest cluster
    3. Compute the new centroid of each cluster (the mean of its members)
    4. Reassign every point using the new centroids
    5. Repeat the steps above until the assignments converge and stop changing (see the sketch below)
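Before the sklearn version below, a minimal numpy sketch of that loop (no empty-cluster handling; the seed is fixed for reproducibility):

import numpy as np

def kmeans(x, k, iters=100):
    # 1. Pick k random data points as the initial centroids
    rng = np.random.default_rng(0)
    centroids = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        # 2. Assign each point to its nearest centroid
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each centroid as the mean of its members
        new_centroids = np.array([x[labels == j].mean(axis=0) for j in range(k)])
        # 4./5. Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels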
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

data = pd.read_csv('./dataset/xclara.csv')
print("Input Data and Shape")
print(data.shape)

f1 = data['V1'].values
f2 = data['V2'].values
x = np.array(list(zip(f1, f2)))

model = KMeans(n_clusters=3)
model = model.fit(x)

labels = model.predict(x)
centroids = model.cluster_centers_
print('centroids : {}'.format(centroids))
print('prediction on each data : {}'.format(labels))

labels = model.predict(np.array([[12.0, 14.0]]))
print('prediction on data point (12.0, 14.0) : {}'.format(labels))
Input Data and Shape
(3000, 2)
centroids : [[ 69.92418447 -10.11964119]
 [ 40.68362784  59.71589274]
 [  9.4780459   10.686052  ]]
prediction on each data : [2 2 2 ... 0 0 0]
prediction on data point (12.0, 14.0) : [2]

DBSCAN

  • Density-based spatial clustering of applications with noise
  • Uses density to decide which cluster each point belongs to
  • Advantage: very robust to noisy data
  • The dense blue region is grouped into one cluster, and likewise the red one

Terminology in DBSCAN

  • Density
    • The number of data points captured within a chosen radius
  • core point
    • Draw a circle of the chosen radius around a point; if it captures at least the configured minimum number of points, the point is a core point
  • border point
    • If the circle captures fewer than the minimum number of points but does capture a core point, the point is a border point
  • noise point
    • Neither a core point nor a border point

Directly Reachable

  • Points p and q are directly reachable if every point along the line between them is either a core point or a border point
  • A core point and its nearby border points end up in the same cluster

Example

  • Directly reachable points are grouped into one cluster

Problem

  • A badly chosen radius causes problems

Code

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

df = pd.read_csv('./dataset/iris.csv', header=None)
df = df.drop(4, axis=1)
print("Input Data and Shape")
print(df.head())
print(df.shape)

x = np.array(df)

model = DBSCAN(eps=0.3, min_samples=5).fit(x)
labels = model.labels_

# Points labelled -1 are noise; count the clusters that remain
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print('number of clusters : {}'.format(n_clusters))
print('cluster on x {}'.format(labels))
Input Data and Shape
     0    1    2    3
0  5.1  3.5  1.4  0.2
1  4.9  3.0  1.4  0.2
2  4.7  3.2  1.3  0.2
3  4.6  3.1  1.5  0.2
4  5.0  3.6  1.4  0.2
(150, 4)
number of clusters : 3
cluster on x [ 0  0  0  0  0 -1  0  0  0  0  0  0  0  0 -1 -1 -1  0 -1  0 -1  0 -1  0
  0  0  0  0  0  0  0 -1 -1 -1  0  0 -1  0  0  0  0 -1  0  0 -1  0  0  0
  0  0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1  1  2 -1
 -1 -1 -1 -1 -1 -1 -1 -1  1  1  1 -1 -1 -1 -1 -1  1  1 -1 -1  1 -1  1  1
  1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1  2  2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  2 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1  2]

EM

  • Expectation-Maximization
  • Iteratively finds the parameters of a probability model

Maximum Likelihood Estimation

  • Given a probability distribution, find its parameters
  • The product of many probabilities underflows numerically, so take the log
  • Estimate the average weight of the apples on a whole tree
  • Plug in the values
  • Take the log
  • The population mean is approximately the mean of a random sample (see the sketch below)
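A sketch of the idea with hypothetical apple weights: scan candidate means, score each by the Gaussian log-likelihood, and note that the winner is simply the sample mean (the assumed sigma only scales the curve):

import numpy as np
from scipy.stats import norm

weights = np.array([180.0, 200.0, 170.0, 190.0, 210.0])  # hypothetical apple weights (g)

def log_likelihood(mu, sigma=15.0):
    # Sum of log-probabilities of the sample under N(mu, sigma^2)
    return np.sum(norm.logpdf(weights, loc=mu, scale=sigma))

candidates = np.linspace(150, 230, 81)
best_mu = candidates[np.argmax([log_likelihood(m) for m in candidates])]
print(best_mu)          # 190.0
print(weights.mean())   # the MLE of a Gaussian mean is exactly the sample mean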

Likelihood Estimation Table

Example

  • Goal: split the data into two clusters
  • Randomly place two Gaussian curves
E step
  • At the very start the split is poor, since the Gaussians were placed at random
  • Compute each point's degree of membership in the yellow versus the blue Gaussian
  • Leftmost point : 70% yellow class, 30% blue class
  • Second from the left : 40% yellow, 60% blue
  • Third from the left : 10% yellow, 90% blue
M step
  • Update the parameters of the probability model (mean and standard deviation)
  • Compute the likelihood, update the mean and standard deviation, and draw the new Gaussians
Done

Drawbacks

  • The data may not follow a Gaussian normal distribution
  • Other distributions can be used instead

Different clustering methods

Dimension Reduction

  • Unsupervised
  • Compresses the data
    • Reduces time complexity
    • Reduces space complexity
    • Makes visualization easier

Illustration


Code

import pandas as pd
import numpy as np
from sklearn import mixture

df = pd.read_csv('./dataset/iris.csv', header=None)
df = df.drop(4, axis=1)
print('Input Data and Shape')
print(df.head())

x = np.array(df)

model = mixture.GaussianMixture(n_components=3).fit(x)
x_pred = model.predict(x)
print(x_pred)
Input Data and Shape
     0    1    2    3
0  5.1  3.5  1.4  0.2
1  4.9  3.0  1.4  0.2
2  4.7  3.2  1.3  0.2
3  4.6  3.1  1.5  0.2
4  5.0  3.6  1.4  0.2
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 0 2 0 2
 2 2 2 0 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0]

SVD

  • Singular-value decomposition
  • Factorizes a matrix into parts

Matrix rank

  • The first row minus the second row equals the third row
  • So the third row is redundant; only two rows are linearly independent

Example_1

  • (Couldn't follow this part; revisit.)


  • V : rotation matrix
  • $\sigma$ : scaling (squashes along some axes)
  • U : rotation matrix

Example_2

  • Ratings given to movies

  • Cut the smallest singular value together with its corresponding row or column

  • Still similar to the original matrix
  • Keep V
  • Turns five dimensions into two

How many singular values to keep?

  • Keep enough to retain at least 80% of the information (see the sketch below)
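A sketch with numpy's SVD on a hypothetical rating matrix, keeping enough singular values to retain roughly 80% of the squared-singular-value "energy":

import numpy as np

# Hypothetical user-by-movie rating matrix
ratings = np.array([[5, 5, 0, 1],
                    [4, 5, 1, 0],
                    [1, 0, 5, 4],
                    [0, 1, 4, 5]], dtype=float)

u, s, vt = np.linalg.svd(ratings, full_matrices=False)
print(s)  # singular values, largest first

# Keep the smallest k whose cumulative energy reaches 80%
energy = np.cumsum(s ** 2) / np.sum(s ** 2)
k = np.searchsorted(energy, 0.8) + 1
approx = u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]
print(k)
print(np.round(approx, 1))  # low-rank approximation close to the original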

PCA

  • Principal component analysis
  • Projects all the data onto perpendicular coordinate axes

Example

Which projection axis is better

  • Is it better to project onto the red line or the green line?
  • The red line is better: the projected points are more spread out
    • Less chance of two points overlapping
  • The projection onto the w1 axis should have the largest spread
  • The projection onto the w2 axis should spread out as much as possible
    • To avoid duplicating the first axis, the two axes must be perpendicular

PCA concept

  • Find the axis with the largest spread (variance)
  • Then find the next axis, ensuring its inner product with the previous axes is zero (see the sketch below)
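A minimal sketch of that procedure via the covariance matrix (sklearn's PCA, used in the code further below, does the same thing more robustly with SVD):

import numpy as np

def pca(x, n_components):
    # Center the data, then eigendecompose its covariance matrix
    x_centered = x - x.mean(axis=0)
    cov = np.cov(x_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # largest variance first
    components = eigvecs[:, order[:n_components]]
    # The eigenvectors are mutually orthogonal, satisfying the zero-inner-product condition
    return x_centered @ components

x = np.random.default_rng(0).normal(size=(100, 5))
print(pca(x, 2).shape)  # (100, 2)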
Covariance

Correlation

Covariance Matrix




Step

Example

Code (PCA)

import pandas as pd
import numpy as np
from sklearn.decomposition import PCA

df = pd.read_csv('./dataset/seeds_dataset.csv', header=None)
df = df.drop(7, axis=1)
print(df.head())

x = np.array(df)

pca = PCA(n_components=3)
pca.fit(x)
x_reduced = pca.transform(x)

print('singular values is {}'.format(pca.singular_values_))
print('after pca, all of data is reduced to 3D')
print(x_reduced)
       0      1       2      3      4      5      6
0  15.26  14.84  0.8710  5.763  3.312  2.221  5.220
1  14.88  14.57  0.8811  5.554  3.333  1.018  4.956
2  14.29  14.09  0.9050  5.291  3.337  2.699  4.825
3  13.84  13.94  0.8955  5.324  3.379  2.259  4.805
4  16.14  14.99  0.9034  5.658  3.562  1.355  5.175
singular values is [47.49531899 21.09635322  3.92284041]
after pca, all of data is reduced to 3D
[[ 6.63448376e-01 -1.41732098e+00  4.12356541e-02]
 [ 3.15666512e-01 -2.68922915e+00  2.31726953e-01]
 [-6.60499302e-01 -1.13150635e+00  5.27087232e-01]
 [-1.05527590e+00 -1.62119002e+00  4.37015260e-01]
 [ 1.61999921e+00 -2.18338442e+00  3.33990920e-01]
 [-4.76938007e-01 -1.33649437e+00  3.55360614e-01]
 [-1.84834720e-01 -1.50364411e-01  1.41497264e-01]
 [-7.80629616e-01 -1.12979883e+00  2.79757608e-01]
 [ 2.28210810e+00 -1.36001690e+00 -3.50729413e-01]
 [ 1.97854147e+00 -1.49468793e+00 -2.93947251e-03]
 [ 3.69122947e-01  8.86722511e-01  1.13264978e-01]
 [-7.11021200e-01 -2.10663730e+00  1.37552595e-01]
 [-1.21370535e+00  9.46878939e-02  4.85809237e-01]
 [-1.16908541e+00 -7.42962899e-01  2.58209340e-01]
 [-1.19272176e+00 -9.53268162e-01  2.58450639e-01]
 [-5.08171207e-01  3.77958424e-01  6.56217572e-01]
 [-1.37469698e+00  1.32290559e+00  8.01997838e-01]
 [ 1.05726438e+00 -2.01562875e+00  4.33972249e-01]
 [-1.50961097e-01 -2.02235813e+00  7.52022386e-01]
 [-2.46241293e+00  7.37473835e-02  2.12996833e-01]
 [-6.31332100e-01 -7.18305655e-01 -5.43487434e-02]
 [-6.89698660e-01 -1.11182531e+00 -1.69906624e-02]
 [ 1.40769072e+00 -2.80658086e+00  3.14889098e-01]
 [-2.84267672e+00 -2.66880642e+00 -5.42113806e-02]
 [ 4.33268215e-01 -1.88984464e+00  1.04814954e-01]
 [ 1.81289158e+00 -2.60002176e+00  5.34196985e-02]
 [-2.02131332e+00 -6.08743328e-01  1.84384422e-01]
 [-2.19571862e+00 -1.49837622e+00  2.35406301e-02]
 [-7.44468841e-01 -1.06518721e+00  1.56860725e-01]
 [-1.50350480e+00 -3.68206745e-01 -8.59614612e-03]
 [-1.52075320e+00 -3.06180225e+00 -1.73200953e-01]
 [ 7.61190256e-01 -2.09488759e-01  1.65627857e-01]
 [-7.67738428e-01  1.26295451e-01 -1.20622472e-01]
 [-8.23965933e-01 -1.70715020e+00  5.32025512e-02]
 [ 4.39542396e-01 -1.52858534e+00 -5.57773440e-02]
 [ 1.52205298e+00 -1.25609762e+00  1.35317730e-01]
 [ 1.65240525e+00 -6.75119440e-01 -3.33822258e-03]
 [ 2.47674445e+00 -4.51537548e-01  3.09238192e-01]
 [ 1.15750673e-02 -5.96250977e-01  3.52544529e-02]
 [-1.11443822e+00  2.83345206e+00  5.68536979e-01]
 [-1.37160170e+00 -1.30108591e+00  3.86783538e-02]
 [-1.36349513e+00 -1.63960124e+00  7.21418337e-03]
 [-1.88302954e+00 -1.51985066e+00  4.14988118e-01]
 [ 6.29560566e-01  1.10062048e+00  6.34515325e-03]
 [ 2.84412124e-01 -5.60641213e-01  2.98249761e-01]
 [-9.60044753e-01 -2.29723526e+00  1.41107629e-01]
 [ 8.18964617e-01 -2.26570276e+00  1.53990241e-01]
 [ 1.96621301e-01 -7.40666486e-01  2.29592441e-01]
 [ 1.53276057e-02 -1.02061146e+00  2.02590478e-01]
 [ 2.54235169e-01 -1.55022082e+00 -1.05705755e-01]
 [-5.05384531e-01  1.97773604e-01  1.75366006e-01]
 [ 7.11973095e-01  1.96593925e+00  5.15648832e-01]
 [-3.55829381e-01  3.79579753e-01 -1.55069874e-01]
 [-5.66856332e-01 -4.55332330e-01  8.98393269e-02]
 [ 1.80968891e-02 -2.21678367e+00 -3.93360924e-01]
 [ 4.78424858e-01 -1.71349178e+00 -1.93034894e-01]
 [-3.75464636e-01 -9.76631316e-01  3.10946012e-01]
 [ 2.83883316e-01 -2.56463849e+00  2.83654402e-01]
 [ 7.69429731e-01 -1.63154473e+00  1.52399604e-01]
 [-2.77110124e+00 -2.60034941e+00  2.33323254e-01]
 [-3.80344820e+00 -1.51695365e+00  2.37149111e-01]
 [-4.00534905e+00 -1.97086343e+00  2.03609501e-01]
 [-2.87823982e+00 -8.86757382e-01  4.63038181e-01]
 [-1.87406423e+00  2.13355284e-01  7.89854003e-02]
 [-2.05089330e+00 -2.82503499e+00  1.19739519e-01]
 [-2.16820614e+00 -1.67334585e+00  4.54792497e-01]
 [-2.59673795e-01 -2.44512020e+00 -6.00299972e-02]
 [-7.07084427e-01 -1.59064814e+00 -5.52933377e-02]
 [-2.37114217e-01 -2.28118595e+00 -1.49729134e-01]
 [-2.28478840e+00 -4.60103809e-01 -1.17460180e-01]
 [ 3.16409179e+00  8.04126878e-01 -2.66641999e-01]
 [ 2.20955077e+00  1.27850851e+00 -1.57740826e-01]
 [ 2.62062165e+00  1.18221241e+00  3.71401616e-02]
 [ 4.76779625e+00 -1.57604721e-01  9.29992692e-02]
 [ 2.21232336e+00  6.01138596e-01 -1.39544203e-01]
 [ 2.07181452e+00  1.50200244e+00 -7.00860345e-02]
 [ 2.84275994e+00  5.04000805e-01 -2.39983894e-01]
 [ 6.46235442e+00  1.60085825e+00 -1.52794770e-01]
 [ 4.47829765e+00  1.97530486e+00 -3.07050833e-01]
 [ 2.61486732e+00 -5.12991394e-01  2.02041872e-02]
 [ 1.67836371e+00  2.07292392e+00 -4.87254221e-02]
 [ 4.03755261e+00  2.14071720e+00  3.50841784e-01]
 [ 5.71859177e+00  2.21396348e+00  1.90304067e-01]
 [ 5.58807242e+00 -1.50990711e+00 -3.07129758e-01]
 [ 5.32256756e+00 -5.11477550e-02 -1.35205623e-01]
 [ 4.00732381e+00 -7.29979610e-01 -3.00704179e-01]
 [ 4.70505178e+00 -1.45419923e+00 -9.90969343e-02]
 [ 4.79038389e+00  6.44967998e-01 -5.69065929e-01]
 [ 6.69570804e+00  2.94421273e+00  3.03912945e-01]
 [ 6.46037602e+00  2.15264305e+00  2.01205274e-01]
 [ 6.14337744e+00 -9.43926428e-01 -4.15476808e-01]
 [ 4.39514760e+00 -1.61012778e-02 -1.54841426e-03]
 [ 4.46139621e+00  1.12624509e-01 -7.95827310e-02]
 [ 3.78494886e+00  2.79030096e+00  3.90137034e-01]
 [ 4.01629279e+00  1.80247983e+00 -6.82260736e-01]
 [ 2.38051551e+00  3.23431649e-01 -3.38653786e-01]
 [ 5.03718740e+00  4.35062776e-01 -1.48395998e-01]
 [ 4.92049767e+00 -8.97804759e-01 -6.05902714e-01]
 [ 3.94065862e+00 -3.15810290e-01 -4.91695711e-01]
 [ 4.53327546e+00 -9.29513362e-01 -2.00722187e-01]
 [ 1.65697361e+00  7.28440774e-01  1.41627453e-01]
 [ 3.63867456e+00 -1.18043005e+00  6.80825411e-02]
 [ 4.97852528e+00  1.24163180e+00  2.63041542e-01]
 [ 4.94143745e+00  3.05328003e-01 -2.47908931e-01]
 [ 4.63589248e+00  2.70940363e-01 -1.14295115e-01]
 [ 4.52405499e+00 -5.83403419e-01  1.38095195e-01]
 [ 4.51574435e+00 -2.71311394e-01 -8.74599219e-02]
 [ 3.12281151e+00  4.56212973e-01 -8.62751578e-02]
 [ 5.83137312e+00  3.30385849e-01 -4.76189658e-01]
 [ 4.35718398e+00 -1.41741787e+00 -6.25084495e-02]
 [ 4.15756069e+00 -9.50833185e-01  9.65051938e-02]
 [ 5.08259001e+00  6.24704579e-01  6.28827972e-02]
 [ 4.89139771e+00 -9.82917842e-01  1.28275950e-01]
 [ 4.44322173e+00  3.57224575e+00  1.56974171e-01]
 [ 6.67155248e+00  1.84060496e+00  9.12010421e-02]
 [ 4.90749190e+00 -8.18130733e-01 -2.55684480e-01]
 [ 4.37369427e+00  1.17679493e+00  4.41493473e-01]
 [ 4.87191916e+00  1.48790820e-02 -9.56013893e-02]
 [ 4.44858857e+00  5.06661089e-01  9.35385490e-02]
 [ 5.88457254e+00  1.27049077e-01 -1.80453388e-01]
 [ 5.68383523e+00  2.94064884e+00  2.60580298e-01]
 [ 3.70570300e+00  4.03190365e-01 -1.09288956e-01]
 [ 1.48852809e+00  7.87950654e-01 -8.22192480e-02]
 [ 3.98326071e+00 -2.15055765e-01  1.49554483e-01]
 [ 1.15533338e+00 -2.55675983e-01  5.98162419e-01]
 [ 4.23450934e+00  1.03173593e+00  1.64406843e-01]
 [ 4.21700578e+00  1.24930690e+00 -1.56980693e-01]
 [ 3.62299759e+00 -9.85552845e-01 -1.99064864e-02]
 [ 6.17386444e+00 -1.00395914e+00 -1.89332536e-01]
 [ 2.71378793e+00  2.00947941e+00  3.92678397e-01]
 [ 3.86089703e+00 -3.73513274e-01  8.03597573e-02]
 [ 4.61500274e+00 -2.10261231e-01  9.79900380e-02]
 [ 5.92114800e-01  8.66320178e-01 -3.00484271e-01]
 [ 1.48591457e+00  7.74439749e-01 -1.72450887e-01]
 [ 6.90660085e-01  1.38976691e+00 -1.68271877e-01]
 [ 5.30993584e-01 -4.14621634e-02  2.10574049e-01]
 [ 2.89265049e+00  2.11576887e-01 -2.20244499e-01]
 [ 1.10275966e+00 -8.95136177e-01 -5.29279898e-01]
 [ 1.08106342e+00 -8.23294917e-01 -3.54727287e-01]
 [ 1.58041453e+00  2.92679166e-01 -2.25284288e-01]
 [-2.08047727e+00  1.36508853e+00 -1.99716808e-01]
 [-2.04891061e+00  3.11005161e+00 -6.38273447e-02]
 [-1.93112897e+00  2.06805453e+00  3.42899489e-02]
 [-3.14761497e+00  1.38674593e+00 -2.02387013e-02]
 [-3.35755431e+00  3.62389743e-01 -2.78959609e-01]
 [-4.22241525e+00  1.97238736e+00 -3.44133543e-01]
 [-3.55211446e+00 -1.92651406e+00 -3.78402644e-01]
 [-2.74248901e+00  3.68275745e-01  9.39425325e-02]
 [-2.26028697e+00 -7.15757433e-01 -3.01514983e-01]
 [-4.59256683e+00  1.21366537e+00 -4.10485088e-01]
 [-3.49115131e+00  1.07925934e+00 -2.38194431e-01]
 [-3.44038060e+00  2.89298232e+00 -2.07485411e-01]
 [-2.88398331e+00  7.17984884e-01 -3.58805444e-01]
 [-3.96469498e+00 -8.67031460e-01 -2.74174690e-01]
 [-3.85801218e+00 -1.19622949e-01 -3.44490680e-01]
 [-4.23854664e+00  1.60807548e+00 -2.98762740e-01]
 [-3.89609307e+00 -8.50339426e-01 -6.58059699e-02]
 [-2.98607257e+00  7.68415914e-01 -3.42792920e-01]
 [-3.33747668e+00  2.84753275e-01 -5.22692614e-01]
 [-3.83091968e+00  1.23660236e+00 -3.79258301e-01]
 [-2.36752198e+00 -8.93936559e-01 -5.13938956e-01]
 [-3.15770694e+00  1.92519249e-01 -3.20706760e-01]
 [-3.23143479e+00  8.85496109e-01 -4.51321791e-02]
 [-2.61469342e+00  3.94920094e-01 -8.06497416e-02]
 [-4.49824136e+00  2.13615465e+00  6.33923291e-02]
 [-2.94329647e+00 -1.88566242e+00 -4.82086609e-02]
 [-2.76609970e+00  8.91774133e-01 -1.72181504e-01]
 [-2.89907369e+00 -4.09293999e-01 -4.02323115e-01]
 [-3.90252444e+00  1.58340894e-01 -2.77227460e-01]
 [-3.95458414e+00 -6.73045369e-01 -2.41546685e-01]
 [-4.52101870e+00  2.49812224e+00 -2.49745319e-01]
 [-4.04118364e+00  2.51581412e+00  1.29405374e-01]
 [-4.04672999e+00  1.00738290e-01 -9.09399508e-02]
 [-4.03387689e+00  1.39431425e+00 -9.26707967e-02]
 [-4.51655795e+00  9.40408717e-01 -4.06201269e-01]
 [-4.67880742e+00  4.91842458e-01 -5.80895287e-02]
 [-4.15214114e+00  1.12758416e+00 -1.65527388e-01]
 [-4.67134302e+00  4.21033845e-01 -1.71844958e-01]
 [-4.01783124e+00  1.67978706e+00  3.14971477e-03]
 [-2.60793947e+00 -2.37304506e+00 -3.55362015e-01]
 [-4.03445958e+00  7.40517346e-01  1.30612374e-01]
 [-2.84072766e+00  9.33512708e-01  5.46936118e-02]
 [-3.09276593e+00  7.75654915e-01 -5.61725952e-02]
 [-3.75632591e+00  1.04703805e+00 -4.29715405e-02]
 [-2.41503022e+00  2.20440679e+00 -8.67431609e-02]
 [-3.57450320e+00 -7.19443063e-02 -4.01721420e-01]
 [-3.37275950e+00  8.03901892e-01 -4.60390433e-01]
 [-4.43114913e+00 -7.75918129e-02 -1.41347439e-01]
 [-4.55059431e+00  3.26576882e+00  1.99485527e-01]
 [-5.00252227e+00  6.36766609e-01  1.71812058e-01]
 [-4.55827661e+00  1.13664276e+00 -9.51701336e-02]
 [-4.04374701e+00 -2.25814885e-01 -6.87186458e-02]
 [-3.36161741e+00 -5.27869965e-01 -4.74331964e-02]
 [-4.56094950e+00  5.95574083e-01 -2.83496804e-01]
 [-3.11855689e+00  3.31944329e-02  3.57435656e-02]
 [-2.52946420e+00  8.37109623e-01  3.64003383e-01]
 [-2.58655448e+00  1.44846001e+00  3.01071387e-01]
 [-1.83336074e+00  7.30693089e-01  2.15840207e-01]
 [-2.36059318e+00 -6.86820620e-01 -2.53316928e-01]
 [-2.35819688e+00 -1.20488410e+00  3.56048199e-01]
 [-2.97998867e+00  1.39806029e+00  1.31248700e-01]
 [-2.41873891e+00 -1.74952111e+00  4.09312427e-01]
 [-4.21924202e+00 -1.94251854e-01  1.18503747e-01]
 [-3.08869155e+00  4.37638669e+00  4.99058361e-01]
 [-2.78962325e+00 -1.41941777e-01  5.02954010e-02]
 [-3.04187227e+00 -4.73126171e-01  1.95045363e-01]
 [-4.10906270e+00  1.09340872e-01 -8.74005598e-02]
 [-2.50003394e+00  4.30796502e+00  5.32818431e-01]
 [-3.33207854e+00 -5.25289746e-01 -9.81079349e-02]
 [-3.10755116e+00  1.54975743e+00  1.21282793e-01]]

Kaggle Hands-On

https://www.kaggle.com/c/titanic

Goal

It is your job to predict if a passenger survived the sinking of the Titanic or not.
For each PassengerId in the test set, you must predict a 0 or 1 value for the Survived variable.

Metric

Your score is the percentage of passengers you correctly predict. This is known as accuracy.

Submission File Format

You should submit a csv file with exactly 418 entries plus a header row. Your submission will show an error if you have extra columns (beyond PassengerId and Survived) or rows.

The file should have exactly 2 columns:

PassengerId (sorted in any order)
Survived (contains your binary predictions: 1 for survived, 0 for deceased)

PassengerId,Survived
892,0
893,1
894,0
Etc.

Load the data

import pandas as pd

train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
combine = [train_df, test_df]

Inspect the data

Look at the header

print(train_df.columns.values)
train_df.info()
print('_'*40)
test_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

Generate descriptive statistics.

train_df.describe()

Scoring

round(model.score(X_train, Y_train) * 100, 2)

Export the CSV

submission = pd.DataFrame({
    "PassengerId": test_df["PassengerId"],
    "Survived": Y_pred
})
submission.to_csv('submission.csv', index=False)