# 機器學習
###### tags: `數據分析` `機器學習`

## 機器學習概論

### 依照輸入分類

#### Supervised Learning
* 輸入資料有答案(label)

#### Unsupervised Learning
* 輸入資料沒答案(label)
* Cluster unlabelled data
* 讓電腦把相似的資料分在一起,相異的放遠一點

#### Semi-supervised Learning
* 部分資料有答案,部分資料沒答案

#### Reinforcement Learning
* 給它沒答案的資料
* 如果答對給予正回饋,答錯就給負回饋

### 依照輸出分類
![](https://i.imgur.com/N589xjK.png)

#### Regression
* 預測的東西是一個連續的數值
* ex:
    * 股價預測
    * 職棒比賽,確切比數
* 預測精確數值,難度較高

#### Classification
![](https://i.imgur.com/xP3zCmi.png)
* 預測的東西是一個類別
* ex:
    * 股價漲或跌的預測
    * 職棒比賽,誰贏誰輸
* 不需要精確預測,難度較簡單

### Machine learning workflow
![](https://i.imgur.com/43WJTP7.png)

#### 資料輸入

#### 特徵工程

##### 資料缺失值
* 數字型的
    * 整筆資料不要用
    * 填平均值或中位數
* 類別型的
    * 取眾數
    * 設立一個其他類別

##### 極端值
* 刪除極端值

##### Split data
* Training data
    * 訓練資料
    * 通常80%
    * 通常大於testing data
* Testing data
    * 測試資料
    * 通常20%
    * 通常小於training data

##### Normalization
![](https://i.imgur.com/QeZI0Rz.png)
* 欄位數量級差太多時,學到的參數的數量級也會相差非常大
* 讓每個欄位的值壓到0跟1之間,或轉成平均0、標準差1
* 常用方法
    * min_max
        * [0, 1]
        * $\dfrac{x_i - x_{min}}{x_{max} - x_{min}}$
    * Z-Score Standardization
        * 平均0、標準差1(多數值會落在-1跟1附近)
        * $\dfrac{x_i - \mu}{\sigma}$

#### Select model
![](https://i.imgur.com/YOQytfe.png)

##### Validate trained model - regression
* 拿來評量模型表現好不好

###### MSE
* $MSE = \dfrac{1}{n}\sum\limits_{i = 1}^n{(f_i - y_i)^2}$
    * $f_i$: 預測值
    * $y_i$: 實際答案
    * $n$: 數值數量
* 值越小,代表誤差越小

###### R squared(coefficient of determination)
* $R^2 = 1 - \dfrac{\sum\limits_{i = 1}^n{(y_i - f_i)^2}}{\sum\limits_{i = 1}^n{(y_i - \bar{y})^2}}$
    * $y_i$: 實際答案
    * $f_i$: 預測值
    * $\bar{y}$: 答案的平均
* 比較預測誤差跟答案本身的離散程度,衡量模型解釋了多少變異
* 越接近1代表表現越好,值越小表現越爛

##### Validate trained model - classification

###### $accuracy = \dfrac{\text{\# of correct predictions}}{\text{\# of data}}$
![](https://i.imgur.com/R55VtfB.png)

###### $accuracy = \dfrac{TP + TN}{TP + FP + FN + TN}$
![](https://i.imgur.com/JTKUISt.png)
* True Positive (TP)「真陽性」:真實情況是「有」,模型說「有」的個數。
* True Negative(TN)「真陰性」:真實情況是「沒有」,模型說「沒有」的個數。
* False Positive (FP)「偽陽性」:真實情況是「沒有」,模型說「有」的個數。
* False Negative(FN)「偽陰性」:真實情況是「有」,模型說「沒有」的個數。
* $Precision = \dfrac{TP}{TP + FP}$
* $Recall = \dfrac{TP}{TP + FN}$

###### F1-score
* $F1 = 2 * \dfrac{1}{\dfrac{1}{recall} + \dfrac{1}{precision}} = 2 * \dfrac{precision * recall}{precision + recall}$
* 越接近1表現越好,越接近0越差

###### ROC
* 針對不同的門檻值去畫出ROC curve
* AUC(area under curve)
    * AUC=0.5 (no discrimination 無鑑別力)
    * 0.7≦AUC≦0.8 (acceptable discrimination 可接受的鑑別力)
    * 0.8≦AUC≦0.9 (excellent discrimination 優良的鑑別力)
    * 0.9≦AUC≦1.0 (outstanding discrimination 極佳的鑑別力)

###### Confusion matrix
![](https://i.imgur.com/lU9P2oi.png)
* 把二元分類推廣到多元分類
* 對角線上是猜對的

##### Bias-Variance Tradeoff
![](https://i.imgur.com/D34ny7m.png)
* Variance : 預測穩不穩定(分散程度)
* Bias : 預測偏不偏,是否集中在正確答案附近

![](https://i.imgur.com/JHOkdaF.png)
* 選擇的模型太簡單(變數太少之類的),會落到high bias的區域
* 選擇的模型太複雜,會落到high variance的區域,會不穩定
    * overfitting

##### No free lunch theorem
![](https://i.imgur.com/gCZkIbJ.png)
* 沒有任何一個演算法可以勝過其他所有的演算法
* 根據不同問題有不同的解法

#### 輸出

### 應用領域
* 自動駕駛
* 支付
    * 人臉辨識支付
* 醫療應用
    * 腫瘤偵測
* 機器人理財
* 無人機
    * 自動避障
* 預測推銷
    * youtube推薦演算法
* 智慧工廠
* 語音智慧助理
    * siri

## Regression Supervised Learning

### Linear Regression
* 依現有的data,找到一條趨勢線,預測未知的data
* 線性解

#### Case

##### Suppose 房價只跟屋齡有關
![](https://i.imgur.com/SQhNMib.png)

##### 可以依照data畫出多條趨勢線
![](https://i.imgur.com/9vePYG8.png)

##### 利用cost function找到最適合(誤差最小)的趨勢線
$Cost = \sum\limits_{i = 1}^m{(y_i - \hat{y}_i)^2}$,其中 $\hat{y}_i$ 為趨勢線的預測值
![](https://i.imgur.com/qmAoam6.png)

##### 微積分求解
![](https://i.imgur.com/QOhpvo4.png)

#### Multivariate Linear Regression Models
![](https://i.imgur.com/CqqWjA3.png)
![](https://i.imgur.com/WrnknOZ.png)
* 多變量時變數太多,難以直接用公式求解
* 通常用gradient descent(下方附一個簡單的數值示意)

##### Gradient Descent
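* 補充示意:下面是一個最小的批次梯度下降實作(筆記自行補充,非課程原始碼;假設cost為MSE,learning rate與迭代次數都是自訂參數),實務上可以直接用sklearn的`LinearRegression`或`SGDRegressor`,這段只是把「沿著梯度反方向更新參數」的概念寫出來
```python=
import numpy as np

# 示意用的批次梯度下降(假設cost為MSE)
def gradient_descent(X, y, lr=0.01, n_iters=5000):
    m, n = X.shape
    X_b = np.hstack([np.ones((m, 1)), X])  # 加上截距項 x0 = 1
    theta = np.zeros(n + 1)                # 參數初始化為0
    for _ in range(n_iters):
        error = X_b @ theta - y            # 預測值 - 實際值
        grad = (2 / m) * X_b.T @ error     # MSE對theta的梯度
        theta -= lr * grad                 # 往梯度反方向更新參數
    return theta

# 小測試:y = 1 + 2x,理論上會學到 theta 約為 [1, 2]
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 1 + 2 * X.ravel()
print(gradient_descent(X, y))
```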
![](https://i.imgur.com/aZgSZav.png) * 只能找區域最小值 * learning rate太小會算很久,太大找不到最佳值,通常用0.1 到 1 之間 ![](https://i.imgur.com/xIvJGHN.png) ##### Gradient Descent V.S. Normal Equation ![](https://i.imgur.com/b4SzMwG.png) * Gradient Descent * learning rate要剛剛好 * 要迭代很多次 * n大時會比較好(資料量大) * Normal Equation * 不用選learning rate * 要計算矩陣逆運算,較麻煩 * n大時算很慢(因為矩陣逆運算) ##### Overfitting * 最左(underfitting) * 無法找到理想趨勢 * 中間(理想狀況) * 最右(overfitting) * 在train data上fit的很好,但有新東西就gg了 * 像死讀書,刷考古題的學生 * 變數越多越容易overfitting ![](https://i.imgur.com/rwBfkAg.png) #### Code ```python= from sklearn import preprocessing, linear_model from sklearn.metrics import mean_squared_error, r2_score from sklearn.model_selection import train_test_split import numpy as np import pandas as pd # 資料輸入 df = pd.read_csv('./dataset/housing.csv', header = None, delim_whitespace=True) # 答案取出 y = df[13] x = df.drop(13, axis = 1) # Split data x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.1) # Normalization scaler = preprocessing.StandardScaler().fit(x_train) x_train = scaler.transform(x_train) x_test = scaler.transform(x_test) # Model Select model = linear_model.LinearRegression() model.fit(x_train, y_train) # Predict y_pred = model.predict(x_test) print('Cofficient : {}'.format(model.coef_)) print('Mean squared error : {}'.format(mean_squared_error(y_test, y_pred))) print('Variance score : {}'.format(r2_score(y_test, y_pred))) ``` ``` Cofficient : [-1.02554944 0.96553896 0.16729159 0.58076865 -2.02969655 2.57536082 0.17046044 -2.84987298 2.50778431 -1.85852862 -2.05829633 0.82864609 -3.86813384] Mean squared error : 34.403276579602064 Variance score : 0.6739078917414478 ``` ### Polynomial Regression * 非線性解 * 低維度變高維度 #### nth-degree Polynomial Regression * n次多項式,當作多個feature ![](https://i.imgur.com/urCAN7u.png) * 其他model ![](https://i.imgur.com/67Hy72H.png) #### Graph ![](https://i.imgur.com/jWV1dd9.png) #### 交叉項(cross term) * 兩變數有加減乘除的關係(多為乘法) ![](https://i.imgur.com/1YOzhWH.png) * 原本兩個feature,經由相乘或是次方,多出很多個feature ![](https://i.imgur.com/6HUg0xS.png) ```python= import numpy as np import pandas as pd from sklearn import linear_model, preprocessing from sklearn.model_selection import train_test_split from sklearn.metrics import mean_squared_error, r2_score from sklearn.preprocessing import PolynomialFeatures # 匯入檔案 df = pd.read_csv('./dataset/winequality-red.csv') # 處理answer and data y = df['quality'] x = df.drop('quality', axis = 1) # 產生degree 為 2 的feature poly = PolynomialFeatures(degree = 2).fit(x) x = poly.transform(x) # Split data x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 1) # Normalization scaler = preprocessing.StandardScaler().fit(x_train) x_train = scaler.transform(x_train) x_test = scaler.transform(x_test) # Select model model = linear_model.LinearRegression() model.fit(x_train, y_train) # Predict y_pred = model.predict(x_test) # 查看係數 print('The coefficient : {}\n'.format(model.coef_)) print('Mean squared error : {}'.format(mean_squared_error(y_test, y_pred))) print('Variance score : {}'.format(r2_score(y_test, y_pred))) ``` ``` The coefficient : [-2.20439093e-11 -3.32036477e+01 -3.46762494e+01 -1.98569253e+01 -1.40652656e+01 -6.85273381e+01 -7.74114997e+01 1.00727869e+02 -2.50880009e+01 -7.93478035e+01 4.63945174e+01 -6.70038809e+00 -1.07912073e+00 -4.28103408e-01 -1.80118094e-01 -5.27741073e-01 -9.30638006e-01 -6.23191815e-01 5.34638897e-01 3.69394013e+01 -1.63849368e+00 6.25798078e-01 6.86240624e-02 -1.07543978e-01 3.25708843e-02 -6.49414077e-02 2.15989372e-01 
-5.66509424e-02 3.43575308e-01 3.51401508e+01 -1.05497188e+00 -2.55960941e-01 8.98178254e-01 -3.09937110e-02 9.34787505e-02 8.98840857e-02 1.48808583e-01 -3.20881226e-02 2.21970228e+01 -3.13608092e+00 -3.12173913e-01 1.11046257e+00 -1.59627758e-01 -2.03532456e-01 3.43111110e-02 -6.51049711e-04 1.79109779e+01 -2.69409190e+00 1.29726359e-02 -3.65247752e-01 6.44755908e-02 -5.20131174e-02 -1.18159029e-01 6.92374227e+01 -5.56852862e-01 1.09112468e-01 5.56498217e-01 -1.63840021e-01 -3.70231215e-02 7.79292790e+01 -5.47020463e-01 -4.70828794e-01 1.29740195e+00 8.93119391e-02 -1.01689650e+02 1.28353602e+00 4.09945512e-01 -1.61154603e+00 2.10910693e+01 8.08256509e+01 -4.72908479e+01 4.26542501e+00 -2.92688018e+00 1.05074947e+00 2.71306226e+00 -3.52447248e-01 7.49136902e-02 -1.16178316e-01] Mean squared error : 0.4308785588449687 Variance score : 0.29206509289757054 ``` ### Logistic Regression * X : feature * $\theta$ : 參數 ![](https://i.imgur.com/G8PGmq1.png) #### Sigmoid function * Output is [0, 1] * model的輸出常被拿來當作機率 ![](https://i.imgur.com/H0DiBSV.png) #### Cross-entropy)(Cost Function)補) * 給電腦兩個機率分佈,分析兩個機率分佈相不相似 * 兩機率分佈幾乎一樣,算出來越小 $C(\theta_0, \theta_1, ... , \theta_n) = \dfrac{1}{2m} \sum\limits_{i = 1} ^ m{}$ #### Information(資訊含量) * pi : 發生某事的機率 * Define Information : $log({\dfrac{1}{pi}})$ * 當某事機率越大,資訊含量越低 * 太陽從東邊升起的機率是100 % * Information = $log({\dfrac{1}{1}}) = 0$ * 當某事機率越小,資訊含量越高 * 明天下雨的機率是50 % * Information = $log({\dfrac{1}{\dfrac{1}{2}}}) = 0.3$ #### Entropy V.S. Cross-entropy ##### Entropy * 資訊含量的期望值 * 資訊含量越不確定時越高,越確定時越低 ![](https://i.imgur.com/E0NRmby.png) ##### Cross-entropy * 輸入兩個機率分佈 * 衡量兩個機率分佈相不相近 ![](https://i.imgur.com/AAxFM59.png) #### Example ![](https://i.imgur.com/KMHIjkk.png) * Cross entropy 通常大於entropy ![](https://i.imgur.com/dOvjGCT.png) #### Cost Function(cross-entropy)(問號) * h 與 1 - h都是機率分佈 * y 跟 1 - y,非一即零, ![](https://i.imgur.com/QyZi1iX.png) #### Learning * Gradient descent ### Code ```python= import numpy as np import pandas as pd from sklearn import preprocessing, linear_model from sklearn.model_selection import train_test_split from sklearn.linear_model import LogisticRegression pima = pd.read_csv('./dataset/pima-indians-diabetes.csv') #x = pima[['pregnant', 'insulin', 'bmi', 'age']] y = pima['label'] x = pima.drop(['label'], axis = 1) x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 1) scaler = preprocessing.StandardScaler().fit(x_train) x_train = scaler.transform(x_train) x_test = scaler.transform(x_test) model = LogisticRegression() model.fit(x_train, y_train) print(model.coef_) print(model.intercept_) y_pred = model.predict(x_test) print(y_pred) accuracy = model.score(x_test, y_test) print(accuracy) ``` ``` [[ 0.33498141 1.03029784 -0.30681406 -0.02318156 -0.07418398 0.67294487 0.2001099 0.20383277]] [-0.90342628] [0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 1 0 1 0 1 0 1 0 1 0 0 0 0 0 1 1 1 1 1 0 1 0 1 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0 0 1 1 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 1 1 1 1 1 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0] 0.7835497835497836 ``` ## Classification Supervised Learning ### K-Nearest Neighbor * 可同時classification跟regression * 通常用來classification * 無變數的 * 只需要決定k值 * k : 有多少鄰近的類別 * k = 1 * 最近的一個是正方形,所以他是正方形類別 * k = 3 * 最近的三個是一個正方刑跟兩個三角形,三角形較多,故為三角形類別 * k = 7 * 
最近的七個是四個正方刑跟三個三角形,正方形較多,故為正方形類別 ![](https://i.imgur.com/fbfCRso.png) #### Step 1. Look at data * 把原本的資料做好分類 ![](https://i.imgur.com/0GLcNrl.png) 2. Caculate distances * 將每筆data跟所求data的距離求出來 ![](https://i.imgur.com/LcWHE8C.png) 3. Find neighbors * 找最近的k個資料 ![](https://i.imgur.com/77tBvJM.png) 4. Vote from labels * 最後做投票 * 看哪一個類型最多 ![](https://i.imgur.com/yetpiek.png) #### How to Define Distance ![](https://i.imgur.com/DQtJJp9.png) * Manhattan distance(L1 distance) ![](https://i.imgur.com/ysa4cWO.png) * Euclidean Distance(L2 distance) ![](https://i.imgur.com/HpQO3lY.png) #### How to choose K * 通常取奇數 * 必可以選出來,不會有相等的狀況 * K is small * 易受不好的資料影響 ![](https://i.imgur.com/jdZLnXa.png) * K is large * 不小心預測出多數的類別 * 永遠預測到同一個類別 ![](https://i.imgur.com/prMTU9i.png) #### Curse of dimensionality * 利用次方增維 ![](https://i.imgur.com/kr6pI1r.png) * 在高維度較容易分類 ![](https://i.imgur.com/RMGjZmB.png) * 如果太高維,會造成overfitting #### Code ```python= import pandas as pd import numpy as np from sklearn import preprocessing, datasets, neighbors from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score df = pd.read_csv('./dataset/seeds_dataset.csv', header = None) y = df[7] - 1 x = df.drop(7, axis = 1) x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2) scaler = preprocessing.StandardScaler().fit(x_train) x_train = scaler.transform(x_train) x_test = scaler.transform(x_test) model = neighbors.KNeighborsClassifier(n_neighbors = 5) model.fit(x_train, y_train) y_pred = model.predict(x_test) accuracy = accuracy_score(y_test, y_pred) num_correct_samples = accuracy_score(y_test, y_pred, normalize = False) print('number of correct samples : {}'.format(num_correct_samples)) print('accuracy : {}'.format(accuracy)) ``` ``` number of correct samples : 39 accuracy : 0.9285714285714286 ``` ### Decision Tree * 把決策過程變成一棵樹的結構 ![](https://i.imgur.com/i3qzTEM.png) #### How to split on each node? ![](https://i.imgur.com/sB7ooB2.png) #### How to define a good split? 
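* 補充示意:先用一小段自己編的小例子,把「gain = 分割前 - 分割後」量化一次(筆記自行補充,這裡假設用下面CART小節會介紹的Gini impurity當指標)
```python=
from collections import Counter

def gini(labels):
    # Gini impurity = 1 - sum(p_k^2),越小代表這個節點越「純」
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(parent, children):
    # gain = 分割前的Gini - 分割後(加權平均)的Gini,越大代表分割越好
    n = len(parent)
    after = sum(len(c) / n * gini(c) for c in children)
    return gini(parent) - after

# 假設10筆資料,5個Yes、5個No,用兩種特徵各切成兩堆
parent = ['Yes'] * 5 + ['No'] * 5
split_A = [['Yes', 'Yes', 'Yes', 'No', 'No'], ['Yes', 'Yes', 'No', 'No', 'No']]  # 分得不乾淨
split_B = [['Yes', 'Yes', 'Yes', 'Yes', 'No'], ['Yes', 'No', 'No', 'No', 'No']]  # 分得比較乾淨
print(gini_gain(parent, split_A))  # ≈ 0.02
print(gini_gain(parent, split_B))  # ≈ 0.18,gain較大,B是比較好的分割
```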
* 決策效果極明顯的,比如如果A就一定Yes,反之則No,則A就是一個好的node * gain = 分割前 - 分割後 * gain越大越好 #### CART * Classification and Regression Trees(CART) * Binary tree * Gini為分割良好指標 ![](https://i.imgur.com/pqiPexP.png) ##### Gini Impurity * 公式 ![](https://i.imgur.com/O1eExNH.png) * 越勢均力敵時,gini越大,反之,gini越小 * gini越小越好 ![](https://i.imgur.com/64rVGmP.png) ![](https://i.imgur.com/j7AeaRS.png) ##### Exampl_1 * 分割後,求出gini的期望值 ![](https://i.imgur.com/CtVkR8c.png) * A * 0.5 - 0.4852 = 0.015 * B * 0.5 - 0.37 = 0.13 * B較佳 ![](https://i.imgur.com/yJ3RmWN.png) ##### Example_2 * 遇到數值型的 * 取平均當分界線(在這邊是80) ![](https://i.imgur.com/v6jfRiP.png) ![](https://i.imgur.com/TDdH5HL.png) #### ID3 * Iterative Dichotomiser 3(ID3) * 可以有多個分支 ![](https://i.imgur.com/blq6Xac.png) ##### Entropy * 公式 ![](https://i.imgur.com/b1rWiaI.png) * 越勢均力敵,entropy越大,反之,entropy越小 * Entropy 越小越好 ![](https://i.imgur.com/dYBiBYF.png) ![](https://i.imgur.com/8c2sST3.png) ##### Example_1 * 分割後,求出entropy的期望值 ![](https://i.imgur.com/rrf9zCE.png) * A * 0.301 - 0.294 = 0.007 * B * 0.301 - 0.242 = 0.069 * B較佳 ![](https://i.imgur.com/RFKTTjs.png) ##### Example_2 * Before splitting ![](https://i.imgur.com/Mz4ziQp.png) * After splitting ![](https://i.imgur.com/8TmtN4y.png) * 對所有欄位都split看看,取最好的 ![](https://i.imgur.com/tmPhc8B.png) * Overcast 的狀況都會是yes,所以就不需要決策node了 ![](https://i.imgur.com/r7rxeAF.png) * 重複前面的動作,找出其他的節點 ![](https://i.imgur.com/B0pbAUx.png) * 完成 ![](https://i.imgur.com/5hyNXAB.png) ##### 優點 * 寫成ifelse,可解釋性高 ##### Step ![](https://i.imgur.com/AzuQNRg.png) ##### Pruning * 為了分類,可能會overfitting,決策樹長很深 * Pre-prunung * 設一些條件,如果達到條件就不往下長 * 如果node所含的資料少於特定大小,就不繼續往下長 * Post-prunung * 長完後,取自己想要的部分 #### Code(CART) ```python= import pandas as pd import numpy as np from sklearn import preprocessing, neighbors from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score, confusion_matrix from sklearn.tree import DecisionTreeClassifier df = pd.read_csv('./dataset/abalone.csv', header = None) Mean = np.mean(df[8]) df[0] = pd.Categorical(df[0]).codes df[8] = df[8].apply(lambda x : 0 if x > Mean else 1) y = df[8] x = df.drop(8, axis = 1) x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2) scaler = preprocessing.StandardScaler().fit(x_train) x_train = scaler.transform(x_train) x_test = scaler.transform(x_test) model = DecisionTreeClassifier() model.fit(x_train, y_train) y_pred = model.predict(x_test) accuracy = accuracy_score(y_test, y_pred) num_correct_samples = accuracy_score(y_test, y_pred, normalize = False) con_matrix = confusion_matrix(y_test, y_pred) print('number of correct sample : {}'.format(num_correct_samples)) print('accuracy : {}'.format(accuracy)) print('con_matrix : {}'.format(con_matrix)) ``` ``` number of correct sample : 595 accuracy : 0.7117224880382775 con_matrix : [[318 118] [123 277]] ``` ### Naive Bayes * 機率分類器 * 條件機率 * 資料量大時很適合用Naive Bayes * 特徵越多會呈線性成長 * 文本分類 #### How Naive Bayes Classifier Work * 看哪邊機率大,就屬於那一邊 ![](https://i.imgur.com/X59Vvz4.png) * 轉成數學式 ![](https://i.imgur.com/j136N4A.png) * 簡化後 * 看哪邊機率比較大,就屬於哪一類 ![](https://i.imgur.com/6fvhfNO.png) #### Example * 想知道句子是否跟運動有關 ![](https://i.imgur.com/SPylVxK.png) * 想知道"a very close game"是否跟運動有關 ![](https://i.imgur.com/XsT8MB4.png) * 推導後 ![](https://i.imgur.com/ONfKSnR.png) ##### How to calculate term? 
* $p(game|Sports) = \dfrac{2}{11}$ * 11 : 在sports的條件下,有十一個字 * 2 : 在sports的條件下,出現過2次games * 竟量不要有0的狀況,因為這樣機率直接變零了 ![](https://i.imgur.com/PgMAXUl.png) ##### Laplace smoothing * 為了避免出現機率0的狀況,導致條件機率為零,導致無意義的狀況發生 * 分母加整個文本有多少不同的字 * 分子加一 * 經過Laplace smoothing的算法後,就稱之為Multinomial Naive Bayes ![](https://i.imgur.com/FVXfW93.png) * 開算囉 ![](https://i.imgur.com/fYAB0SA.png) * 帶入條件機率的式子 * sports > not sports,所以猜他是sports類別 ![](https://i.imgur.com/RH92aCc.png) #### Different probability assumption * 假設機率都是高斯的常態分佈 ![](https://i.imgur.com/uMTWAjR.png) ##### Example * 利用一些體態,預測是男生還是女生 ![](https://i.imgur.com/YhHOk3K.png) * 算平均與標準差 ![](https://i.imgur.com/a8SovYP.png) * 下大於上,故猜測是female ![](https://i.imgur.com/yGOIwnD.png) #### Code ```python= import numpy as np import pandas as pd import time from sklearn import preprocessing from sklearn.model_selection import train_test_split from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB from sklearn.metrics import accuracy_score, confusion_matrix df = pd.read_csv("./dataset/titanic/train.csv") df['Sex'] = pd.Categorical(df['Sex']).codes df['Embarked'] = pd.Categorical(df['Embarked']).codes df = df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis = 1) df = df.dropna(axis = 0, how = 'any') y = df['Survived'] x = df.drop('Survived', axis = 1) x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3) model = GaussianNB() model.fit(x_train, y_train) y_pred = model.predict(x_test) print('Number of misabeled points out of total {} points : {}, performance {:05.2f}%' .format( x_test.shape[0], (y_test != y_pred).sum(), 100 * (1 - y_test != y_pred).sum() / x_test.shape[0]) ) accuracy = accuracy_score(y_test, y_pred) num_correct_sample = accuracy_score(y_test, y_pred, normalize = False) print('number of correct sample : {}'.format(num_correct_sample)) print('accuracy : {}'.format(accuracy)) ``` ``` Number of misabeled points out of total 215 points : 55, performance 74.42% number of correct sample : 160 accuracy : 0.7441860465116279 ``` ### Random Forests * ensemble learning * 把多個分類器並在一起 * 建立多棵dicision tree * 看哪一個類別多,他就是那個類別 ![](https://i.imgur.com/bsc1VVt.png) #### Example * random 選取N個row,得到許多sub_data ![](https://i.imgur.com/2xZQuiM.png) * 針對每個data,在隨機選取k個特徵 ![](https://i.imgur.com/SYQ3Zog.png) * 將這k個特徵,拿去做dicision tree(CART) ![](https://i.imgur.com/nIDgrwT.png) * 每筆sub_data都這樣做,會得到很多dicision tree ![](https://i.imgur.com/m34aa6l.png) * 接著看這n棵樹,判斷哪一個類別比較多,他就是那個類別 #### Code ```python= import pandas as pd import numpy as np from sklearn import preprocessing, neighbors, datasets from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score, confusion_matrix from sklearn.ensemble import RandomForestClassifier df = pd.read_csv('./dataset/seeds_dataset.csv', header = None) y = df[7] x = df.drop(7, axis = 1).copy() x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2) scaler = preprocessing.StandardScaler().fit(x_train) x_train = scaler.transform(x_train) x_test = scaler.transform(x_test) model = RandomForestClassifier(max_depth = 6, n_estimators =15) model.fit(x_train, y_train) y_pred = model.predict(x_test) accuracy = accuracy_score(y_test, y_pred) num_correct_sample = accuracy_score(y_test, y_pred, normalize = False) print('number_correct_sample : {}'.format(num_correct_sample)) print('accuracy : {}'.format(accuracy)) ``` ``` number_correct_sample : 39 accuracy : 0.9285714285714286 ``` ### Support Vector Machine * 找到一個超平面,讓資料間margin最大 * H3最好,離兩個資料群最近的點最遠 
![](https://i.imgur.com/uPW7T07.png) * supprot vector * 兩個類別最近的點 * support hyperlane * 通過兩個類別最近的點的平面 * margin * 兩條supprt hyperlane的距離 ![](https://i.imgur.com/svdVDaH.png) * xi : 資料特徵 * yi : 類別{+1, -1} ![](https://i.imgur.com/w0kdU3O.png) ![](https://i.imgur.com/ewqOJ4e.png) ![](https://i.imgur.com/5nQIOwo.png) ![](https://i.imgur.com/0MjPZiX.png) * 在分母很難處理,把它變到分子,求最大值改成求最小值 ![](https://i.imgur.com/Mz3KtjT.png) * Hard cost * 百分之百不允許有人越界 * 一定會找到一個平面把兩個類別分開來 * soft cost * 允許有人越界 * $\epsilon_i$ : 越界多少,越界的程度 * C : 懲罰程度 * 越大代表越不允許越界 * 越小代表越允許越界 ![](https://i.imgur.com/TlHhggt.png) #### Linear v.s. nonlinear problems ![](https://i.imgur.com/uSti0Mk.png) ##### SVM Kernel Trick * 用某種函數把維度提升 * 提高維度就有可能分類 ![](https://i.imgur.com/o2JoYHz.png) ##### Example * 二維變三維 ![](https://i.imgur.com/wNd5zsI.png) * 另外一個最佳化問題比較好算,才返回去算原來的最佳化問題 ![](https://i.imgur.com/ESgH58P.png) ##### Common Kernel in SVM ![](https://i.imgur.com/CZMhWQO.png) #### Multi-class in SVM * 如果有kclass * Method 1 :one-against-rest * 每一個svm判斷是不是該類別 * 產生k個二元svm來做分類 ![](https://i.imgur.com/uE73hCQ.png) * Method 2 :one-against-one * 每一個svm判斷是第m個還是第n個類別 * 共會有$\dfrac{k(k - 1)}{2}$種 ![](https://i.imgur.com/TGS3FT4.png #### Code ```python= import pandas as pd import numpy as np from sklearn import preprocessing, datasets, linear_model from sklearn.model_selection import train_test_split from sklearn.svm import SVC from sklearn.metrics import accuracy_score, confusion_matrix pima = pd.read_csv('./dataset/pima-indians-diabetes.csv') df=pima[['pregnant', 'insulin', 'bmi', 'age', 'label']] x=df[['pregnant', 'insulin', 'bmi', 'age']] y=df['label'] x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 1) scaler = preprocessing.StandardScaler().fit(x_train) x_train = scaler.transform(x_train) x_test = scaler.transform(x_test) model = SVC(kernel = 'rbf') model.fit(x_train, y_train) y_pred = model.predict(x_test) accuracy = accuracy_score(y_test, y_pred) num_correct_sample = accuracy_score(y_test, y_pred, normalize = False) print('num_correct_sample : {}'.format(num_correct_sample)) print('accuracy : {}'.format(accuracy)) ``` ``` num_correct_sample : 160 accuracy : 0.6926406926406926 ``` ## Classification Unsupervised Learning ### K-means * k : 把這些資料分成k個類別(因為unsupervised沒類別的概念,所以要自己分類) * 1. * 隨便取k比資料,分成k類 * 再將相近的資料,歸類為該類別 * 2. * 算出新的重心 * 3. * 找出同類別的資料的重心 * 4. * 再重新用重心分類一次 ![](https://i.imgur.com/OZ0hPaj.png) * 5. 不斷repeat上面四個步驟,直到收斂不改變 ![](https://i.imgur.com/ozY5B9j.png) ```python= import numpy as np import pandas as pd from sklearn.cluster import KMeans data = pd.read_csv('./dataset/xclara.csv') print("Input Data and Shape") print(data.shape) f1 = data['V1'].values f2 = data['V2'].values x = np.array(list(zip(f1, f2))) model = KMeans(n_clusters = 3) model = model.fit(x) labels = model.predict(x) centroids = model.cluster_centers_ print('centroids : {}'.format(centroids)) print('prediction on each data : {}'.format(labels)) labels = model.predict(np.array([[12.0, 14.0]])) print('prediction on data point (12.0, 14.0) : {}'.format(labels)) ``` ``` Input Data and Shape (3000, 2) centroids : [[ 69.92418447 -10.11964119] [ 40.68362784 59.71589274] [ 9.4780459 10.686052 ]] prediction on each data : [2 2 2 ... 
0 0 0] prediction on data point (12.0, 14.0) : [2] ``` ### DBSCAN * Density-based spatial clustering of applications with noise * 以密度來衡量哪個資料屬於哪個類別 * 優點:他很能抵抗noise data * 藍色區域密度大被歸為一類,紅色同理 ![](https://i.imgur.com/eeLbYLP.png) #### Terminlogy in DBSCAN * Density : 密度 * 指定某半徑,能匡到的資料數目即為密度 * core point : 核心點 * 定義了某半徑,以某資料點為圓心,做一個圓,匡到了一些資料點,如果超過我所設定最少量的資料點,就稱為core point * border point : * 定義了某半徑,以某資料點為圓心,做一個圓,匡到了一些資料點,如果少於我所設定最少量的資料點,但其中有匡到core point,就稱為border point * noise point : * 非core point 且非 border point ![](https://i.imgur.com/UwgYOgD.png) #### Directly Reachable * 假設有點p跟點q,兩點間連一條直線,經過的點不是core point就是border point,就稱為directly reachable * core point及附近的border point會被歸為同一類 ![](https://i.imgur.com/QpCbKnu.png) #### Example ![](https://i.imgur.com/fZdhYF7.png) * Directly reachable會被歸為一類 ![](https://i.imgur.com/yFSYauN.png) #### problem * 半徑沒設好會有一些問題 ![](https://i.imgur.com/Sajpam8.png) #### Code ```python= import numpy as np import pandas as pd from sklearn import preprocessing, metrics from sklearn.cluster import DBSCAN from sklearn.datasets.samples_generator import make_blobs df = pd.read_csv('./dataset/iris.csv', header = None) df = df.drop(4, axis = 1) print("Input Data and Shape") print(df.head()) print(df.shape) x = np.array(df) model = DBSCAN(eps = 0.3, min_samples = 5).fit(x) labels = model.labels_ n_clusters = len(set(labels)) - (1 if -1 in labels else 0) print('number of clusters : {}'.format(n_clusters)) print('cluster on x {}'.format(labels)) ``` ``` Input Data and Shape 0 1 2 3 0 5.1 3.5 1.4 0.2 1 4.9 3.0 1.4 0.2 2 4.7 3.2 1.3 0.2 3 4.6 3.1 1.5 0.2 4 5.0 3.6 1.4 0.2 (150, 4) number of clusters : 3 cluster on x [ 0 0 0 0 0 -1 0 0 0 0 0 0 0 0 -1 -1 -1 0 -1 0 -1 0 -1 0 0 0 0 0 0 0 0 -1 -1 -1 0 0 -1 0 0 0 0 -1 0 0 -1 0 0 0 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 1 -1 1 2 -1 -1 -1 -1 -1 -1 -1 -1 -1 1 1 1 -1 -1 -1 -1 -1 1 1 -1 -1 1 -1 1 1 1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 2 2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 2] ``` ### EM * Expectation-Maximization * 藉由迭代,找到機率模型裡的參數 #### Maximum Likelihood Estimation * 給予一個機率分佈,想要找出一些參數 ![](https://i.imgur.com/IzmbX5h.png) * 數字overflow,所以取log處理 ![](https://i.imgur.com/DAGbQ71.png) * 估計整棵樹蘋果的平均重量 ![](https://i.imgur.com/QpW8LlK.png) * 將值帶入 ![](https://i.imgur.com/fv5PifS.png) * 取log ![](https://i.imgur.com/IAdp3CG.png) * 母體平均大約等於隨機抽樣的平均 ![](https://i.imgur.com/HJi8BSF.png) #### Likelihood Estimation Table ![](https://i.imgur.com/kX4OSXp.png) ![](https://i.imgur.com/APOIcjE.png) #### Example * 欲將資料分為兩個類別 ![](https://i.imgur.com/sXAYziq.png) * 隨機另兩個高斯曲線 ![](https://i.imgur.com/RwmR3eV.png) ##### E step * 一剛開始的狀況,因為是隨便給的,所以分得很爛 * 計算資料隸屬於黃色的比例跟隸屬於藍色的比例 * 最左邊的資料 : 70 % 是黃色類別,30 % 是藍色類別 * 左二 : 40 % 是黃色類別,60 % 是藍色類別 * 左三 : 10 % 是黃色類別,90 % 是藍色類別 ![](https://i.imgur.com/AR7kcam.png) ##### M step * 去修正機率模型的參數(平均跟標準差) * 計算likelyhood,更新平均跟標準差,畫出新的高斯分佈 ![](https://i.imgur.com/aLgUbtF.png) ##### 完成 ![](https://i.imgur.com/zZujfwg.png) #### 缺點 * 資料可能不是高斯常態分佈 * 可以使用其他分佈 ### Different cluster method ![](https://i.imgur.com/YHupseY.png) ## Dimension Reduction * unsupervised * 讓資料壓縮 * reduce time complexity * reduce space complexity * 增加可視化程度 ### Illustraion ![](https://i.imgur.com/tx5xMbn.png) ![](https://i.imgur.com/WNrivPf.png) #### Code ```python= import pandas as pd import numpy as np from sklearn import metrics, mixture, preprocessing df = pd.read_csv('./dataset/iris.csv', header = None) df = df.drop(4, axis = 1) print('Input Date and Shape') print(df.head()) x = 
np.array(df) model = mixture.GaussianMixture(n_components = 3).fit(x) x_pred = model.predict(x) print(x_pred) ``` ``` Input Date and Shape 0 1 2 3 0 5.1 3.5 1.4 0.2 1 4.9 3.0 1.4 0.2 2 4.7 3.2 1.3 0.2 3 4.6 3.1 1.5 0.2 4 5.0 3.6 1.4 0.2 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 0 2 0 2 2 2 2 0 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] ``` ### SVD * singular-value decomposition * 矩陣可拆分 ![](https://i.imgur.com/Eq6Ybkn.png) #### Matrix rank * 第一個row剪掉第二個row會等於第三個row * 代表第三個row是多餘的,代表線性獨立的row只有兩個 ![](https://i.imgur.com/o04RhMT.png) #### Example_1 * 聽不懂啦幹 ![](https://i.imgur.com/hNE9axu.png) ![](https://i.imgur.com/tmS7cSM.png) ![](https://i.imgur.com/N9RHfBn.png) * v : 選轉矩陣 * $\sigma$ : 壓扁 * U : 選轉矩陣 ![](https://i.imgur.com/nDnox93.png) #### Example_2 * 對電影的評分 ![](https://i.imgur.com/O2qCIB9.png) ![](https://i.imgur.com/oWeol1A.png) * 把數值最小的row及相對應的row或column砍掉 ![](https://i.imgur.com/FUnTIfZ.png) ![](https://i.imgur.com/U0Avmx5.png) * 相似於原始矩陣 ![](https://i.imgur.com/LVj58Bx.png) * 留下V * 把五維轉成二維 ![](https://i.imgur.com/aRzlaiG.png) #### How many singular values to keep? * 盡量保持80%以上的資料 ### PCA * principal component analysis * 把所有資料投影在垂直座標軸上 ![](https://i.imgur.com/2pTSjvr.png) #### Example ![](https://i.imgur.com/SLhqoLN.png) #### Which projection axis is better * 投影在紅線上好,還是綠線上好 * 投影在紅線上好,離散程度較高 * 較不會有兩點重疊 ![](https://i.imgur.com/A67qq5v.png) * 投影在w1軸上,離散程度要最大 ![](https://i.imgur.com/i8qtkeq.png) * 投影在w2軸上,離散程度盡量大 * 避免跟第一個軸一樣,所以兩軸要垂直 ![](https://i.imgur.com/ApKaUdn.png) #### PCA concept * 找一個軸,離散程度最大的 * 再找下一個軸,並確保跟其他軸內積相成等於零 ![Uploading file..._pb3u7qm37]() ##### Covariance ![](https://i.imgur.com/SJz0Lf3.png) ##### correlation ![](https://i.imgur.com/6XS7RRE.png) ##### Covariance Matrix ![](https://i.imgur.com/4sOAhPu.png) ![](https://i.imgur.com/T1Tyrhf.png) ![](https://i.imgur.com/O81xu4m.png) ![](https://i.imgur.com/CIqHxSX.png) ##### Step ![](https://i.imgur.com/011JL86.png) #### Example ![](https://i.imgur.com/JkQx8r1.png) ![](https://i.imgur.com/8ruwXHH.png) #### Code(PCA) ```python= import pandas as pd import numpy as np from sklearn.decomposition import PCA df = pd.read_csv('./dataset/seeds_dataset.csv', header = None) df = df.drop(7, axis = 1) print(df.head()) x = np.array(df) pca = PCA(n_components = 3) pca.fit(x) x_reduced = pca.transform(x) print('singular values is {}'.format(pca.singular_values_)) print('after pca, all of data is reduced to 3D') print(x_reduced) ``` ``` 0 1 2 3 4 5 6 0 15.26 14.84 0.8710 5.763 3.312 2.221 5.220 1 14.88 14.57 0.8811 5.554 3.333 1.018 4.956 2 14.29 14.09 0.9050 5.291 3.337 2.699 4.825 3 13.84 13.94 0.8955 5.324 3.379 2.259 4.805 4 16.14 14.99 0.9034 5.658 3.562 1.355 5.175 singular values is [47.49531899 21.09635322 3.92284041] after pca, all of data is reduced to 3D [[ 6.63448376e-01 -1.41732098e+00 4.12356541e-02] [ 3.15666512e-01 -2.68922915e+00 2.31726953e-01] [-6.60499302e-01 -1.13150635e+00 5.27087232e-01] [-1.05527590e+00 -1.62119002e+00 4.37015260e-01] [ 1.61999921e+00 -2.18338442e+00 3.33990920e-01] [-4.76938007e-01 -1.33649437e+00 3.55360614e-01] [-1.84834720e-01 -1.50364411e-01 1.41497264e-01] [-7.80629616e-01 -1.12979883e+00 2.79757608e-01] [ 2.28210810e+00 -1.36001690e+00 -3.50729413e-01] [ 1.97854147e+00 -1.49468793e+00 -2.93947251e-03] [ 3.69122947e-01 8.86722511e-01 1.13264978e-01] [-7.11021200e-01 -2.10663730e+00 1.37552595e-01] 
[-1.21370535e+00 9.46878939e-02 4.85809237e-01] [-1.16908541e+00 -7.42962899e-01 2.58209340e-01] [-1.19272176e+00 -9.53268162e-01 2.58450639e-01] [-5.08171207e-01 3.77958424e-01 6.56217572e-01] [-1.37469698e+00 1.32290559e+00 8.01997838e-01] [ 1.05726438e+00 -2.01562875e+00 4.33972249e-01] [-1.50961097e-01 -2.02235813e+00 7.52022386e-01] [-2.46241293e+00 7.37473835e-02 2.12996833e-01] [-6.31332100e-01 -7.18305655e-01 -5.43487434e-02] [-6.89698660e-01 -1.11182531e+00 -1.69906624e-02] [ 1.40769072e+00 -2.80658086e+00 3.14889098e-01] [-2.84267672e+00 -2.66880642e+00 -5.42113806e-02] [ 4.33268215e-01 -1.88984464e+00 1.04814954e-01] [ 1.81289158e+00 -2.60002176e+00 5.34196985e-02] [-2.02131332e+00 -6.08743328e-01 1.84384422e-01] [-2.19571862e+00 -1.49837622e+00 2.35406301e-02] [-7.44468841e-01 -1.06518721e+00 1.56860725e-01] [-1.50350480e+00 -3.68206745e-01 -8.59614612e-03] [-1.52075320e+00 -3.06180225e+00 -1.73200953e-01] [ 7.61190256e-01 -2.09488759e-01 1.65627857e-01] [-7.67738428e-01 1.26295451e-01 -1.20622472e-01] [-8.23965933e-01 -1.70715020e+00 5.32025512e-02] [ 4.39542396e-01 -1.52858534e+00 -5.57773440e-02] [ 1.52205298e+00 -1.25609762e+00 1.35317730e-01] [ 1.65240525e+00 -6.75119440e-01 -3.33822258e-03] [ 2.47674445e+00 -4.51537548e-01 3.09238192e-01] [ 1.15750673e-02 -5.96250977e-01 3.52544529e-02] [-1.11443822e+00 2.83345206e+00 5.68536979e-01] [-1.37160170e+00 -1.30108591e+00 3.86783538e-02] [-1.36349513e+00 -1.63960124e+00 7.21418337e-03] [-1.88302954e+00 -1.51985066e+00 4.14988118e-01] [ 6.29560566e-01 1.10062048e+00 6.34515325e-03] [ 2.84412124e-01 -5.60641213e-01 2.98249761e-01] [-9.60044753e-01 -2.29723526e+00 1.41107629e-01] [ 8.18964617e-01 -2.26570276e+00 1.53990241e-01] [ 1.96621301e-01 -7.40666486e-01 2.29592441e-01] [ 1.53276057e-02 -1.02061146e+00 2.02590478e-01] [ 2.54235169e-01 -1.55022082e+00 -1.05705755e-01] [-5.05384531e-01 1.97773604e-01 1.75366006e-01] [ 7.11973095e-01 1.96593925e+00 5.15648832e-01] [-3.55829381e-01 3.79579753e-01 -1.55069874e-01] [-5.66856332e-01 -4.55332330e-01 8.98393269e-02] [ 1.80968891e-02 -2.21678367e+00 -3.93360924e-01] [ 4.78424858e-01 -1.71349178e+00 -1.93034894e-01] [-3.75464636e-01 -9.76631316e-01 3.10946012e-01] [ 2.83883316e-01 -2.56463849e+00 2.83654402e-01] [ 7.69429731e-01 -1.63154473e+00 1.52399604e-01] [-2.77110124e+00 -2.60034941e+00 2.33323254e-01] [-3.80344820e+00 -1.51695365e+00 2.37149111e-01] [-4.00534905e+00 -1.97086343e+00 2.03609501e-01] [-2.87823982e+00 -8.86757382e-01 4.63038181e-01] [-1.87406423e+00 2.13355284e-01 7.89854003e-02] [-2.05089330e+00 -2.82503499e+00 1.19739519e-01] [-2.16820614e+00 -1.67334585e+00 4.54792497e-01] [-2.59673795e-01 -2.44512020e+00 -6.00299972e-02] [-7.07084427e-01 -1.59064814e+00 -5.52933377e-02] [-2.37114217e-01 -2.28118595e+00 -1.49729134e-01] [-2.28478840e+00 -4.60103809e-01 -1.17460180e-01] [ 3.16409179e+00 8.04126878e-01 -2.66641999e-01] [ 2.20955077e+00 1.27850851e+00 -1.57740826e-01] [ 2.62062165e+00 1.18221241e+00 3.71401616e-02] [ 4.76779625e+00 -1.57604721e-01 9.29992692e-02] [ 2.21232336e+00 6.01138596e-01 -1.39544203e-01] [ 2.07181452e+00 1.50200244e+00 -7.00860345e-02] [ 2.84275994e+00 5.04000805e-01 -2.39983894e-01] [ 6.46235442e+00 1.60085825e+00 -1.52794770e-01] [ 4.47829765e+00 1.97530486e+00 -3.07050833e-01] [ 2.61486732e+00 -5.12991394e-01 2.02041872e-02] [ 1.67836371e+00 2.07292392e+00 -4.87254221e-02] [ 4.03755261e+00 2.14071720e+00 3.50841784e-01] [ 5.71859177e+00 2.21396348e+00 1.90304067e-01] [ 5.58807242e+00 -1.50990711e+00 -3.07129758e-01] [ 5.32256756e+00 
-5.11477550e-02 -1.35205623e-01] [ 4.00732381e+00 -7.29979610e-01 -3.00704179e-01] [ 4.70505178e+00 -1.45419923e+00 -9.90969343e-02] [ 4.79038389e+00 6.44967998e-01 -5.69065929e-01] [ 6.69570804e+00 2.94421273e+00 3.03912945e-01] [ 6.46037602e+00 2.15264305e+00 2.01205274e-01] [ 6.14337744e+00 -9.43926428e-01 -4.15476808e-01] [ 4.39514760e+00 -1.61012778e-02 -1.54841426e-03] [ 4.46139621e+00 1.12624509e-01 -7.95827310e-02] [ 3.78494886e+00 2.79030096e+00 3.90137034e-01] [ 4.01629279e+00 1.80247983e+00 -6.82260736e-01] [ 2.38051551e+00 3.23431649e-01 -3.38653786e-01] [ 5.03718740e+00 4.35062776e-01 -1.48395998e-01] [ 4.92049767e+00 -8.97804759e-01 -6.05902714e-01] [ 3.94065862e+00 -3.15810290e-01 -4.91695711e-01] [ 4.53327546e+00 -9.29513362e-01 -2.00722187e-01] [ 1.65697361e+00 7.28440774e-01 1.41627453e-01] [ 3.63867456e+00 -1.18043005e+00 6.80825411e-02] [ 4.97852528e+00 1.24163180e+00 2.63041542e-01] [ 4.94143745e+00 3.05328003e-01 -2.47908931e-01] [ 4.63589248e+00 2.70940363e-01 -1.14295115e-01] [ 4.52405499e+00 -5.83403419e-01 1.38095195e-01] [ 4.51574435e+00 -2.71311394e-01 -8.74599219e-02] [ 3.12281151e+00 4.56212973e-01 -8.62751578e-02] [ 5.83137312e+00 3.30385849e-01 -4.76189658e-01] [ 4.35718398e+00 -1.41741787e+00 -6.25084495e-02] [ 4.15756069e+00 -9.50833185e-01 9.65051938e-02] [ 5.08259001e+00 6.24704579e-01 6.28827972e-02] [ 4.89139771e+00 -9.82917842e-01 1.28275950e-01] [ 4.44322173e+00 3.57224575e+00 1.56974171e-01] [ 6.67155248e+00 1.84060496e+00 9.12010421e-02] [ 4.90749190e+00 -8.18130733e-01 -2.55684480e-01] [ 4.37369427e+00 1.17679493e+00 4.41493473e-01] [ 4.87191916e+00 1.48790820e-02 -9.56013893e-02] [ 4.44858857e+00 5.06661089e-01 9.35385490e-02] [ 5.88457254e+00 1.27049077e-01 -1.80453388e-01] [ 5.68383523e+00 2.94064884e+00 2.60580298e-01] [ 3.70570300e+00 4.03190365e-01 -1.09288956e-01] [ 1.48852809e+00 7.87950654e-01 -8.22192480e-02] [ 3.98326071e+00 -2.15055765e-01 1.49554483e-01] [ 1.15533338e+00 -2.55675983e-01 5.98162419e-01] [ 4.23450934e+00 1.03173593e+00 1.64406843e-01] [ 4.21700578e+00 1.24930690e+00 -1.56980693e-01] [ 3.62299759e+00 -9.85552845e-01 -1.99064864e-02] [ 6.17386444e+00 -1.00395914e+00 -1.89332536e-01] [ 2.71378793e+00 2.00947941e+00 3.92678397e-01] [ 3.86089703e+00 -3.73513274e-01 8.03597573e-02] [ 4.61500274e+00 -2.10261231e-01 9.79900380e-02] [ 5.92114800e-01 8.66320178e-01 -3.00484271e-01] [ 1.48591457e+00 7.74439749e-01 -1.72450887e-01] [ 6.90660085e-01 1.38976691e+00 -1.68271877e-01] [ 5.30993584e-01 -4.14621634e-02 2.10574049e-01] [ 2.89265049e+00 2.11576887e-01 -2.20244499e-01] [ 1.10275966e+00 -8.95136177e-01 -5.29279898e-01] [ 1.08106342e+00 -8.23294917e-01 -3.54727287e-01] [ 1.58041453e+00 2.92679166e-01 -2.25284288e-01] [-2.08047727e+00 1.36508853e+00 -1.99716808e-01] [-2.04891061e+00 3.11005161e+00 -6.38273447e-02] [-1.93112897e+00 2.06805453e+00 3.42899489e-02] [-3.14761497e+00 1.38674593e+00 -2.02387013e-02] [-3.35755431e+00 3.62389743e-01 -2.78959609e-01] [-4.22241525e+00 1.97238736e+00 -3.44133543e-01] [-3.55211446e+00 -1.92651406e+00 -3.78402644e-01] [-2.74248901e+00 3.68275745e-01 9.39425325e-02] [-2.26028697e+00 -7.15757433e-01 -3.01514983e-01] [-4.59256683e+00 1.21366537e+00 -4.10485088e-01] [-3.49115131e+00 1.07925934e+00 -2.38194431e-01] [-3.44038060e+00 2.89298232e+00 -2.07485411e-01] [-2.88398331e+00 7.17984884e-01 -3.58805444e-01] [-3.96469498e+00 -8.67031460e-01 -2.74174690e-01] [-3.85801218e+00 -1.19622949e-01 -3.44490680e-01] [-4.23854664e+00 1.60807548e+00 -2.98762740e-01] [-3.89609307e+00 -8.50339426e-01 
-6.58059699e-02] [-2.98607257e+00 7.68415914e-01 -3.42792920e-01] [-3.33747668e+00 2.84753275e-01 -5.22692614e-01] [-3.83091968e+00 1.23660236e+00 -3.79258301e-01] [-2.36752198e+00 -8.93936559e-01 -5.13938956e-01] [-3.15770694e+00 1.92519249e-01 -3.20706760e-01] [-3.23143479e+00 8.85496109e-01 -4.51321791e-02] [-2.61469342e+00 3.94920094e-01 -8.06497416e-02] [-4.49824136e+00 2.13615465e+00 6.33923291e-02] [-2.94329647e+00 -1.88566242e+00 -4.82086609e-02] [-2.76609970e+00 8.91774133e-01 -1.72181504e-01] [-2.89907369e+00 -4.09293999e-01 -4.02323115e-01] [-3.90252444e+00 1.58340894e-01 -2.77227460e-01] [-3.95458414e+00 -6.73045369e-01 -2.41546685e-01] [-4.52101870e+00 2.49812224e+00 -2.49745319e-01] [-4.04118364e+00 2.51581412e+00 1.29405374e-01] [-4.04672999e+00 1.00738290e-01 -9.09399508e-02] [-4.03387689e+00 1.39431425e+00 -9.26707967e-02] [-4.51655795e+00 9.40408717e-01 -4.06201269e-01] [-4.67880742e+00 4.91842458e-01 -5.80895287e-02] [-4.15214114e+00 1.12758416e+00 -1.65527388e-01] [-4.67134302e+00 4.21033845e-01 -1.71844958e-01] [-4.01783124e+00 1.67978706e+00 3.14971477e-03] [-2.60793947e+00 -2.37304506e+00 -3.55362015e-01] [-4.03445958e+00 7.40517346e-01 1.30612374e-01] [-2.84072766e+00 9.33512708e-01 5.46936118e-02] [-3.09276593e+00 7.75654915e-01 -5.61725952e-02] [-3.75632591e+00 1.04703805e+00 -4.29715405e-02] [-2.41503022e+00 2.20440679e+00 -8.67431609e-02] [-3.57450320e+00 -7.19443063e-02 -4.01721420e-01] [-3.37275950e+00 8.03901892e-01 -4.60390433e-01] [-4.43114913e+00 -7.75918129e-02 -1.41347439e-01] [-4.55059431e+00 3.26576882e+00 1.99485527e-01] [-5.00252227e+00 6.36766609e-01 1.71812058e-01] [-4.55827661e+00 1.13664276e+00 -9.51701336e-02] [-4.04374701e+00 -2.25814885e-01 -6.87186458e-02] [-3.36161741e+00 -5.27869965e-01 -4.74331964e-02] [-4.56094950e+00 5.95574083e-01 -2.83496804e-01] [-3.11855689e+00 3.31944329e-02 3.57435656e-02] [-2.52946420e+00 8.37109623e-01 3.64003383e-01] [-2.58655448e+00 1.44846001e+00 3.01071387e-01] [-1.83336074e+00 7.30693089e-01 2.15840207e-01] [-2.36059318e+00 -6.86820620e-01 -2.53316928e-01] [-2.35819688e+00 -1.20488410e+00 3.56048199e-01] [-2.97998867e+00 1.39806029e+00 1.31248700e-01] [-2.41873891e+00 -1.74952111e+00 4.09312427e-01] [-4.21924202e+00 -1.94251854e-01 1.18503747e-01] [-3.08869155e+00 4.37638669e+00 4.99058361e-01] [-2.78962325e+00 -1.41941777e-01 5.02954010e-02] [-3.04187227e+00 -4.73126171e-01 1.95045363e-01] [-4.10906270e+00 1.09340872e-01 -8.74005598e-02] [-2.50003394e+00 4.30796502e+00 5.32818431e-01] [-3.33207854e+00 -5.25289746e-01 -9.81079349e-02] [-3.10755116e+00 1.54975743e+00 1.21282793e-01]] ``` ## Kaggle 實戰 https://www.kaggle.com/c/titanic ### Goal It is your job to predict if a passenger survived the sinking of the Titanic or not. For each in the test set, you must predict a 0 or 1 value for the variable. ### Metric Your score is the percentage of passengers you correctly predict. This is known as accuracy. ### Submission File Format You should submit a csv file with exactly 418 entries plus a header row. Your submission will show an error if you have extra columns (beyond PassengerId and Survived) or rows. The file should have exactly 2 columns: PassengerId (sorted in any order) Survived (contains your binary predictions: 1 for survived, 0 for deceased) PassengerId,Survived 892,0 893,1 894,0 Etc. 
### 匯入資料 ```python= train_df = pd.read_csv('train.csv') test_df = pd.read_csv('test.csv') combine = [train_df, test_df] ``` ### 觀察資料 #### 看header ```python= print(train_df.columns.values) ``` #### Print a concise summary of a DataFrame. ```python= train_df.info() print('_'*40) test_df.info() ``` ``` <class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 12 columns): PassengerId 891 non-null int64 Survived 891 non-null int64 Pclass 891 non-null int64 Name 891 non-null object Sex 891 non-null object Age 714 non-null float64 SibSp 891 non-null int64 Parch 891 non-null int64 Ticket 891 non-null object Fare 891 non-null float64 Cabin 204 non-null object Embarked 889 non-null object dtypes: float64(2), int64(5), object(5) memory usage: 83.7+ KB ________________________________________ <class 'pandas.core.frame.DataFrame'> RangeIndex: 418 entries, 0 to 417 Data columns (total 11 columns): PassengerId 418 non-null int64 Pclass 418 non-null int64 Name 418 non-null object Sex 418 non-null object Age 332 non-null float64 SibSp 418 non-null int64 Parch 418 non-null int64 Ticket 418 non-null object Fare 417 non-null float64 Cabin 91 non-null object Embarked 418 non-null object dtypes: float64(2), int64(4), object(5) memory usage: 36.0+ KB ``` #### Generate descriptive statistics. ```python= train_df.describe() ``` ![](https://i.imgur.com/hhQN91w.png) #### 評分標準 ```python= round(model.score(X_train, Y_train) * 100, 2) ``` ### 匯出CSV ```python= submission = pd.DataFrame({ "PassengerId": test_df["PassengerId"], "Survived": Y_pred }) submission.to_csv('submission.csv', index=False) ```
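### 完整流程示意
* 下面把前面的步驟(匯入資料、特徵工程、選模型、評分、匯出CSV)串成一個最小可跑的範例;這是筆記自行補充的示意寫法,假設沿用前面章節介紹過的`LogisticRegression`,特徵工程只做最簡單的處理(丟掉文字欄位、類別轉數字、缺失值補中位數),`preprocess`是示意用的自訂函式,實際比賽時特徵工程與模型選擇都還有很大的空間
```python=
import pandas as pd
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression

# 匯入資料
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# 簡單的特徵工程(示意用):丟掉文字欄位、類別轉數字、缺失值補中位數
def preprocess(df):
    df = df.drop(['Name', 'Ticket', 'Cabin'], axis=1)
    df['Sex'] = pd.Categorical(df['Sex']).codes
    df['Embarked'] = pd.Categorical(df['Embarked']).codes
    df['Age'] = df['Age'].fillna(df['Age'].median())
    df['Fare'] = df['Fare'].fillna(df['Fare'].median())
    return df

train = preprocess(train_df)
test = preprocess(test_df)

y_train = train['Survived']
x_train = train.drop(['Survived', 'PassengerId'], axis=1)
x_test = test.drop('PassengerId', axis=1)

# Normalization
scaler = preprocessing.StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

# Select model + 評分
model = LogisticRegression()
model.fit(x_train, y_train)
print(round(model.score(x_train, y_train) * 100, 2))

# Predict + 匯出CSV
y_pred = model.predict(x_test)
submission = pd.DataFrame({
    "PassengerId": test_df["PassengerId"],
    "Survived": y_pred
})
submission.to_csv('submission.csv', index=False)
```
* 跑完後會在目前資料夾產生`submission.csv`,內容即為上面要求的PassengerId與Survived兩個欄位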