# 機器學習
###### tags: `數據分析` `機器學習`
## 機器學習概論
### 依照輸入分類
#### Supervised Learning
* 輸入資料有答案(label)
#### Unsupervised Learning
* 輸入資料沒答案(label)
* Cluster unlabelled data
* 讓電腦把資料相似的分類在一起,相異的放遠一點
#### Semi-supervised Learning
* 一半的資料有答案,一半的資料沒答案
#### Reinforcement Learning
* 給它沒答案的資料
* 如果答對給予正回饋,答錯就給負回饋
### 依照輸出分類

#### Regression
* 預測的東西是一個連續的數值
* ex:
* 股價預測
* 職棒比賽,確切比數
* 預測精確數值,難度較高
#### Classification

* 預測的東西是一個類別
* ex:
* 股價漲或跌的預測
* 職棒比賽,誰贏誰輸
* 不需要預測精確數值,難度較低
### Machine learning workflow

#### 資料輸入
#### 特徵工程
##### 資料缺失值
* 數字型的
* 整筆資料不要用
* 填平均值或中間值
* 類別型的
* 取眾數
* 設立一個其他類別
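上面幾種補值方式可以直接用pandas做,下面是一個小示意(DataFrame與欄位名稱都是假設的例子):
```python=
import numpy as np
import pandas as pd
# 假設的範例資料:age是數字型、city是類別型,各有缺失值
df = pd.DataFrame({'age': [25, np.nan, 40, 31],
                   'city': ['台北', None, '台中', '台北']})
# 數字型:填平均值或中位數(或用dropna()整筆不要)
df['age'] = df['age'].fillna(df['age'].median())
# 類別型:取眾數,或另外設一個「其他」類別
df['city'] = df['city'].fillna(df['city'].mode()[0])
# df['city'] = df['city'].fillna('其他')
print(df)
```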
##### 極端值
* 刪除極端值
##### Split data
* Training data
* 訓練資料
* 通常80%
* 通常大於testing data
* Testing data
* 測試資料
* 通常20%
* 通常小於training data
##### Normalization

* 數量級差太多,學到的參數的數量級會相差非常大
* 讓每個值的欄位壓到0跟1之間或-1跟1之間
* 常用方法
* min_max
* [0,1]
* $\dfrac{x_i - x_{min}}{x_{max} - x_{min}}$
* Z-Score Standardization
* 平均變0、標準差變1(值沒有固定的上下界,不是剛好落在[-1,1]之間)
* $\dfrac{x_i - \mu}{\sigma}$
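以下用sklearn的MinMaxScaler與StandardScaler示範這兩種做法(資料是假設的小例子):
```python=
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
x = np.array([[1.0], [5.0], [10.0]])  # 假設的單一欄位資料
# min_max:壓到[0, 1]之間
print(MinMaxScaler().fit_transform(x).ravel())   # [0.    0.444 1.   ]
# Z-Score:平均變0、標準差變1
print(StandardScaler().fit_transform(x).ravel())
```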
#### Select model

##### Validate trained model - regression
* 拿來評量模型表現好不好
###### MSE
* $MSE = \dfrac{1}{n}\sum\limits_{i = 1}^n{(f_i - y_i)^2}$
* $f_i$: 預測值
* $y_i$: 實際答案
* n: 數值數量
* 值越小,代表誤差越小
###### R squared(coefficient of determination)
* $R ^ 2 = 1 - \dfrac{\sum\limits_{i = 1}^n{(y_i - f_i) ^ 2}}{\sum\limits_{i = 1}^n{(y_i - \bar{y}) ^ 2}}$
* $y_i$: 實際答案
* $f_i$: 預測值
* $\bar{y}$: 答案的平均
* 代表模型解釋了多少答案的變異(誤差平方和佔總變異的比例越小,$R^2$越大)
* 越接近1代表表現越好,值越小表現越爛
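下面用sklearn.metrics以一組假設的預測值示範MSE與$R^2$的計算:
```python=
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
y_true = np.array([3.0, 5.0, 7.0, 9.0])   # 假設的實際答案
y_pred = np.array([2.5, 5.5, 7.0, 8.0])   # 假設的預測值
print(mean_squared_error(y_true, y_pred))  # (0.5^2 + 0.5^2 + 0 + 1^2) / 4 = 0.375
print(r2_score(y_true, y_pred))            # 1 - 1.5/20 = 0.925,越接近1越好
```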
##### Validate trained model - classification
###### $accuracy = \dfrac{\text{\# of correct predictions}}{\text{\# of data}}$

###### $accuracy = \dfrac{TP + TN}{TP + FP + FN + TN}$

* True Positive (TP)「真陽性」:真實情況是「有」,模型說「有」的個數。
* True Negative(TN)「真陰性」:真實情況是「沒有」,模型說「沒有」的個數。
* False Positive (FP)「偽陽性」:真實情況是「沒有」,模型說「有」的個數。
* False Negative(FN)「偽陰性」:真實情況是「有」,模型說「沒有」的個數。
* $Precision = \dfrac{TP}{TP + FP}$
* $Recall = \dfrac{TP}{TP + FN}$
###### F1-score
* $F1 = 2 * \dfrac{1}{\dfrac{1}{recall} + \dfrac{1}{precision}} = 2 * \dfrac{precision * recall}{precision + recall}$
* 越接近1表現越好,越接近0越差
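以下用sklearn.metrics以一組假設的二元分類結果,示範accuracy、precision、recall、F1的計算:
```python=
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # 假設的實際答案
y_pred = [1, 1, 0, 0, 0, 1, 0, 0]   # 假設的模型預測
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)                   # TP=2, FP=1, FN=2, TN=3
print(accuracy_score(y_true, y_pred))   # (TP+TN)/全部 = 5/8
print(precision_score(y_true, y_pred))  # TP/(TP+FP) = 2/3
print(recall_score(y_true, y_pred))     # TP/(TP+FN) = 2/4
print(f1_score(y_true, y_pred))         # 2PR/(P+R) = 4/7
```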
###### ROC
* 針對不同的門檻值去畫出ROC curve
* AUC(area under curve)
* AUC=0.5 (no discrimination 無鑑別力)
* 0.7≦AUC≦0.8 (acceptable discrimination 可接受的鑑別力)
* 0.8≦AUC≦0.9 (excellent discrimination 優良的鑑別力)
* 0.9≦AUC≦1.0 (outstanding discrimination 極佳的鑑別力)
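ROC curve用的是模型輸出的機率(分數)而不是0/1的預測,下面用假設的分數示範sklearn的算法:
```python=
from sklearn.metrics import roc_curve, roc_auc_score
y_true  = [0, 0, 1, 1, 0, 1]               # 假設的實際答案
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]  # 假設的模型輸出機率
fpr, tpr, thresholds = roc_curve(y_true, y_score)  # 不同門檻值下的(FPR, TPR)
print(list(zip(thresholds, fpr, tpr)))
print(roc_auc_score(y_true, y_score))       # AUC:ROC curve下的面積
```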
###### Confusion matrix

* 把每個類別被「猜成哪一類」的數量攤開來看,也能從二元分類推廣到多元分類
* 對角線上是猜對的
##### Bias-Variance Tradeoff

* Variance : 模型對訓練資料的敏感程度;太高代表換一批資料結果就差很多(不穩定)
* Bias : 預測值平均與真實答案的偏差;太高代表模型根本沒抓到趨勢

* 選擇模型太簡單(變數太少之類的),那會落到high bias的區域
* 選擇模型太複雜,會落到high variance的區域,會不穩定
* overfitting
##### No free lunch theorem

* 沒有任何一個演算法可以勝過其他所有的演算法
* 根據不同問題有不同的解法
#### 輸出
### 應用領域
* 自動駕駛
* 支付
* 人臉辨識支付
* 醫療應用
* 腫瘤偵測
* 機器人理財
* 無人機
* 自動避障
* 預測推銷
* youtube推薦演算法
* 智慧工廠
* 語音智慧助理
* siri
## Regression Supervised Learning
### Linear Regression
* 依現有的data,找到一條趨勢線,預測未知的data
* 線性解
#### Case
##### Suppose 房價只跟屋齡有關

##### 可以依照data畫出多條趨勢線

##### 利用cost function 找到最適合(誤差最小的)的趨勢線
$Cost = \sum\limits_{i = 1} ^ m{(y_i - \hat{y}_i) ^ 2}$,其中$\hat{y}_i$為趨勢線的預測值

##### 微積分求解

#### Multivariate Linear Regression Models


* 多變量時難以用公式解,變數太多了
* 通常用gradient descent
##### Gradient Descent

* 只保證找到區域最小值(local minimum)
* learning rate太小會收斂很慢,太大會在最低點附近來回震盪、找不到最佳值,通常從0.1到1之間開始嘗試
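下面是用numpy對單變數線性迴歸做gradient descent的簡化示意(資料與learning rate都是假設的):
```python=
import numpy as np
# 假設的資料:y大約等於2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])
w, b = 0.0, 0.0   # 初始參數
lr = 0.05         # learning rate(假設值)
for _ in range(2000):
    y_hat = w * x + b
    # Cost = mean((y_hat - y)^2),對w、b微分得到梯度
    grad_w = 2 * np.mean((y_hat - y) * x)
    grad_b = 2 * np.mean(y_hat - y)
    w -= lr * grad_w   # 往梯度的反方向走一小步
    b -= lr * grad_b
print(w, b)            # 會收斂到接近2跟1
```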

##### Gradient Descent V.S. Normal Equation

* Gradient Descent
* learning rate要剛剛好
* 要迭代很多次
* n(特徵數量)很大時仍然適用
* Normal Equation
* 不用選learning rate
* 要計算矩陣逆運算,較麻煩
* n(特徵數量)大時算很慢(矩陣逆運算約$O(n^3)$)
##### Overfitting
* 最左(underfitting)
* 無法找到理想趨勢
* 中間(理想狀況)
* 最右(overfitting)
* 在training data上fit得很好,但遇到新資料就gg了
* 像死讀書,刷考古題的學生
* 變數越多越容易overfitting

#### Code
```python=
from sklearn import preprocessing, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
# 資料輸入
df = pd.read_csv('./dataset/housing.csv', header = None, delim_whitespace=True)
# 答案取出
y = df[13]
x = df.drop(13, axis = 1)
# Split data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.1)
# Normalization
scaler = preprocessing.StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
# Model Select
model = linear_model.LinearRegression()
model.fit(x_train, y_train)
# Predict
y_pred = model.predict(x_test)
print('Coefficient : {}'.format(model.coef_))
print('Mean squared error : {}'.format(mean_squared_error(y_test, y_pred)))
print('Variance score : {}'.format(r2_score(y_test, y_pred)))
```
```
Coefficient : [-1.02554944 0.96553896 0.16729159 0.58076865 -2.02969655 2.57536082
0.17046044 -2.84987298 2.50778431 -1.85852862 -2.05829633 0.82864609
-3.86813384]
Mean squared error : 34.403276579602064
Variance score : 0.6739078917414478
```
### Polynomial Regression
* 非線性解
* 低維度變高維度
#### nth-degree Polynomial Regression
* n次多項式,當作多個feature

* 其他model

#### Graph

#### 交叉項(cross term)
* 兩變數有加減乘除的關係(多為乘法)

* 原本兩個feature,經由相乘或是次方,多出很多個feature

```python=
import numpy as np
import pandas as pd
from sklearn import linear_model, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures
# 匯入檔案
df = pd.read_csv('./dataset/winequality-red.csv')
# 處理answer and data
y = df['quality']
x = df.drop('quality', axis = 1)
# 產生degree 為 2 的feature
poly = PolynomialFeatures(degree = 2).fit(x)
x = poly.transform(x)
# Split data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 1)
# Normalization
scaler = preprocessing.StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
# Select model
model = linear_model.LinearRegression()
model.fit(x_train, y_train)
# Predict
y_pred = model.predict(x_test)
# 查看係數
print('The coefficient : {}\n'.format(model.coef_))
print('Mean squared error : {}'.format(mean_squared_error(y_test, y_pred)))
print('Variance score : {}'.format(r2_score(y_test, y_pred)))
```
```
The coefficient : [-2.20439093e-11 -3.32036477e+01 -3.46762494e+01 -1.98569253e+01
-1.40652656e+01 -6.85273381e+01 -7.74114997e+01 1.00727869e+02
-2.50880009e+01 -7.93478035e+01 4.63945174e+01 -6.70038809e+00
-1.07912073e+00 -4.28103408e-01 -1.80118094e-01 -5.27741073e-01
-9.30638006e-01 -6.23191815e-01 5.34638897e-01 3.69394013e+01
-1.63849368e+00 6.25798078e-01 6.86240624e-02 -1.07543978e-01
3.25708843e-02 -6.49414077e-02 2.15989372e-01 -5.66509424e-02
3.43575308e-01 3.51401508e+01 -1.05497188e+00 -2.55960941e-01
8.98178254e-01 -3.09937110e-02 9.34787505e-02 8.98840857e-02
1.48808583e-01 -3.20881226e-02 2.21970228e+01 -3.13608092e+00
-3.12173913e-01 1.11046257e+00 -1.59627758e-01 -2.03532456e-01
3.43111110e-02 -6.51049711e-04 1.79109779e+01 -2.69409190e+00
1.29726359e-02 -3.65247752e-01 6.44755908e-02 -5.20131174e-02
-1.18159029e-01 6.92374227e+01 -5.56852862e-01 1.09112468e-01
5.56498217e-01 -1.63840021e-01 -3.70231215e-02 7.79292790e+01
-5.47020463e-01 -4.70828794e-01 1.29740195e+00 8.93119391e-02
-1.01689650e+02 1.28353602e+00 4.09945512e-01 -1.61154603e+00
2.10910693e+01 8.08256509e+01 -4.72908479e+01 4.26542501e+00
-2.92688018e+00 1.05074947e+00 2.71306226e+00 -3.52447248e-01
7.49136902e-02 -1.16178316e-01]
Mean squared error : 0.4308785588449687
Variance score : 0.29206509289757054
```
### Logistic Regression
* X : feature
* $\theta$ : 參數

#### Sigmoid function
* Output 介於 (0, 1) 之間(不會真的等於0或1)
* model的輸出常被拿來當作機率
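Sigmoid可以自己用numpy寫一個小例子,確認輸出會被壓在0到1之間(輸入值是假設的):
```python=
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(z))   # 介於0與1之間,z=0時等於0.5
```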

#### Cross-entropy (Cost Function)
* 給電腦兩個機率分佈,分析兩個機率分佈相不相似
* 兩機率分佈幾乎一樣,算出來越小
$C(\theta_0, \theta_1, ... , \theta_n) = -\dfrac{1}{m} \sum\limits_{i = 1} ^ m{\left[ y_i \log{h_\theta(x_i)} + (1 - y_i) \log{(1 - h_\theta(x_i))} \right]}$
#### Information(資訊含量)
* $p_i$ : 發生某事的機率
* Define Information : $\log({\dfrac{1}{p_i}})$
* 當某事機率越大,資訊含量越低
* 太陽從東邊升起的機率是100 %
* Information = $\log({\dfrac{1}{1}}) = 0$
* 當某事機率越小,資訊含量越高
* 明天下雨的機率是50 %
* Information = $\log({\dfrac{1}{\dfrac{1}{2}}}) = \log{2} \approx 0.3$
#### Entropy V.S. Cross-entropy
##### Entropy
* 資訊含量的期望值
* 資訊含量越不確定時越高,越確定時越低

##### Cross-entropy
* 輸入兩個機率分佈
* 衡量兩個機率分佈相不相近

#### Example

* Cross entropy 通常大於entropy
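以下用numpy搭配兩個假設的機率分佈,示範information、entropy與cross-entropy的計算(沿用上面以10為底的log):
```python=
import numpy as np
p = np.array([0.5, 0.5])   # 假設的真實分佈
q = np.array([0.9, 0.1])   # 假設的預測分佈
information = np.log10(1 / p)                 # 每個事件的資訊含量
entropy = np.sum(p * np.log10(1 / p))         # 資訊含量的期望值
cross_entropy = np.sum(p * np.log10(1 / q))   # 用p的機率對q的資訊含量取期望值
print(information)     # [0.301 0.301]
print(entropy)         # 0.301
print(cross_entropy)   # 約0.523,確實大於entropy
```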

#### Cost Function(cross-entropy)
* h 與 1 - h 是模型輸出的機率分佈
* y 跟 1 - y 是答案的機率分佈,非一即零

#### Learning
* Gradient descent
#### Code
```python=
import numpy as np
import pandas as pd
from sklearn import preprocessing, linear_model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
pima = pd.read_csv('./dataset/pima-indians-diabetes.csv')
#x = pima[['pregnant', 'insulin', 'bmi', 'age']]
y = pima['label']
x = pima.drop(['label'], axis = 1)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 1)
scaler = preprocessing.StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
model = LogisticRegression()
model.fit(x_train, y_train)
print(model.coef_)
print(model.intercept_)
y_pred = model.predict(x_test)
print(y_pred)
accuracy = model.score(x_test, y_test)
print(accuracy)
```
```
[[ 0.33498141 1.03029784 -0.30681406 -0.02318156 -0.07418398 0.67294487
0.2001099 0.20383277]]
[-0.90342628]
[0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 1 0
0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 1 0 1 0 1 0 1 0 1 0 0 0 0 0 1 1 1 1 1 0
1 0 1 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0 0 1 1 0 0 0
0 0 0 1 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 1 1 1 1 1 0 0 1 0 1 0 0 0 0 1 0 1 0
0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 1 0]
0.7835497835497836
```
## Classification Supervised Learning
### K-Nearest Neighbor
* 可同時classification跟regression
* 通常用來classification
* 非參數式(non-parametric),不需要訓練出參數
* 只需要決定k值
* k : 要參考幾個最近的鄰居
* k = 1
* 最近的一個是正方形,所以他是正方形類別
* k = 3
* 最近的三個是一個正方形跟兩個三角形,三角形較多,故為三角形類別
* k = 7
* 最近的七個是四個正方形跟三個三角形,正方形較多,故為正方形類別

#### Step
1. Look at data
* 把原本的資料做好分類

2. Calculate distances
* 將每筆data跟所求data的距離求出來

3. Find neighbors
* 找最近的k個資料

4. Vote from labels
* 最後做投票
* 看哪一個類型最多

#### How to Define Distance

* Manhattan distance(L1 distance)

* Euclidean Distance(L2 distance)
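這兩種距離可以直接用numpy算(點的座標是假設的):
```python=
import numpy as np
a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])
manhattan = np.sum(np.abs(a - b))          # L1:|1-4| + |2-6| = 7
euclidean = np.sqrt(np.sum((a - b) ** 2))  # L2:sqrt(3^2 + 4^2) = 5
print(manhattan, euclidean)
```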

#### How to choose K
* 通常取奇數
* 二元分類投票時不會平手,一定選得出類別
* K is small
* 易受不好的資料影響

* K is large
* 不小心預測出多數的類別
* 永遠預測到同一個類別
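實務上常用cross-validation在幾個奇數k之間比較,挑表現最好的,以下是一個示意(資料集用iris當假設的例子):
```python=
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
x, y = load_iris(return_X_y=True)
for k in [1, 3, 5, 7, 9]:
    model = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(model, x, y, cv=5).mean()   # 5-fold交叉驗證的平均accuracy
    print('k = {}, accuracy = {:.3f}'.format(k, score))
```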

#### Curse of dimensionality
* 利用次方增維

* 在高維度較容易分類

* 如果太高維,會造成overfitting
#### Code
```python=
import pandas as pd
import numpy as np
from sklearn import preprocessing, datasets, neighbors
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
df = pd.read_csv('./dataset/seeds_dataset.csv', header = None)
y = df[7] - 1
x = df.drop(7, axis = 1)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)
scaler = preprocessing.StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
model = neighbors.KNeighborsClassifier(n_neighbors = 5)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
num_correct_samples = accuracy_score(y_test, y_pred, normalize = False)
print('number of correct samples : {}'.format(num_correct_samples))
print('accuracy : {}'.format(accuracy))
```
```
number of correct samples : 39
accuracy : 0.9285714285714286
```
### Decision Tree
* 把決策過程變成一棵樹的結構

#### How to split on each node?

#### How to define a good split?
* 分割效果越明顯越好,比如符合A就幾乎一定是Yes、不符合就幾乎一定是No,那A就是一個好的node
* gain = 分割前 - 分割後
* gain越大越好
#### CART
* Classification and Regression Trees(CART)
* Binary tree
* 以Gini impurity作為衡量分割好壞的指標

##### Gini Impurity
* 公式

* 越勢均力敵時,gini越大,反之,gini越小
* gini越小越好
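Gini impurity可以用下面的小函式驗證(類別比例是假設的例子):
```python=
import numpy as np
def gini(p):
    # p:各類別所佔的比例,Gini = 1 - Σ p_i^2
    p = np.array(p)
    return 1 - np.sum(p ** 2)
print(gini([0.5, 0.5]))   # 勢均力敵:0.5(最大)
print(gini([0.9, 0.1]))   # 偏向單一類別:0.18
print(gini([1.0, 0.0]))   # 完全純:0
```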


##### Example_1
* 分割後,求出gini的期望值

* A
* 0.5 - 0.4852 = 0.015
* B
* 0.5 - 0.37 = 0.13
* B較佳

##### Example_2
* 遇到數值型的
* 取平均當分界線(在這邊是80)


#### ID3
* Iterative Dichotomiser 3(ID3)
* 可以有多個分支

##### Entropy
* 公式

* 越勢均力敵,entropy越大,反之,entropy越小
* Entropy 越小越好
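Entropy與分割後的gain也可以用同樣方式計算(以下的分割比例是假設的,log沿用前面以10為底):
```python=
import numpy as np
def entropy(p):
    # p:各類別比例,Entropy = Σ p_i * log(1 / p_i)
    p = np.array(p)
    p = p[p > 0]   # 避免log(0)
    return np.sum(p * np.log10(1 / p))
before = entropy([0.5, 0.5])   # 分割前,約0.301
after = 0.6 * entropy([0.9, 0.1]) + 0.4 * entropy([0.2, 0.8])  # 分割後的期望值
print(before, after, before - after)   # gain = 分割前 - 分割後,越大越好
```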


##### Example_1
* 分割後,求出entropy的期望值

* A
* 0.301 - 0.294 = 0.007
* B
* 0.301 - 0.242 = 0.069
* B較佳

##### Example_2
* Before splitting

* After splitting

* 對所有欄位都split看看,取最好的

* Overcast 的狀況都會是yes,所以就不需要決策node了

* 重複前面的動作,找出其他的節點

* 完成

##### 優點
* 可以直接寫成if-else的規則,可解釋性高
##### Step

##### Pruning
* 為了把訓練資料完全分開,決策樹可能長得很深而overfitting
* Pre-pruning
* 設一些條件,如果達到條件就不往下長
* 如果node所含的資料少於特定大小,就不繼續往下長
* Post-pruning
* 長完後,取自己想要的部分
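在sklearn裡,pre-pruning大多靠建樹時的參數限制,post-pruning則可以用較新版本提供的cost-complexity pruning,以下的參數值只是假設的示意:
```python=
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
x, y = load_iris(return_X_y=True)
# Pre-pruning:限制深度、node至少要有多少資料才往下長
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_split=10)
pre_pruned.fit(x, y)
# Post-pruning:先長完,再用ccp_alpha把不重要的子樹剪掉
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02)
post_pruned.fit(x, y)
print(pre_pruned.get_depth(), post_pruned.get_depth())
```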
#### Code(CART)
```python=
import pandas as pd
import numpy as np
from sklearn import preprocessing, neighbors
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.tree import DecisionTreeClassifier
df = pd.read_csv('./dataset/abalone.csv', header = None)
Mean = np.mean(df[8])
df[0] = pd.Categorical(df[0]).codes
df[8] = df[8].apply(lambda x : 0 if x > Mean else 1)
y = df[8]
x = df.drop(8, axis = 1)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)
scaler = preprocessing.StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
model = DecisionTreeClassifier()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
num_correct_samples = accuracy_score(y_test, y_pred, normalize = False)
con_matrix = confusion_matrix(y_test, y_pred)
print('number of correct sample : {}'.format(num_correct_samples))
print('accuracy : {}'.format(accuracy))
print('con_matrix : {}'.format(con_matrix))
```
```
number of correct sample : 595
accuracy : 0.7117224880382775
con_matrix : [[318 118]
[123 277]]
```
### Naive Bayes
* 機率分類器
* 條件機率
* 資料量大時很適合用Naive Bayes
* 計算量只隨特徵數量線性成長
* 文本分類
#### How Naive Bayes Classifier Work
* 看哪邊機率大,就屬於那一邊

* 轉成數學式

* 簡化後
* 看哪邊機率比較大,就屬於哪一類

#### Example
* 想知道句子是否跟運動有關

* 想知道"a very close game"是否跟運動有關

* 推導後

##### How to calculate term?
* $p(game|Sports) = \dfrac{2}{11}$
* 11 : 在sports的條件下,總共有十一個字
* 2 : 在sports的條件下,game出現過2次
* 盡量不要有機率為0的狀況,因為這樣整個乘積會直接變零

##### Laplace smoothing
* 為了避免某個字出現0次、讓整個條件機率乘積直接變成零
* 分母加整個文本有多少不同的字
* 分子加一
* 以詞頻搭配Laplace smoothing的這種作法,就是常見的Multinomial Naive Bayes

* 開算囉

* 帶入條件機率的式子
* sports > not sports,所以猜他是sports類別
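這種「詞頻 + Laplace smoothing」的做法可以用CountVectorizer加MultinomialNB重現,以下沿用上面的句子當例子(alpha=1就是Laplace smoothing;標籤是假設跟原例子相同):
```python=
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
texts = ['a great game', 'the election was over', 'very clean match',
         'a clean but forgettable game', 'it was a close election']
labels = ['Sports', 'Not sports', 'Sports', 'Sports', 'Not sports']
# token_pattern改成也保留單一字母的字(預設會把'a'丟掉)
vec = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
x = vec.fit_transform(texts)         # 每句話變成詞頻向量
model = MultinomialNB(alpha=1.0)     # alpha=1即Laplace smoothing
model.fit(x, labels)
print(model.predict(vec.transform(['a very close game'])))   # ['Sports']
```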

#### Different probability assumption
* 假設機率都是高斯的常態分佈

##### Example
* 利用一些體態,預測是男生還是女生

* 算平均與標準差

* 下大於上,故猜測是female

#### Code
```python=
import numpy as np
import pandas as pd
import time
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix
df = pd.read_csv("./dataset/titanic/train.csv")
df['Sex'] = pd.Categorical(df['Sex']).codes
df['Embarked'] = pd.Categorical(df['Embarked']).codes
df = df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis = 1)
df = df.dropna(axis = 0, how = 'any')
y = df['Survived']
x = df.drop('Survived', axis = 1)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3)
model = GaussianNB()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print('Number of mislabeled points out of total {} points : {}, performance {:05.2f}%'
      .format(
          x_test.shape[0],
          (y_test != y_pred).sum(),
          100 * (y_test == y_pred).sum() / x_test.shape[0])
      )
accuracy = accuracy_score(y_test, y_pred)
num_correct_sample = accuracy_score(y_test, y_pred, normalize = False)
print('number of correct sample : {}'.format(num_correct_sample))
print('accuracy : {}'.format(accuracy))
```
```
Number of mislabeled points out of total 215 points : 55, performance 74.42%
number of correct sample : 160
accuracy : 0.7441860465116279
```
### Random Forests
* ensemble learning
* 把多個分類器並在一起
* 建立多棵decision tree
* 看哪一個類別多,他就是那個類別

#### Example
* random 選取N個row,得到許多sub_data

* 針對每筆sub_data,再隨機選取k個特徵

* 將這k個特徵,拿去做decision tree(CART)

* 每筆sub_data都這樣做,會得到很多decision tree

* 接著看這n棵樹,判斷哪一個類別比較多,他就是那個類別
#### Code
```python=
import pandas as pd
import numpy as np
from sklearn import preprocessing, neighbors, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
df = pd.read_csv('./dataset/seeds_dataset.csv', header = None)
y = df[7]
x = df.drop(7, axis = 1).copy()
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)
scaler = preprocessing.StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
model = RandomForestClassifier(max_depth = 6, n_estimators =15)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
num_correct_sample = accuracy_score(y_test, y_pred, normalize = False)
print('number_correct_sample : {}'.format(num_correct_sample))
print('accuracy : {}'.format(accuracy))
```
```
number_correct_sample : 39
accuracy : 0.9285714285714286
```
### Support Vector Machine
* 找到一個超平面,讓資料間margin最大
* H3最好,離兩個資料群最近的點最遠

* support vector
* 兩個類別中離分隔超平面最近的點
* support hyperplane
* 通過support vector、與分隔超平面平行的平面
* margin
* 兩條support hyperplane的距離

* xi : 資料特徵
* yi : 類別{+1, -1}




* 在分母很難處理,把它變到分子,求最大值改成求最小值

* Hard margin
* 百分之百不允許有資料越界
* 一定要找到一個平面把兩個類別完全分開
* Soft margin
* 允許有資料越界
* $\epsilon_i$ : 越界多少,越界的程度
* C : 懲罰程度
* 越大代表越不允許越界
* 越小代表越允許越界

#### Linear v.s. nonlinear problems

##### SVM Kernel Trick
* 用某種函數把維度提升
* 提高維度就有可能分類

##### Example
* 二維變三維

* 另外一個最佳化問題比較好算,才返回去算原來的最佳化問題

##### Common Kernel in SVM

#### Multi-class in SVM
* 如果有k個類別
* Method 1 :one-against-rest
* 每一個svm判斷是不是該類別
* 產生k個二元svm來做分類

* Method 2 :one-against-one
* 每一個svm判斷是第m個還是第n個類別
* 共會有$\dfrac{k(k - 1)}{2}$種
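sklearn可以用OneVsRestClassifier / OneVsOneClassifier把二元的SVC包成多元分類(資料集用iris當假設的例子;SVC本身預設就是用one-against-one處理多元分類):
```python=
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC
x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)
ovr = OneVsRestClassifier(SVC(kernel='rbf')).fit(x_train, y_train)  # k個二元SVM
ovo = OneVsOneClassifier(SVC(kernel='rbf')).fit(x_train, y_train)   # k(k-1)/2個二元SVM
print(ovr.score(x_test, y_test), ovo.score(x_test, y_test))
```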

#### Code
```python=
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# 沿用前面Logistic Regression的pima資料集
pima = pd.read_csv('./dataset/pima-indians-diabetes.csv')
df = pima[['pregnant', 'insulin', 'bmi', 'age', 'label']]
x = df[['pregnant', 'insulin', 'bmi', 'age']]
y = df['label']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 1)
scaler = preprocessing.StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
model = SVC(kernel = 'rbf')
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
num_correct_sample = accuracy_score(y_test, y_pred, normalize = False)
print('num_correct_sample : {}'.format(num_correct_sample))
print('accuracy : {}'.format(accuracy))
```
```
num_correct_sample : 160
accuracy : 0.6926406926406926
```
## Classification Unsupervised Learning
### K-means
* k : 把這些資料分成k個類別(因為unsupervised沒類別的概念,所以要自己分類)
* 1.
* 隨機取k筆資料當作初始中心,分成k類
* 再將相近的資料,歸類為該類別
* 2.
* 算出新的重心
* 3.
* 找出同類別的資料的重心
* 4.
* 再重新用重心分類一次

* 5. 不斷repeat上面四個步驟,直到收斂不改變

```python=
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
data = pd.read_csv('./dataset/xclara.csv')
print("Input Data and Shape")
print(data.shape)
f1 = data['V1'].values
f2 = data['V2'].values
x = np.array(list(zip(f1, f2)))
model = KMeans(n_clusters = 3)
model = model.fit(x)
labels = model.predict(x)
centroids = model.cluster_centers_
print('centroids : {}'.format(centroids))
print('prediction on each data : {}'.format(labels))
labels = model.predict(np.array([[12.0, 14.0]]))
print('prediction on data point (12.0, 14.0) : {}'.format(labels))
```
```
Input Data and Shape
(3000, 2)
centroids : [[ 69.92418447 -10.11964119]
[ 40.68362784 59.71589274]
[ 9.4780459 10.686052 ]]
prediction on each data : [2 2 2 ... 0 0 0]
prediction on data point (12.0, 14.0) : [2]
```
### DBSCAN
* Density-based spatial clustering of applications with noise
* 以密度來衡量哪個資料屬於哪個類別
* 優點:他很能抵抗noise data
* 藍色區域密度大被歸為一類,紅色同理

#### Terminology in DBSCAN
* Density : 密度
* 指定某半徑,半徑內能框到的資料數目即為密度
* core point : 核心點
* 以某資料點為圓心、指定的半徑畫一個圓,如果框到的資料點數量超過所設定的最少數量,該點就稱為core point
* border point :
* 框到的資料點數量少於所設定的最少數量,但有框到core point,就稱為border point
* noise point :
* 非core point 且非 border point

#### Directly Reachable
* 如果點q落在某個core point p的半徑範圍內,就稱q可由p directly reachable;透過一連串core point一路連過去可以到達的點,則是density reachable
* core point及其半徑內的border point會被歸為同一類

#### Example

* Directly reachable會被歸為一類

#### problem
* 半徑沒設好會有一些問題

#### Code
```python=
import numpy as np
import pandas as pd
from sklearn import preprocessing, metrics
from sklearn.cluster import DBSCAN
df = pd.read_csv('./dataset/iris.csv', header = None)
df = df.drop(4, axis = 1)
print("Input Data and Shape")
print(df.head())
print(df.shape)
x = np.array(df)
model = DBSCAN(eps = 0.3, min_samples = 5).fit(x)
labels = model.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print('number of clusters : {}'.format(n_clusters))
print('cluster on x {}'.format(labels))
```
```
Input Data and Shape
0 1 2 3
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
(150, 4)
number of clusters : 3
cluster on x [ 0 0 0 0 0 -1 0 0 0 0 0 0 0 0 -1 -1 -1 0 -1 0 -1 0 -1 0
0 0 0 0 0 0 0 -1 -1 -1 0 0 -1 0 0 0 0 -1 0 0 -1 0 0 0
0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 1 -1 1 2 -1
-1 -1 -1 -1 -1 -1 -1 -1 1 1 1 -1 -1 -1 -1 -1 1 1 -1 -1 1 -1 1 1
1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 2 2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 2 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 2]
```
### EM
* Expectation-Maximization
* 藉由迭代,找到機率模型裡的參數
#### Maximum Likelihood Estimation
* 給予一個機率分佈,想要找出一些參數

* 機率連乘會造成數值下溢(underflow),所以取log把連乘變連加

* 估計整棵樹蘋果的平均重量

* 將值帶入

* 取log

* 母體平均大約等於隨機抽樣的平均
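可以用numpy驗證「在常態分佈假設下,讓log-likelihood最大的平均值就是樣本平均」(抽到的重量是假設的數字):
```python=
import numpy as np
samples = np.array([298.0, 305.0, 310.0, 295.0, 302.0])   # 假設抽到的蘋果重量(克)
# 常態分佈、標準差固定時,log-likelihood與 -Σ(x_i - mu)^2 只差一個常數
mus = np.linspace(290, 320, 301)
log_likelihood = [-np.sum((samples - mu) ** 2) for mu in mus]
best_mu = mus[int(np.argmax(log_likelihood))]
print(best_mu, samples.mean())   # 讓likelihood最大的mu ≈ 樣本平均302
```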

#### Likelihood Estimation Table


#### Example
* 欲將資料分為兩個類別

* 隨機給兩個高斯曲線當作初始的機率模型

##### E step
* 一剛開始的狀況,因為是隨便給的,所以分得很爛
* 計算資料隸屬於黃色的比例跟隸屬於藍色的比例
* 最左邊的資料 : 70 % 是黃色類別,30 % 是藍色類別
* 左二 : 40 % 是黃色類別,60 % 是藍色類別
* 左三 : 10 % 是黃色類別,90 % 是藍色類別

##### M step
* 去修正機率模型的參數(平均跟標準差)
* 計算likelihood,更新平均跟標準差,畫出新的高斯分佈
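下面用numpy對一維資料重複做E step與M step的簡化示意(資料跟兩個高斯的初始參數都是假設、隨便給的):
```python=
import numpy as np
def gaussian(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
x = np.array([1.0, 1.2, 0.8, 5.0, 5.3, 4.8])   # 假設的一維資料,看起來有兩群
mu = np.array([0.0, 6.0]); sigma = np.array([1.0, 1.0]); pi = np.array([0.5, 0.5])
for _ in range(20):
    # E step:計算每筆資料隸屬於每個高斯的比例(responsibility)
    r = np.stack([pi[k] * gaussian(x, mu[k], sigma[k]) for k in range(2)], axis=1)
    r = r / r.sum(axis=1, keepdims=True)
    # M step:用responsibility當權重,更新平均、標準差與混合比例
    nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    pi = nk / len(x)
print(mu)   # 會收斂到接近兩群各自的平均(約1與5)
```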

##### 完成

#### 缺點
* 資料可能不是高斯常態分佈
* 可以使用其他分佈
### Different cluster method

## Dimension Reduction
* unsupervised
* 讓資料壓縮
* reduce time complexity
* reduce space complexity
* 增加可視化程度
### Illustration


#### Code
```python=
import pandas as pd
import numpy as np
from sklearn import metrics, mixture, preprocessing
df = pd.read_csv('./dataset/iris.csv', header = None)
df = df.drop(4, axis = 1)
print('Input Data and Shape')
print(df.head())
x = np.array(df)
model = mixture.GaussianMixture(n_components = 3).fit(x)
x_pred = model.predict(x)
print(x_pred)
```
```
Input Data and Shape
0 1 2 3
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 0 2 0 2
2 2 2 0 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0]
```
### SVD
* singular-value decomposition
* 矩陣可拆分

#### Matrix rank
* 第一個row減掉第二個row會等於第三個row
* 代表第三個row是多餘的,代表線性獨立的row只有兩個

#### Example_1



* V : 旋轉矩陣
* $\Sigma$ : 對角矩陣,負責縮放(拉伸或壓扁)
* U : 旋轉矩陣

#### Example_2
* 對電影的評分


* 把最小的singular value砍掉,並把U、V中相對應的column與row一起砍掉


* 相似於原始矩陣

* 留下V
* 把五維轉成二維

#### How many singular values to keep?
* 盡量保持80%以上的資料
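用numpy.linalg.svd可以把矩陣拆開,再只留前幾個singular value做近似,下面是一個小示意(評分矩陣是假設的):
```python=
import numpy as np
a = np.array([[5.0, 5.0, 0.0, 1.0],
              [5.0, 4.0, 0.0, 0.0],
              [0.0, 1.0, 5.0, 4.0],
              [1.0, 0.0, 4.0, 5.0]])   # 假設的使用者x電影評分矩陣
u, s, vt = np.linalg.svd(a, full_matrices=False)
print(s)   # singular values,由大到小
# 看保留前k個singular value能涵蓋多少比例(平方和的累積比例)
print(np.cumsum(s ** 2) / np.sum(s ** 2))   # 取累積到80%以上的k
k = 2
a_approx = u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]   # 只用前k個重建
print(np.round(a_approx, 1))   # 會相似於原始矩陣
```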
### PCA
* principal component analysis
* 把所有資料投影到一組互相垂直的新座標軸上

#### Example

#### Which projection axis is better
* 投影在紅線上好,還是綠線上好
* 投影在紅線上好,離散程度較高
* 較不會有兩點重疊

* 投影在w1軸上,離散程度要最大

* 投影在w2軸上,離散程度盡量大
* 避免跟第一個軸一樣,所以兩軸要垂直

#### PCA concept
* 找一個軸,離散程度最大的
* 再找下一個軸,並確保跟其他軸的內積等於零(互相垂直)
##### Covariance

##### correlation

##### Covariance Matrix




##### Step
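這些步驟可以直接用covariance matrix的特徵分解寫出來,結果會跟sklearn的PCA一致(只差正負號);以下是一個示意,資料是假設亂數產生的:
```python=
import numpy as np
from sklearn.decomposition import PCA
rng = np.random.RandomState(0)
x = rng.randn(100, 3)                      # 假設的資料:100筆、3個feature
# Step 1:每個欄位先減掉平均(中心化)
x_centered = x - x.mean(axis=0)
# Step 2:算covariance matrix,再做特徵分解
cov = np.cov(x_centered, rowvar=False)
eig_vals, eig_vecs = np.linalg.eigh(cov)   # eigh:對稱矩陣,特徵值由小到大
# Step 3:取特徵值最大的幾個特徵向量當新座標軸(彼此互相垂直)
order = np.argsort(eig_vals)[::-1]
w = eig_vecs[:, order[:2]]
x_reduced = x_centered @ w                 # 投影到前兩個主成分
print(np.var(x_reduced, axis=0, ddof=1))   # 離散程度由大到小
print(PCA(n_components=2).fit(x).explained_variance_)   # 跟sklearn的結果比較
```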

#### Example


#### Code(PCA)
```python=
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
df = pd.read_csv('./dataset/seeds_dataset.csv', header = None)
df = df.drop(7, axis = 1)
print(df.head())
x = np.array(df)
pca = PCA(n_components = 3)
pca.fit(x)
x_reduced = pca.transform(x)
print('singular values is {}'.format(pca.singular_values_))
print('after pca, all of data is reduced to 3D')
print(x_reduced)
```
```
0 1 2 3 4 5 6
0 15.26 14.84 0.8710 5.763 3.312 2.221 5.220
1 14.88 14.57 0.8811 5.554 3.333 1.018 4.956
2 14.29 14.09 0.9050 5.291 3.337 2.699 4.825
3 13.84 13.94 0.8955 5.324 3.379 2.259 4.805
4 16.14 14.99 0.9034 5.658 3.562 1.355 5.175
singular values is [47.49531899 21.09635322 3.92284041]
after pca, all of data is reduced to 3D
[[ 6.63448376e-01 -1.41732098e+00 4.12356541e-02]
[ 3.15666512e-01 -2.68922915e+00 2.31726953e-01]
[-6.60499302e-01 -1.13150635e+00 5.27087232e-01]
[-1.05527590e+00 -1.62119002e+00 4.37015260e-01]
[ 1.61999921e+00 -2.18338442e+00 3.33990920e-01]
[-4.76938007e-01 -1.33649437e+00 3.55360614e-01]
[-1.84834720e-01 -1.50364411e-01 1.41497264e-01]
[-7.80629616e-01 -1.12979883e+00 2.79757608e-01]
[ 2.28210810e+00 -1.36001690e+00 -3.50729413e-01]
[ 1.97854147e+00 -1.49468793e+00 -2.93947251e-03]
[ 3.69122947e-01 8.86722511e-01 1.13264978e-01]
[-7.11021200e-01 -2.10663730e+00 1.37552595e-01]
[-1.21370535e+00 9.46878939e-02 4.85809237e-01]
[-1.16908541e+00 -7.42962899e-01 2.58209340e-01]
[-1.19272176e+00 -9.53268162e-01 2.58450639e-01]
[-5.08171207e-01 3.77958424e-01 6.56217572e-01]
[-1.37469698e+00 1.32290559e+00 8.01997838e-01]
[ 1.05726438e+00 -2.01562875e+00 4.33972249e-01]
[-1.50961097e-01 -2.02235813e+00 7.52022386e-01]
[-2.46241293e+00 7.37473835e-02 2.12996833e-01]
[-6.31332100e-01 -7.18305655e-01 -5.43487434e-02]
[-6.89698660e-01 -1.11182531e+00 -1.69906624e-02]
[ 1.40769072e+00 -2.80658086e+00 3.14889098e-01]
[-2.84267672e+00 -2.66880642e+00 -5.42113806e-02]
[ 4.33268215e-01 -1.88984464e+00 1.04814954e-01]
[ 1.81289158e+00 -2.60002176e+00 5.34196985e-02]
[-2.02131332e+00 -6.08743328e-01 1.84384422e-01]
[-2.19571862e+00 -1.49837622e+00 2.35406301e-02]
[-7.44468841e-01 -1.06518721e+00 1.56860725e-01]
[-1.50350480e+00 -3.68206745e-01 -8.59614612e-03]
[-1.52075320e+00 -3.06180225e+00 -1.73200953e-01]
[ 7.61190256e-01 -2.09488759e-01 1.65627857e-01]
[-7.67738428e-01 1.26295451e-01 -1.20622472e-01]
[-8.23965933e-01 -1.70715020e+00 5.32025512e-02]
[ 4.39542396e-01 -1.52858534e+00 -5.57773440e-02]
[ 1.52205298e+00 -1.25609762e+00 1.35317730e-01]
[ 1.65240525e+00 -6.75119440e-01 -3.33822258e-03]
[ 2.47674445e+00 -4.51537548e-01 3.09238192e-01]
[ 1.15750673e-02 -5.96250977e-01 3.52544529e-02]
[-1.11443822e+00 2.83345206e+00 5.68536979e-01]
[-1.37160170e+00 -1.30108591e+00 3.86783538e-02]
[-1.36349513e+00 -1.63960124e+00 7.21418337e-03]
[-1.88302954e+00 -1.51985066e+00 4.14988118e-01]
[ 6.29560566e-01 1.10062048e+00 6.34515325e-03]
[ 2.84412124e-01 -5.60641213e-01 2.98249761e-01]
[-9.60044753e-01 -2.29723526e+00 1.41107629e-01]
[ 8.18964617e-01 -2.26570276e+00 1.53990241e-01]
[ 1.96621301e-01 -7.40666486e-01 2.29592441e-01]
[ 1.53276057e-02 -1.02061146e+00 2.02590478e-01]
[ 2.54235169e-01 -1.55022082e+00 -1.05705755e-01]
[-5.05384531e-01 1.97773604e-01 1.75366006e-01]
[ 7.11973095e-01 1.96593925e+00 5.15648832e-01]
[-3.55829381e-01 3.79579753e-01 -1.55069874e-01]
[-5.66856332e-01 -4.55332330e-01 8.98393269e-02]
[ 1.80968891e-02 -2.21678367e+00 -3.93360924e-01]
[ 4.78424858e-01 -1.71349178e+00 -1.93034894e-01]
[-3.75464636e-01 -9.76631316e-01 3.10946012e-01]
[ 2.83883316e-01 -2.56463849e+00 2.83654402e-01]
[ 7.69429731e-01 -1.63154473e+00 1.52399604e-01]
[-2.77110124e+00 -2.60034941e+00 2.33323254e-01]
[-3.80344820e+00 -1.51695365e+00 2.37149111e-01]
[-4.00534905e+00 -1.97086343e+00 2.03609501e-01]
[-2.87823982e+00 -8.86757382e-01 4.63038181e-01]
[-1.87406423e+00 2.13355284e-01 7.89854003e-02]
[-2.05089330e+00 -2.82503499e+00 1.19739519e-01]
[-2.16820614e+00 -1.67334585e+00 4.54792497e-01]
[-2.59673795e-01 -2.44512020e+00 -6.00299972e-02]
[-7.07084427e-01 -1.59064814e+00 -5.52933377e-02]
[-2.37114217e-01 -2.28118595e+00 -1.49729134e-01]
[-2.28478840e+00 -4.60103809e-01 -1.17460180e-01]
[ 3.16409179e+00 8.04126878e-01 -2.66641999e-01]
[ 2.20955077e+00 1.27850851e+00 -1.57740826e-01]
[ 2.62062165e+00 1.18221241e+00 3.71401616e-02]
[ 4.76779625e+00 -1.57604721e-01 9.29992692e-02]
[ 2.21232336e+00 6.01138596e-01 -1.39544203e-01]
[ 2.07181452e+00 1.50200244e+00 -7.00860345e-02]
[ 2.84275994e+00 5.04000805e-01 -2.39983894e-01]
[ 6.46235442e+00 1.60085825e+00 -1.52794770e-01]
[ 4.47829765e+00 1.97530486e+00 -3.07050833e-01]
[ 2.61486732e+00 -5.12991394e-01 2.02041872e-02]
[ 1.67836371e+00 2.07292392e+00 -4.87254221e-02]
[ 4.03755261e+00 2.14071720e+00 3.50841784e-01]
[ 5.71859177e+00 2.21396348e+00 1.90304067e-01]
[ 5.58807242e+00 -1.50990711e+00 -3.07129758e-01]
[ 5.32256756e+00 -5.11477550e-02 -1.35205623e-01]
[ 4.00732381e+00 -7.29979610e-01 -3.00704179e-01]
[ 4.70505178e+00 -1.45419923e+00 -9.90969343e-02]
[ 4.79038389e+00 6.44967998e-01 -5.69065929e-01]
[ 6.69570804e+00 2.94421273e+00 3.03912945e-01]
[ 6.46037602e+00 2.15264305e+00 2.01205274e-01]
[ 6.14337744e+00 -9.43926428e-01 -4.15476808e-01]
[ 4.39514760e+00 -1.61012778e-02 -1.54841426e-03]
[ 4.46139621e+00 1.12624509e-01 -7.95827310e-02]
[ 3.78494886e+00 2.79030096e+00 3.90137034e-01]
[ 4.01629279e+00 1.80247983e+00 -6.82260736e-01]
[ 2.38051551e+00 3.23431649e-01 -3.38653786e-01]
[ 5.03718740e+00 4.35062776e-01 -1.48395998e-01]
[ 4.92049767e+00 -8.97804759e-01 -6.05902714e-01]
[ 3.94065862e+00 -3.15810290e-01 -4.91695711e-01]
[ 4.53327546e+00 -9.29513362e-01 -2.00722187e-01]
[ 1.65697361e+00 7.28440774e-01 1.41627453e-01]
[ 3.63867456e+00 -1.18043005e+00 6.80825411e-02]
[ 4.97852528e+00 1.24163180e+00 2.63041542e-01]
[ 4.94143745e+00 3.05328003e-01 -2.47908931e-01]
[ 4.63589248e+00 2.70940363e-01 -1.14295115e-01]
[ 4.52405499e+00 -5.83403419e-01 1.38095195e-01]
[ 4.51574435e+00 -2.71311394e-01 -8.74599219e-02]
[ 3.12281151e+00 4.56212973e-01 -8.62751578e-02]
[ 5.83137312e+00 3.30385849e-01 -4.76189658e-01]
[ 4.35718398e+00 -1.41741787e+00 -6.25084495e-02]
[ 4.15756069e+00 -9.50833185e-01 9.65051938e-02]
[ 5.08259001e+00 6.24704579e-01 6.28827972e-02]
[ 4.89139771e+00 -9.82917842e-01 1.28275950e-01]
[ 4.44322173e+00 3.57224575e+00 1.56974171e-01]
[ 6.67155248e+00 1.84060496e+00 9.12010421e-02]
[ 4.90749190e+00 -8.18130733e-01 -2.55684480e-01]
[ 4.37369427e+00 1.17679493e+00 4.41493473e-01]
[ 4.87191916e+00 1.48790820e-02 -9.56013893e-02]
[ 4.44858857e+00 5.06661089e-01 9.35385490e-02]
[ 5.88457254e+00 1.27049077e-01 -1.80453388e-01]
[ 5.68383523e+00 2.94064884e+00 2.60580298e-01]
[ 3.70570300e+00 4.03190365e-01 -1.09288956e-01]
[ 1.48852809e+00 7.87950654e-01 -8.22192480e-02]
[ 3.98326071e+00 -2.15055765e-01 1.49554483e-01]
[ 1.15533338e+00 -2.55675983e-01 5.98162419e-01]
[ 4.23450934e+00 1.03173593e+00 1.64406843e-01]
[ 4.21700578e+00 1.24930690e+00 -1.56980693e-01]
[ 3.62299759e+00 -9.85552845e-01 -1.99064864e-02]
[ 6.17386444e+00 -1.00395914e+00 -1.89332536e-01]
[ 2.71378793e+00 2.00947941e+00 3.92678397e-01]
[ 3.86089703e+00 -3.73513274e-01 8.03597573e-02]
[ 4.61500274e+00 -2.10261231e-01 9.79900380e-02]
[ 5.92114800e-01 8.66320178e-01 -3.00484271e-01]
[ 1.48591457e+00 7.74439749e-01 -1.72450887e-01]
[ 6.90660085e-01 1.38976691e+00 -1.68271877e-01]
[ 5.30993584e-01 -4.14621634e-02 2.10574049e-01]
[ 2.89265049e+00 2.11576887e-01 -2.20244499e-01]
[ 1.10275966e+00 -8.95136177e-01 -5.29279898e-01]
[ 1.08106342e+00 -8.23294917e-01 -3.54727287e-01]
[ 1.58041453e+00 2.92679166e-01 -2.25284288e-01]
[-2.08047727e+00 1.36508853e+00 -1.99716808e-01]
[-2.04891061e+00 3.11005161e+00 -6.38273447e-02]
[-1.93112897e+00 2.06805453e+00 3.42899489e-02]
[-3.14761497e+00 1.38674593e+00 -2.02387013e-02]
[-3.35755431e+00 3.62389743e-01 -2.78959609e-01]
[-4.22241525e+00 1.97238736e+00 -3.44133543e-01]
[-3.55211446e+00 -1.92651406e+00 -3.78402644e-01]
[-2.74248901e+00 3.68275745e-01 9.39425325e-02]
[-2.26028697e+00 -7.15757433e-01 -3.01514983e-01]
[-4.59256683e+00 1.21366537e+00 -4.10485088e-01]
[-3.49115131e+00 1.07925934e+00 -2.38194431e-01]
[-3.44038060e+00 2.89298232e+00 -2.07485411e-01]
[-2.88398331e+00 7.17984884e-01 -3.58805444e-01]
[-3.96469498e+00 -8.67031460e-01 -2.74174690e-01]
[-3.85801218e+00 -1.19622949e-01 -3.44490680e-01]
[-4.23854664e+00 1.60807548e+00 -2.98762740e-01]
[-3.89609307e+00 -8.50339426e-01 -6.58059699e-02]
[-2.98607257e+00 7.68415914e-01 -3.42792920e-01]
[-3.33747668e+00 2.84753275e-01 -5.22692614e-01]
[-3.83091968e+00 1.23660236e+00 -3.79258301e-01]
[-2.36752198e+00 -8.93936559e-01 -5.13938956e-01]
[-3.15770694e+00 1.92519249e-01 -3.20706760e-01]
[-3.23143479e+00 8.85496109e-01 -4.51321791e-02]
[-2.61469342e+00 3.94920094e-01 -8.06497416e-02]
[-4.49824136e+00 2.13615465e+00 6.33923291e-02]
[-2.94329647e+00 -1.88566242e+00 -4.82086609e-02]
[-2.76609970e+00 8.91774133e-01 -1.72181504e-01]
[-2.89907369e+00 -4.09293999e-01 -4.02323115e-01]
[-3.90252444e+00 1.58340894e-01 -2.77227460e-01]
[-3.95458414e+00 -6.73045369e-01 -2.41546685e-01]
[-4.52101870e+00 2.49812224e+00 -2.49745319e-01]
[-4.04118364e+00 2.51581412e+00 1.29405374e-01]
[-4.04672999e+00 1.00738290e-01 -9.09399508e-02]
[-4.03387689e+00 1.39431425e+00 -9.26707967e-02]
[-4.51655795e+00 9.40408717e-01 -4.06201269e-01]
[-4.67880742e+00 4.91842458e-01 -5.80895287e-02]
[-4.15214114e+00 1.12758416e+00 -1.65527388e-01]
[-4.67134302e+00 4.21033845e-01 -1.71844958e-01]
[-4.01783124e+00 1.67978706e+00 3.14971477e-03]
[-2.60793947e+00 -2.37304506e+00 -3.55362015e-01]
[-4.03445958e+00 7.40517346e-01 1.30612374e-01]
[-2.84072766e+00 9.33512708e-01 5.46936118e-02]
[-3.09276593e+00 7.75654915e-01 -5.61725952e-02]
[-3.75632591e+00 1.04703805e+00 -4.29715405e-02]
[-2.41503022e+00 2.20440679e+00 -8.67431609e-02]
[-3.57450320e+00 -7.19443063e-02 -4.01721420e-01]
[-3.37275950e+00 8.03901892e-01 -4.60390433e-01]
[-4.43114913e+00 -7.75918129e-02 -1.41347439e-01]
[-4.55059431e+00 3.26576882e+00 1.99485527e-01]
[-5.00252227e+00 6.36766609e-01 1.71812058e-01]
[-4.55827661e+00 1.13664276e+00 -9.51701336e-02]
[-4.04374701e+00 -2.25814885e-01 -6.87186458e-02]
[-3.36161741e+00 -5.27869965e-01 -4.74331964e-02]
[-4.56094950e+00 5.95574083e-01 -2.83496804e-01]
[-3.11855689e+00 3.31944329e-02 3.57435656e-02]
[-2.52946420e+00 8.37109623e-01 3.64003383e-01]
[-2.58655448e+00 1.44846001e+00 3.01071387e-01]
[-1.83336074e+00 7.30693089e-01 2.15840207e-01]
[-2.36059318e+00 -6.86820620e-01 -2.53316928e-01]
[-2.35819688e+00 -1.20488410e+00 3.56048199e-01]
[-2.97998867e+00 1.39806029e+00 1.31248700e-01]
[-2.41873891e+00 -1.74952111e+00 4.09312427e-01]
[-4.21924202e+00 -1.94251854e-01 1.18503747e-01]
[-3.08869155e+00 4.37638669e+00 4.99058361e-01]
[-2.78962325e+00 -1.41941777e-01 5.02954010e-02]
[-3.04187227e+00 -4.73126171e-01 1.95045363e-01]
[-4.10906270e+00 1.09340872e-01 -8.74005598e-02]
[-2.50003394e+00 4.30796502e+00 5.32818431e-01]
[-3.33207854e+00 -5.25289746e-01 -9.81079349e-02]
[-3.10755116e+00 1.54975743e+00 1.21282793e-01]]
```
## Kaggle 實戰
https://www.kaggle.com/c/titanic
### Goal
It is your job to predict if a passenger survived the sinking of the Titanic or not.
For each in the test set, you must predict a 0 or 1 value for the variable.
### Metric
Your score is the percentage of passengers you correctly predict. This is known as accuracy.
### Submission File Format
You should submit a csv file with exactly 418 entries plus a header row. Your submission will show an error if you have extra columns (beyond PassengerId and Survived) or rows.
The file should have exactly 2 columns:
PassengerId (sorted in any order)
Survived (contains your binary predictions: 1 for survived, 0 for deceased)
```
PassengerId,Survived
892,0
893,1
894,0
Etc.
```
### 匯入資料
```python=
import pandas as pd
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
combine = [train_df, test_df]
```
### 觀察資料
#### 看header
```python=
print(train_df.columns.values)
```
#### Print a concise summary of a DataFrame.
```python=
train_df.info()
print('_'*40)
test_df.info()
```
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId 418 non-null int64
Pclass 418 non-null int64
Name 418 non-null object
Sex 418 non-null object
Age 332 non-null float64
SibSp 418 non-null int64
Parch 418 non-null int64
Ticket 418 non-null object
Fare 417 non-null float64
Cabin 91 non-null object
Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
```
#### Generate descriptive statistics.
```python=
train_df.describe()
```

#### 評分標準
```python=
round(model.score(X_train, Y_train) * 100, 2)
```
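前面沒有列出完整的建模流程,下面是一個最簡單能產生預測的示意(選用的欄位與補值方式都是假設,可換成前面章節介紹的任何模型):
```python=
import pandas as pd
from sklearn.linear_model import LogisticRegression
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
features = ['Pclass', 'Sex', 'Age', 'Fare']   # 假設只用這幾個欄位
for df in (train_df, test_df):
    df['Sex'] = pd.Categorical(df['Sex']).codes             # 類別轉數字
    df['Age'] = df['Age'].fillna(train_df['Age'].median())  # 缺失值補中位數
    df['Fare'] = df['Fare'].fillna(train_df['Fare'].median())
X_train = train_df[features]
Y_train = train_df['Survived']
X_test = test_df[features]
model = LogisticRegression(max_iter=1000)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)
print(round(model.score(X_train, Y_train) * 100, 2))   # 前面提到的評分方式
```
接著就可以用下面「匯出CSV」的方式產生submission檔。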
### 匯出CSV
```python=
submission = pd.DataFrame({
    "PassengerId": test_df["PassengerId"],
    "Survived": Y_pred
})
submission.to_csv('submission.csv', index=False)
```