# Predicting the Survival of Titanic Passengers
#### This is one of the assignments from my graduate ML course: given information about the Titanic's passengers, predict who survived. Whoever achieves the highest accuracy is exempt from the final report.
# Import the kits we need to use
```python=
import pandas as pd
from sklearn.tree import DecisionTreeClassifier  # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split  # Import train_test_split function
from sklearn import metrics  # Import scikit-learn metrics module for accuracy calculation
from sklearn.linear_model import LogisticRegression
```
# load dataset
```python=
col_names = ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Family']
pima = pd.read_csv("new_titanic.csv", header=0, names=col_names)
pima.head()
```

#### I preprocessed the data first, e.g. Sex: male=0, female=1; Embarked: S=0, C=1, Q=2, else=3; Cabin: blank=0, A=1, B=2, C=3, D=4, E=5, F=6, G=7, other=8. The 'Age' column also has about 10% blanks, so I filled them with the mean of the non-blank values.
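For reference, a minimal pandas sketch of the same preprocessing (assuming the raw Kaggle titanic.csv with the standard column names; the assignment itself used the hand-rolled csv loop below):
```python=
import pandas as pd

df = pd.read_csv('titanic.csv')
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})                       # male=0, female=1
df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).fillna(3)   # anything else=3
df['Age'] = df['Age'].fillna(df['Age'].mean())                            # fill ~10% blanks with the mean
```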
# Feature processing
### Open the CSV file
```python=
import csv

with open('titanic.csv', newline='') as csvfile:
    # Read the CSV file contents
    rows = csv.reader(csvfile)
    sum_age = 0
    number = 0
```
#### Compute the average "Age"
```python=
for row in rows:
    if row[5] == "" or row[5] == 'Age':  # skip blanks and the header row
        continue
    else:
        sum_age = float(row[5]) + sum_age
        number = number + 1
print(sum_age, number, sum_age / number)
```
##### (Age: 714 rows have a value; the sum is 21205.17 and the mean is 29.69911764705882.)
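As a sanity check, the same numbers can be reproduced with pandas (assuming the raw titanic.csv):
```python=
import pandas as pd

ages = pd.read_csv('titanic.csv')['Age'].dropna()
print(len(ages), ages.sum(), ages.mean())  # 714 21205.17 29.69911764705882
```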
##### Encode Sex; row[4] is the 'Sex' column. (This and the following snippets run inside a second `for row in rows:` pass, which writes the processed values out via `writer`.)
```python=
if row[4] == 'Sex':  # header row: copy it and append the new 'Family' column
    dat = [row[0], row[1], row[2], row[3], row[4], row[5], row[6], row[7],
           row[8], row[9], row[10], row[11], "Family"]
    writer.writerow(dat)  # writer is the csv.writer for the output file
    continue
elif row[4] == 'male':
    sex = "0"
else:
    sex = "1"
```
### Encode Embarked
##### row[11] is the 'Embarked' column. Embarked (port of embarkation) takes the three values 'S', 'C', and 'Q', so I simply map them to 0, 1, and 2, with anything else becoming 3.
```python=
if row[11] == 'S':
    Embarked = 0
elif row[11] == 'C':
    Embarked = 1
elif row[11] == 'Q':
    Embarked = 2
else:
    Embarked = 3
```
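The same mapping can also be written as a dict lookup; a small sketch (encode_embarked is my own helper name):
```python=
def encode_embarked(value):
    # dict.get falls back to 3 for anything other than 'S', 'C', 'Q'
    return {'S': 0, 'C': 1, 'Q': 2}.get(value, 3)

print(encode_embarked('S'), encode_embarked('Q'), encode_embarked(''))  # 0 2 3
```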
### Encode Cabin
##### row[10] is the 'Cabin' column. Some Cabin (cabin number) values start with a letter and others with a digit, so I assume the Titanic distinguished cabins by their leading character, and I split them into 9 classes (blank, A, B, ..., G, and everything else).
```python=
if row[10][:1] == '':
    Cabin = 0
elif row[10][:1] == 'A':
    Cabin = 1
elif row[10][:1] == 'B':
    Cabin = 2
elif row[10][:1] == 'C':
    Cabin = 3
elif row[10][:1] == 'D':
    Cabin = 4
elif row[10][:1] == 'E':
    Cabin = 5
elif row[10][:1] == 'F':
    Cabin = 6
elif row[10][:1] == 'G':
    Cabin = 7
else:
    Cabin = 8
```
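The long if/elif chain can be collapsed by comparing the leading character directly; a sketch (encode_cabin is my own helper name):
```python=
def encode_cabin(value):
    # blank -> 0, A..G -> 1..7, anything else (e.g. a leading digit) -> 8
    first = value[:1]
    if first == '':
        return 0
    if 'A' <= first <= 'G':
        return ord(first) - ord('A') + 1
    return 8

print(encode_cabin(''), encode_cabin('C85'), encode_cabin('123'))  # 0 3 8
```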
### Fill in the blank Age entries
```python=
if row[5] == "":
    age = 29.69911764705882  # the mean computed above
else:
    age = row[5]
```
### Encode Ticket
##### row[8] is the 'Ticket' column. Some Ticket (ticket number) values have a letter prefix; I consider the trailing digits the important part, so I strip the prefix as noise.
```python=
import re

tickets = row[8]
tickets = re.sub(r"[A-Za-z,./]", "", tickets)  # drop letters and punctuation
tickets = tickets.replace(" ", "")
if tickets == "":
    tickets = 0.0  # tickets such as 'LINE' contain no digits at all
tickets = float(tickets)
print(tickets)
```
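For example, on a couple of typical ticket strings the cleanup behaves like this (clean_ticket is my own wrapper around the same logic):
```python=
import re

def clean_ticket(ticket):
    # keep only the digits, treating everything else as noise
    digits = re.sub(r"\D", "", ticket)
    return float(digits) if digits else 0.0

print(clean_ticket('A/5 21171'))  # 521171.0
print(clean_ticket('LINE'))       # 0.0 (no digits at all)
```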
### Encode Name
##### row[3] is the 'Name' column. I split Name into the three title classes Mrs., Mr., and Miss.
```python=
if "Mrs." in row[3]:
    row3 = 1
elif "Mr." in row[3]:
    row3 = 2
elif "Miss" in row[3]:
    row3 = 3
else:
    row3 = 0
```
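An alternative is to pull the title out with a regex, since names in this dataset follow the 'Lastname, Title. Firstname' pattern; a sketch (extract_title is my own helper name):
```python=
import re

def extract_title(name):
    # grab the word between the comma and the following period
    match = re.search(r",\s*([A-Za-z]+)\.", name)
    return match.group(1) if match else ""

print(extract_title("Braund, Mr. Owen Harris"))  # Mr
print(extract_title("Heikkinen, Miss. Laina"))   # Miss
```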
## ==Split dataset into features and target variable==
#### After trying many feature combinations, I found that this set of features gives the best result!
```python=
feature_cols = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin', 'Embarked']
X = pima[feature_cols]  # Features
y = pima.Survived       # Target variable
```
## Split dataset into training set and test set
```python=
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)  # 70% training and 30% test
print(y.shape)
```

# I chose to use the XGBoost model
```python=
from xgboost import XGBClassifier
from xgboost import plot_importance

clf = XGBClassifier(learning_rate=0.101,
                    n_estimators=200,      # number of trees
                    max_depth=3,           # depth of each tree
                    min_child_weight=4,    # minimum summed weight in a leaf node
                    gamma=0.,              # penalty coefficient on the number of leaf nodes
                    subsample=0.8,         # build each tree on a random 80% of the samples
                    colsample_bytree=0.8,  # build each tree on a random 80% of the features
                    scale_pos_weight=1,    # compensates for class imbalance
                    random_state=0)        # random seed
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
```
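Since plot_importance is already imported, the fitted model's feature importances can be drawn directly (assuming matplotlib is installed):
```python=
import matplotlib.pyplot as plt

plot_importance(clf)  # bar chart of each feature's importance score
plt.show()
```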
# Model Accuracy
```python=
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
```

# ==Confusion matrix==
```python=
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print('Confusion matrix:', cm)
print(classification_report(y_test, y_pred))
```

#### As the output above shows, precision, recall, and f1-score all look decent, and the accuracy reaches 87%.
#### From this assignment I learned that not every feature helps accuracy, and splitting a feature more finely can even lower it. So when doing ML, I should pay attention to every small detail and to the combinations of feature engineering, features, and parameters to get the best results!