# Predicting the Survival of Titanic Passengers

#### This was one of the assignments in my graduate ML course: given information about the Titanic's passengers, predict who survived. Whoever reached the highest accuracy was excused from the final report.

# Import the kits we need to use

```python=
import pandas as pd
from sklearn.tree import DecisionTreeClassifier        # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split   # Import train_test_split function
from sklearn import metrics                            # Import scikit-learn metrics module for accuracy calculation
from sklearn.linear_model import LogisticRegression
```

# Load dataset

```python=
col_names = ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
             'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Family']
pima = pd.read_csv("new_titanic.csv", header=0, names=col_names)
pima.head()
```

![](https://i.imgur.com/36h0oNB.png)

#### I preprocessed the data first, e.g. Sex: male=0, female=1; Embarked: S=0, C=1, Q=2, other=3; Cabin: NaN=0, A=1, B=2, C=3, D=4, E=5, F=6, G=7, other=8. In addition, about 20% of the "Age" column is blank (only 714 of the 891 rows have a recorded age), so I filled the blanks with the average of all the recorded ages.

# Feature engineering

### Open and read the CSV file

```python=
import csv

csvfile = open('titanic.csv', newline='')  # keep the handle open for the passes below
rows = csv.reader(csvfile)                 # read the CSV contents row by row
```

#### Compute the average "Age"

```python=
sum_age = 0
number = 0
for row in rows:
    if row[5] == "" or row[5] == 'Age':  # skip blanks and the header row
        continue
    else:
        sum_age = float(row[5]) + sum_age
        number = number + 1
print(sum_age, number, sum_age / number)
```

##### (Age: 714 recorded values summing to 21205.17, for an average of 29.69911764705882)

##### Encode Sex; row[4] is the position of the "Sex" column. The snippets from here on are excerpts from a second loop over the rows; `writer` is a `csv.writer` that writes the processed rows out to `new_titanic.csv`.

```python=
if row[4] == 'Sex':  # header row: append the new "Family" column name and write it out
    dat = [row[0], row[1], row[2], row[3], row[4], row[5], row[6], row[7],
           row[8], row[9], row[10], row[11], "Family"]
    writer.writerow(dat)
    continue
elif row[4] == 'male':
    sex = "0"
else:
    sex = "1"
```

### Encode Embarked

##### row[11] is the position of the "Embarked" column. Embarked (port of embarkation) takes only the three values 'S', 'C', and 'Q', so I simply mapped them to 0, 1, and 2.

```python=
if row[11] == 'S':
    Embarked = 0
elif row[11] == 'C':
    Embarked = 1
elif row[11] == 'Q':
    Embarked = 2
else:
    Embarked = 3
```

### Encode Cabin

##### row[10] is the position of the "Cabin" column. Some cabin numbers start with a letter and some with a digit, so I assumed the Titanic distinguished cabins by that leading character and split the column into 9 categories (blank, A, B, ..., G, everything else).

```python=
if row[10][:1] == '':
    Cabin = 0
elif row[10][:1] == 'A':
    Cabin = 1
elif row[10][:1] == 'B':
    Cabin = 2
elif row[10][:1] == 'C':
    Cabin = 3
elif row[10][:1] == 'D':
    Cabin = 4
elif row[10][:1] == 'E':
    Cabin = 5
elif row[10][:1] == 'F':
    Cabin = 6
elif row[10][:1] == 'G':
    Cabin = 7
else:
    Cabin = 8
```

### Fill in the blank "Age" values

```python=
if row[5] == "":
    age = 29.69911764705882  # the average computed above
else:
    age = row[5]
```

### Encode Ticket

##### row[8] is the position of the "Ticket" column. Some ticket numbers carry a letter prefix; I considered the trailing digits more important, so I stripped the letters out as noise.

```python=
import re

tickets = re.sub(r"[^0-9]", "", row[8])  # keep only the digits
if tickets == "":
    tickets = 0.0
tickets = float(tickets)
print(tickets)
```

### Encode Name

##### row[3] is the position of the "Name" column. I split Name into the three title categories Mrs., Mr., and Miss (everything else maps to 0).

```python=
row3 = 0
if "Mrs." in row[3]:
    row3 = 1
elif "Mr." in row[3]:
    row3 = 2
elif 'Miss' in row[3]:
    row3 = 3
else:
    row3 = 0
```
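#### The row-by-row loop above can also be written as a handful of vectorized pandas operations. The sketch below is not the assignment's code, just a minimal equivalent under the assumption that `titanic.csv` carries the standard Kaggle column names; it skips the extra `Family` column, since its computation isn't shown in the excerpts.

```python=
import pandas as pd

df = pd.read_csv("titanic.csv")  # assumed: the raw Kaggle training file

# Sex: male=0, female=1
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# Embarked: S=0, C=1, Q=2, anything else (including NaN)=3
df["Embarked"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2}).fillna(3).astype(int)

# Cabin: leading letter A-G -> 1-7, missing -> 0, any other prefix -> 8
first = df["Cabin"].str[0]  # NaN stays NaN
df["Cabin"] = first.map({c: i + 1 for i, c in enumerate("ABCDEFG")})
df.loc[first.notna() & df["Cabin"].isna(), "Cabin"] = 8
df["Cabin"] = df["Cabin"].fillna(0).astype(int)

# Age: fill the blanks with the mean of the recorded ages
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Ticket: keep only the digits, empty -> 0
df["Ticket"] = (df["Ticket"].astype(str)
                  .str.replace(r"[^0-9]", "", regex=True)
                  .replace("", "0")
                  .astype(float))

# Name -> title category: Mrs.=1, Mr.=2, Miss=3, other=0
def title_code(name):
    if "Mrs." in name:
        return 1
    if "Mr." in name:
        return 2
    if "Miss" in name:
        return 3
    return 0

df["Name"] = df["Name"].map(title_code)

df.to_csv("new_titanic.csv", index=False)  # same output file the loading cell reads back
```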
## ==Split dataset into features and target variable==

```python=
feature_cols = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin', 'Embarked']
X = pima[feature_cols]  # Features
y = pima.Survived       # Target variable
```

#### After trying several feature combinations, I found that this one gives the best results!!

## Split dataset into training set and test set

```python=
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)  # 70% training and 30% test
print(y.shape)
```

![](https://i.imgur.com/EITpREc.png)

# I chose to use xgboost's model

```python=
from xgboost import XGBClassifier
from xgboost import plot_importance

clf = XGBClassifier(learning_rate=0.101,
                    n_estimators=200,      # number of trees
                    max_depth=3,           # depth of each tree
                    min_child_weight=4,    # minimum leaf-node weight
                    gamma=0.,              # coefficient on the leaf count in the penalty term
                    subsample=0.8,         # build each tree on a random 80% of the samples
                    colsample_bytree=0.8,  # build each tree on a random 80% of the features
                    scale_pos_weight=1,    # compensates for class imbalance
                    random_state=0         # random seed
                    )
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
```

# Model Accuracy

```python=
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
```

![](https://i.imgur.com/dDQQAyP.png)

# ==Confusion Matrix==

```python=
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print('Confusion matrix:', cm)
print(classification_report(y_test, y_pred))
```

![](https://i.imgur.com/C7VdvxX.png)

#### As the figure above shows, precision, recall, and f1-score all perform well, and the accuracy reaches 87%.

#### What I learned from this assignment: not every feature helps accuracy, and splitting a feature into finer categories can even lower it. So when doing ML, I should keep an eye on every small detail and on the combinations of feature engineering, features, and parameters to get the best results!
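#### Two quick follow-ups to that observation about features: `plot_importance` (imported earlier but never called) shows which columns the fitted model actually leans on, and cross-validation gives a steadier read on the accuracy than a single 70/30 split. A minimal sketch, assuming `clf`, `X`, and `y` from the cells above are still in scope (matplotlib is the only extra dependency):

```python=
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from xgboost import plot_importance

# Bar chart of per-feature importance scores for the fitted model
plot_importance(clf)
plt.tight_layout()
plt.show()

# 5-fold cross-validated accuracy; cross_val_score clones the estimator,
# so the tuned hyperparameters of clf are reused on each fold
scores = cross_val_score(clf, X, y, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```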