Titanic - Machine Learning from Disaster: Titanic Survival Prediction, Data Analysis Edition [Data Analysis & Machine Learning Series]
===
###### tags `Data Analysis`,`Machine Learning`,`Kaggle`,`Titanic`,`Titanic Survival Prediction`

This article uses **train.csv** from the Kaggle Titanic survival-prediction competition to walk you step by step through a basic data analysis, covering **data exploration, feature engineering, modeling, and trying the model out yourself**. To make it quick to pick up, much of the more complex statistics is left out, so let's step into the world of data through hands-on practice :smile:

To emphasize: this article only uses **train.csv** to take you through one pass of data analysis. If you want to run the data-analysis workflow once and get to know it through a fun hands-on exercise, this article is a good fit. If you want to optimize further, skim this article and then search online; there are many articles that match your needs, and some links are provided in the Further Reading section at the bottom.

[Image source](https://www.history.com/.image/t_share/MTc2NTQ1ODM1NDQwNDgyMDU4/sinking-of-the-titanic-gettyimages-542907919-1.jpg)
![image alt](https://www.history.com/.image/t_share/MTc2NTQ1ODM1NDQwNDgyMDU4/sinking-of-the-titanic-gettyimages-542907919-1.jpg)

## Getting to Know the Data
Before we analyze anything, head over to Kaggle to inspect and download the data.
[Download train.csv](https://www.kaggle.com/c/titanic/data)
![](https://i.imgur.com/5FWGkoh.png)

Now open the Data Dictionary on the Data page for a quick look at what we have: passenger class (Pclass), sex (Sex), age (Age), number of siblings and spouses aboard (SibSp), number of parents and children aboard (Parch), ticket number (Ticket), fare paid (Fare), port of embarkation (Embarked), cabin number (Cabin), and so on. We will use these fields to **predict whether a passenger survived the Titanic disaster**.
![](https://i.imgur.com/fxm5zq2.png)

## Loading Libraries and Importing the Data
```
# Import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the data
train_data = pd.read_csv('train.csv')
```

## Exploratory Data Analysis
1. Print the first ten rows to get a feel for the columns.
```
# Print the first 10 rows of the data
train_data.head(10)
```
![](https://i.imgur.com/FBsYOgH.png)

2. Check how many rows and columns the data has.
```
# Count the number of rows and columns in the dataset
train_data.shape
```
![](https://i.imgur.com/5kSzrbE.png)
:::info
There are 891 rows and 12 columns (features).
:::

3. Look at the summary statistics.
```
# Get some statistics
train_data.describe()
```
![](https://i.imgur.com/cCPuNtG.png)
:::info
Age is missing some values, and we can read off each column's minimum, maximum, mean, and so on. For example, someone paid 512.329 British pounds to board the ship, while the minimum fare paid was 0.
:::

4. See how many of the 891 passengers actually survived.
```
# Get a count of the number of survivors
train_data['Survived'].value_counts()

# Visualize the count of survivors
# (newer seaborn versions require keyword arguments here)
sns.countplot(x='Survived', data=train_data)
```
![](https://i.imgur.com/YVaC1lE.png)
![](https://i.imgur.com/X5Djx8v.png)
:::info
Only **342** passengers survived.
:::

5. Go one step further and use simple charts to see who survived the disaster across columns such as sex, class, and port of embarkation.
```
# Visualize the count of survivors for the columns 'Sex', 'Pclass', 'SibSp', 'Parch', 'Embarked'
cols = ['Sex','Pclass','SibSp','Parch','Embarked']

n_rows = 2
n_cols = 3

# The subplot grid and figure size of each graph
fig, axs = plt.subplots(n_rows, n_cols, figsize = (n_cols * 3.2, n_rows * 3.2))

for r in range(0, n_rows):
    for c in range(0, n_cols):
        i = r * n_cols + c  # index to walk through the list of columns
        if i < len(cols):
            ax = axs[r][c]  # where to position each subplot
            sns.countplot(x=cols[i], hue='Survived', data=train_data, ax=ax)
            ax.set_title(cols[i])
            ax.legend(title='survived', loc='upper right')

plt.tight_layout()
```
![](https://i.imgur.com/Sy3xlLU.png)
:::info
Roughly speaking, we can observe that
* women had a better chance of survival than men
* first-class passengers were more likely to survive
* passengers traveling with siblings or a spouse may have had better odds
* passengers traveling with parents or children had better odds
* passengers who embarked at port S may have tended to hold lower-class tickets, with a lower chance of survival
:::
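As an optional extra, a quick sketch with `pd.crosstab(..., normalize='index')` puts rough numbers behind these bullets, reporting the survival rate within each category of the five columns we just plotted:

```
# Optional: survival rate within each category of the plotted columns
cols = ['Sex', 'Pclass', 'SibSp', 'Parch', 'Embarked']
for col in cols:
    # Rows are categories, columns are Survived (0/1);
    # normalize='index' turns the counts in each row into rates
    rates = pd.crosstab(train_data[col], train_data['Survived'], normalize='index')
    print(rates[1].rename(col + ' survival rate'))
    print()
```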
6. Look at the survival rate by sex.
```
# Look at survival rate by sex
train_data.groupby('Sex')[['Survived']].mean()
```
![](https://i.imgur.com/4Sg2Dbs.png)
:::info
We can see that
* the female survival rate is about 74.2%
* the male survival rate is only about 18.9%
:::

6.1 Sex + class
```
# Look at survival rate by sex and class
train_data.pivot_table('Survived', index = 'Sex', columns = 'Pclass')
```
![](https://i.imgur.com/urVa1Xp.png)
:::info
Women in first class survived at a rate as high as 96.8%. By contrast, men in third class survived at a rate of only 13.54%.
:::

6.2 Sex + age + class
```
# Look at survival rate by sex, age and class
age = pd.cut(train_data['Age'], [0, 18, 80])
train_data.pivot_table('Survived', ['Sex', age], 'Pclass')
```
![](https://i.imgur.com/VbfdO4j.png)
:::info
Women in first and second class survived at roughly 90% or better regardless of age. By contrast, men older than 18 in third class survived at a rate of only 13.36%.

Summing up, **we can infer that in a disaster, women and children had a better chance of surviving than men, and a higher class (first class) raised the odds further**.
:::

7. At this point I want to check what each class cost, to see whether I might nudge my survival odds up a bit XDD
```
# Plot the price paid for each class
plt.scatter(train_data['Fare'], train_data['Pclass'], color = 'purple', label = 'Passenger Paid')
plt.ylabel('Class')
plt.xlabel('Price / Fare')
plt.title('Price of Each Class')
plt.legend()
plt.show()

# Someone paid over 500 pounds for first class, and it looks like
# in every class somebody paid zero pounds, which is interesting.
```
![](https://i.imgur.com/wEdb7iF.png)
:::info
In every class somebody boarded for 0 pounds, while someone paid more than 500 pounds for first class. Quite a fun finding!
:::

## Feature Engineering
### Data Cleaning
1. Are there missing values?
```
# Count the empty values in each column
train_data.isna().sum()
```
![](https://i.imgur.com/05JrA2k.png)
:::info
Age, Cabin, and Embarked all have missing values; Cabin alone is missing 687 entries.
:::

2. Handle the missing values and drop the columns we don't need.
* Of the current columns, PassengerId, Name, and Ticket (the ticket number) will not be used in the analysis, and Cabin has too many missing values, so we simply drop these columns.
:::warning
Here we use one of the common strategies for missing values: drop every row that contains a missing value.
:::
```
# Drop the columns
train_data = train_data.drop(['Cabin', 'Name', 'PassengerId'], axis = 1)
train_data = train_data.drop(['Ticket'], axis = 1)

# Remove the rows with missing values
train_data = train_data.dropna(subset = ['Embarked', 'Age'])
```

2.1 Check how many rows and columns (features) remain after cleaning.
```
# Count the new number of rows and columns in the dataset
train_data.shape
```
![](https://i.imgur.com/7yJHbiV.png)
:::info
The data now has **712** rows and **8** columns (features).
:::

### Encoding
3. Look at the data types of these 8 columns (features). The models only accept numeric data, so this is worth a quick check.
```
# Look at the data types
train_data.dtypes
```
![](https://i.imgur.com/OUU060d.png)
:::info
'Sex' and 'Embarked' are **non-numeric**, so we need to encode them.
:::

3.1 See what values these two columns hold.
```
# Print the unique values in the columns
print(train_data['Sex'].unique())
print(train_data['Embarked'].unique())
```
![](https://i.imgur.com/BRyuvk2.png)
:::info
The 'Sex' column holds ['male', 'female'].
The 'Embarked' column holds ['S', 'C', 'Q'].
:::

3.2 Here we use a label encoder to handle them (an alternative one-hot sketch appears after step 4 below).
```
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()

# Encode the 'Sex' column (position 2 in the column order after the drops above)
train_data.iloc[:, 2] = labelencoder.fit_transform(train_data.iloc[:, 2].values)

# Encode the 'Embarked' column (position 7)
train_data.iloc[:, 7] = labelencoder.fit_transform(train_data.iloc[:, 7].values)
```
:::warning
**Why encode?**
Purpose: convert categorical or text data into numbers so that the program can understand and compute with it.
:::

3.3 Look at the two columns again.
```
# Print the unique values in the columns
print(train_data['Sex'].unique())
print(train_data['Embarked'].unique())
```
![](https://i.imgur.com/2IdgzkM.png)
:::info
'Sex': ['male', 'female'] -> [1, 0]
'Embarked': ['S', 'C', 'Q'] -> [2, 0, 1]
:::

3.4 Check all the column data types once more.
```
train_data.dtypes
```
![](https://i.imgur.com/dvWU0iS.png)
:::info
All columns are numeric now!!!! We can feed them straight into the models for prediction.
:::

4. Prepare the data: X is the passenger information we would have on hand in the future, and y is Survived (whether the passenger survived the wreck).
```
# Split the data into independent 'X' and dependent 'y' variables
X = train_data.iloc[:, 1:8].values
y = train_data.iloc[:, 0].values
```
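A small robustness note on the step above: selecting by integer position with `iloc` silently breaks if the column order ever changes. With the column layout we have at this point, a name-based equivalent is the following sketch:

```
# Name-based alternative to the positional selection above (same result here)
X = train_data.drop('Survived', axis = 1).values  # every column except the target
y = train_data['Survived'].values                 # the target column
```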
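One more aside on the encoding in step 3.2 before we split the data. `LabelEncoder` maps 'Embarked' to the integers 0, 1, 2, which implies an ordering (C < Q < S) that the ports do not actually have; tree models shrug this off, but linear models such as the logistic regression used later can be misled. A common alternative is one-hot encoding; a minimal sketch, assuming it is run on the data before the label-encoding step (the rest of this article keeps the label-encoded version), looks like this:

```
# Sketch: one-hot encode 'Embarked' instead of label-encoding it.
# Assumes 'Embarked' still holds the raw strings 'S'/'C'/'Q' at this point.
encoded = pd.get_dummies(train_data, columns = ['Embarked'])
print(encoded.columns.tolist())  # 'Embarked' becomes Embarked_C / Embarked_Q / Embarked_S
```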
5. Split the existing data: 80% for the training set and 20% for the test set.
```
# Split the dataset into 80% training and 20% testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
```

### Data Standardization
```
# Scale the data
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
```
:::warning
**Why standardize?**
The data is made up of different columns and values, and their **distributions may all differ**, so we **rescale the features proportionally to make the values fall within a specific range**.
:::

## Modeling
1. **A function that builds several basic classification models**
```
# Create a function that trains several machine learning models
def models(X_train, y_train):

    # Use Logistic Regression
    from sklearn.linear_model import LogisticRegression
    log = LogisticRegression(random_state = 0)
    log.fit(X_train, y_train)

    # Use KNeighbors
    from sklearn.neighbors import KNeighborsClassifier
    knn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
    knn.fit(X_train, y_train)

    # Use SVC (linear kernel)
    from sklearn.svm import SVC
    svc_lin = SVC(kernel = 'linear', random_state = 0)
    svc_lin.fit(X_train, y_train)

    # Use SVC (RBF kernel)
    svc_rbf = SVC(kernel = 'rbf', random_state = 0)
    svc_rbf.fit(X_train, y_train)

    # Use GaussianNB
    from sklearn.naive_bayes import GaussianNB
    gauss = GaussianNB()
    gauss.fit(X_train, y_train)

    # Use Decision Tree
    from sklearn.tree import DecisionTreeClassifier
    tree = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
    tree.fit(X_train, y_train)

    # Use the RandomForestClassifier
    from sklearn.ensemble import RandomForestClassifier
    forest = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
    forest.fit(X_train, y_train)

    # Print the training accuracy for each model
    print('[0]Logistic Regression Training Accuracy:', log.score(X_train, y_train))
    print('[1]K Neighbors Training Accuracy:', knn.score(X_train, y_train))
    print('[2]SVC Linear Training Accuracy:', svc_lin.score(X_train, y_train))
    print('[3]SVC RBF Training Accuracy:', svc_rbf.score(X_train, y_train))
    print('[4]Gaussian NB Training Accuracy:', gauss.score(X_train, y_train))
    print('[5]Decision Tree Training Accuracy:', tree.score(X_train, y_train))
    print('[6]Random Forest Training Accuracy:', forest.score(X_train, y_train))

    return log, knn, svc_lin, svc_rbf, gauss, tree, forest
```

2. **Training-set results**
```
# Train all of the models
model = models(X_train, y_train)
```
![](https://i.imgur.com/N6arXth.png)
:::info
'Decision Tree' and 'Random Forest' train well here, with training-set accuracy above 95%.
:::
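Training accuracy this high is a classic sign of overfitting, since a decision tree can memorize the training set. As an optional sanity check, 5-fold cross-validation on the training data gives a less optimistic estimate before we ever touch the test set; a minimal sketch:

```
# Optional: 5-fold cross-validation on the training data as an overfitting check
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

cv_tree = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
scores = cross_val_score(cv_tree, X_train, y_train, cv = 5)
print('Decision Tree CV accuracy: {:.3f} (+/- {:.3f})'.format(scores.mean(), scores.std()))
```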
3. Predict on the **test set, with the confusion matrix and the accuracy score**.
```
# Show the confusion matrix and accuracy for all of the models on the test data
from sklearn.metrics import confusion_matrix

for i in range(len(model)):
    cm = confusion_matrix(y_test, model[i].predict(X_test))

    # Unpack TN, FP, FN, TP from the confusion matrix
    TN, FP, FN, TP = cm.ravel()
    test_score = (TP + TN) / (TN + TP + FP + FN)

    print(cm)
    print('Model[{}] Testing Accuracy = "{}"'.format(i, test_score))
    print()
```
![](https://i.imgur.com/lh4sbOQ.png)
![](https://i.imgur.com/HDThlOi.png)
:::info
'Decision Tree' still reaches about 79% accuracy on the test set.
:::

## Extra Observations
Let's take a further look at the decision tree model and rank the important features (columns), that is, the ones that influence the outcome the most.
```
# Get the feature importances
decisionTree = model[5]
importances = pd.DataFrame({'feature': train_data.iloc[:, 1:8].columns,
                            'importance': np.round(decisionTree.feature_importances_, 3)})
importances = importances.sort_values('importance', ascending = False).set_index('feature')
importances

# Visualize the importances
importances.plot.bar()
```
![](https://i.imgur.com/xRljeNA.png)
![](https://i.imgur.com/wbrME9Z.png)

## Play Along: Can You Survive?
```
# Pclass    int64
# Sex       int32    male, female -> [1, 0]
# Age       float64
# SibSp     int64
# Parch     int64
# Fare      float64
# Embarked  int32    S, C, Q -> [2, 0, 1]

# My survival
my_survival = [[2, 0, 22, 1, 2, 14.45, 1]]

# Scale my data with the same scaler used on the training set
my_survival_scaled = sc.transform(my_survival)

# Print the prediction of my survival using the Decision Tree Classifier
pred = model[5].predict(my_survival_scaled)
print(pred)

if pred[0] == 0:
    print('Oh no! You did not make it')
else:
    print('Nice! You Survived')
```
![](https://i.imgur.com/BJJ6hRN.png)

## Conclusion
Using the Titanic **train.csv** from Kaggle, we have walked through a simple, basic round of data analysis and data processing. I hope the code and explanations in this article help you get a first taste of data analysis and get up to speed quickly, and also give you a feel for the basic structure and flow of a machine learning project: in a typical project, the great majority of the time is spent on data analysis and feature engineering.

Finally, if you are interested in the Titanic survival-prediction competition or want to optimize further, try actually submitting on Kaggle and reading the discussion board, or search other sites; you are sure to come away with a wide range of insights and experience!

## Reference
[Titanic Survival Prediction Using Machine Learning](https://www.youtube.com/watch?v=rODWw2_1mCI)

## Further Reading
[Kaggle Titanic - Machine Learning from Disaster discussion board](https://www.kaggle.com/c/titanic/discussion)
[Kaggle Competition: Titanic Survival Prediction (Top 3%)](https://yulongtsai.medium.com/https-medium-com-yulongtsai-titanic-top3-8e64741cc11f)
[Predicting the Survival of Titanic Passengers](https://towardsdatascience.com/predicting-the-survival-of-titanic-passengers-30870ccc7e8)

:::warning
If you would like to follow my articles, you can subscribe on the left!
Likes and encouragement are always appreciated.
IG -> duck_tech_ is my newly created account, feel free to follow. From time to time I share LeetCode challenges | full-stack web and data science study notes | personal growth experiences :love_letter:
:::
:::success
If you spot any errors or have questions about this article, you are very welcome to DM me on IG with suggestions so we can learn and grow together!
:::