--- title: Titanic - Machine Learning from Disaster - 10836008、10836013、10836025 --- ![](https://i.imgur.com/fKqSVVr.png) # 介紹 ## 研究目的 本次選定 [Titanic - Machine Learning from Disaster](https://www.kaggle.com/c/titanic/overview) 這個資料集,1912年4月10日,鐵達尼號展開首航,也是唯一一次的載客出航,最終目的地為紐約。部分乘客為當時世界上頂級富豪,以及許多來自英國、愛爾蘭、斯堪地那維亞和歐洲其他地區的移民,他們尋求在美國展開新生活的機會。4月15日在中途發生擦撞冰山後沉沒的嚴重災難。2,224名船上人員中有1,514人罹難,成為近代史上最嚴重的和平時期船難。 本次期望由 Kaggle上 的 dataset 進行機器學習與深度學習練習,並選定一種方法進行預測與建議。 > The data has been split into two groups: >training set (train.csv) >test set (test.csv) >The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features. >The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic. >We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like. > ----截自Titanic中的Description ## 研究目標 - 分析資訊 ○ 乘客的性別、姓名、出發港口、住的艙等、房間號碼、年齡、兄弟姊妹+老婆丈夫數量(Sibsp)、父母小孩的數量(parch)、票的費用、票的號碼 - 預估乘客是否會在鐵達尼號沈船的意外中生存下來 - 使用隨機森林(Random Forest)進行預測 # 分數與排名 Top 31% ![](https://i.imgur.com/wh8FGJn.jpg) ![](https://i.imgur.com/CZMSxaY.jpg) ![](https://i.imgur.com/2NAwzXG.jpg) # 實作解析 ## 資料集介紹 ```python= import pandas as pd import numpy as np from sklearn.ensemble import RandomForestClassifier # 隨機森林 ``` 這次採用的是以 sklearn 為主要的學習工具,搭配 jupyter notebook ```python= train = pd.read_csv("train.csv") test = pd.read_csv("test.csv") ``` 第一步由 pandas 讀取 csv,並存在一個 dataframe 裡 ```python= test.info() train.info() ``` ![](https://i.imgur.com/VwOQ72Y.png) 使用info()函式觀察train以及test資料是否有空值,可以看到有許多欄位存在此現象,我們發現Age、Cabin、Embarked,這些欄位都有缺失值 ```python= train.loc[train["Sex"] == "male", "Sex"] = 0 train.loc[train["Sex"] == "female", "Sex"] = 1 test.loc[test["Sex"] == "male", "Sex"] = 0 test.loc[test["Sex"] == "female", "Sex"] = 1 ``` 將男性male轉為 0 將女性female轉為 1 ```python= train["Age"] = train["Age"].fillna(train["Age"].median()) test["Age"] = test["Age"].fillna(test["Age"].median()) train["Fare"] = train["Fare"].fillna(train['Fare'].median()) test["Fare"] = test["Fare"].fillna(test['Fare'].median()) train["Parch"] = train["Parch"].fillna(train['Parch'].median()) test["Parch"] = test["Parch"].fillna(test['Parch'].median()) train["SibSp"] = train["SibSp"].fillna(train['SibSp'].median()) test["SibSp"] = test["SibSp"].fillna(test['SibSp'].median()) ``` 使用中位數來填補缺失值 ```python= test["Cabin"] = test['Cabin'].apply(lambda x : str(x)[0] if not pd.isnull(x) else 'NoCabin') train["Cabin"] = train['Cabin'].apply(lambda x : str(x)[0] if not pd.isnull(x) else 'NoCabin') test['Cabin'] = test['Cabin'].astype('category').cat.codes train['Cabin'] = train['Cabin'].astype('category').cat.codes train["Cabin"].unique() train["Cabin"] = train["Cabin"].fillna(train['Cabin'].median()) test["Cabin"] = test["Cabin"].fillna(test['Cabin'].median() ``` 觀察Cabin的資料後,只取出最前面的英文字母,剩下的用NoCabin來表示。 並且將類別資料轉為整數。 使用中位數來填補缺失值 ```python= train["Embarked"] = train["Embarked"].fillna("S") train.loc[train["Embarked"] == "S", "Embarked"] = 0 train.loc[train["Embarked"] == "C", "Embarked"] = 1 train.loc[train["Embarked"] == "Q", "Embarked"] = 2 test["Embarked"] = test["Embarked"].fillna("S") test.loc[test["Embarked"] == "S", "Embarked"] = 0 test.loc[test["Embarked"] == "C", "Embarked"] = 1 test.loc[test["Embarked"] == "Q", "Embarked"] = 2 ``` 將Embarked缺失欄位都設為S港口 ```python= predictors = ["Pclass", "Sex", "Age", "Fare", "Embarked","Cabin"] ``` 先觀察 Fare 欄位的資料分布狀況 ```python= import matplotlib.pyplot as plt import seaborn as sns # 合併train及test的資料 df_data = train.append( test ) # 對 Fare 欄位取對數 df_data['LogFare'] = np.log1p( df_data.Fare ) # 直方圖(Histogram) fig, axs = plt.subplots( 1,2,figsize=(12,5) ) plt.subplot( 1,2,1 ) sns.distplot( df_data.Fare, kde=True, bins=45, color='skyblue', label='bins = 45' ) plt.xlabel( 'Fare' ) plt.ylabel( 'Counts' ) plt.legend( ) plt.subplot( 1,2,2 ) sns.distplot( df_data.LogFare, kde=True, bins=45, color='skyblue', label='bins = 45' ) plt.xlabel( 'Log Fare' ) plt.ylabel( '' ) plt.legend( ) plt.show() ``` ![](https://i.imgur.com/BVSU5uS.png) 藉由長條圖及盒鬚圖,觀察不同的票價組別,彼此間生存率的差異性 ```python= # 計算 Fare 欄位各個百分位數(Percentile) P_all = [ np.percentile( df_data.Fare, q=i ) for i in np.arange(0,101) ] Pth_Percentile = pd.DataFrame( { 'Q':list(range(101)), 'Value':P_all } ) # The first、second and third quartile(i,e., the 25th、50th and 75th Percentile) Q1 = Pth_Percentile.iloc[ 25, 1 ] Q2 = Pth_Percentile.iloc[ 50, 1 ] Q3 = Pth_Percentile.iloc[ 75, 1 ] IQR = Q3 - Q1 print( f'Q1 = {Q1}' ) print( f'Q2 = {Q2} = Median' ) print( f'Q3 = {Q3}' ) print( f'Maximum = {df_data.Fare.max()}') print( f'IQR = Q3 - Q1 = {IQR}' ) print( f'Q3 + 1.5IQR = {Q3+1.5*IQR}' ) # 依照四分位數,對 Fare 欄位進行分組 Fare_bin = [ 0, Q1, Q2, Q3, Q3+1.5*IQR, df_data.Fare.max() ] df_data[ 'Fare_Group' ] = pd.cut( df_data.Fare.values, Fare_bin ) Fare_bin = [ 0, Q1, Q2, Q3, Q3+1.5*IQR, df_data.Fare.max() ] df_data[ 'Fare_Group' ] = pd.cut( df_data.Fare.values, Fare_bin ) plt.subplots( figsize=(12,5) ) sns.countplot( df_data.Fare_Group, hue=df_data.Survived, palette=['lightcoral','skyblue'] ) plt.ylabel( 'Counts' ) plt.xticks( rotation=-45, fontsize=12 ) plt.show() ``` ![](https://i.imgur.com/Qz3Dd20.png) ```python= plt.subplots( figsize=(15,12) ) sns.boxplot( x='LogFare', y='Fare_Group', data=df_data, hue='Survived', orient='h', color='skyblue' ) plt.yticks( rotation=-30, fontsize=15 ) plt.show() ``` ![](https://i.imgur.com/J71D9o0.png) 由上圖可知,票價越高的組,生存率也越大,且在生還者的票價中,其中位數都高於罹難者的票價中位數。 宣告預測特徵(predictors) ```python= RFC = RandomForestClassifier(random_state=2,n_estimators=1500,min_samples_split=20,oob_score=True) RFC.fit(train[predictors], train["Survived"]) print(RFC.oob_score_) ``` ![](https://i.imgur.com/TSGuh40.png) 參數介紹 : * random_state :控制的是森林生成的模式(生成一片固定的森林),而非讓一個森林之中只有一棵樹 * n_estimators : 決策樹的個數,越多越好,但是性能隨著數字越高而越差=訓練時間越長 * min_samples_split : 內部節點再劃分所需最小樣本數 * oob_score : 是否使用袋外樣本來估計該模型大概的準確率 最後得到oob score為 0.8226711560044894 ```python= pred = RFC.predict(test[predictors]) submission = pd.DataFrame({ "PassengerId": test["PassengerId"], "Survived": pred }) submission.to_csv('submission.csv', index=False) ``` 將測試集的結果保存到submission.csv 然後將CSV檔案上傳到Kaggle # 研究結果與討論   此次分析我們使用隨機森林算法預測存活者,這次我們對於有缺漏值的欄位,選擇用中位數來補值,下一步可考慮使用其他ML模型進行補值。 # 結論   本專案尚處於初始階段,只做了簡單的資料分析以及模型訓練,即使如此正確率仍高達77.751%,這是因為選擇了較適合的模型來做訓練吧。由於森林中的每棵樹的生成法為Boostrap,即表示不會用到所有的資料去生成每棵樹;還可用袋外樣本去評估預測的準確度。 之後如有更多時間將會再試著做更多不同的觀察及嘗試。 # 參考文獻 [Random Forest算法參數解釋與優化] (https://blog.csdn.net/qq_16633405/article/details/61200502) [Random Forest(sklearn参数详解)] (https://blog.csdn.net/u012102306/article/details/52228516) [Kaggle競賽-鐵達尼號生存預測(前16%排名)] (https://medium.com/jameslearningnote/%E8%B3%87%E6%96%99%E5%88%86%E6%9E%90-%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-%E7%AC%AC4-1%E8%AC%9B-kaggle%E7%AB%B6%E8%B3%BD-%E9%90%B5%E9%81%94%E5%B0%BC%E8%99%9F%E7%94%9F%E5%AD%98%E9%A0%90%E6%B8%AC-%E5%89%8D16-%E6%8E%92%E5%90%8D-a8842fea7077)