# **【Titanic 鐵達尼號: 預測倖存者分析】ft. Kaggle - Python**
[資料來源 : Titanic - Machine Learning from Disaster](https://www.kaggle.com/competitions/titanic)
:::info
- 一、瞭解資料內容 Checking data content
- 環境準備 Environment setup,使用 Python NumPy、Pandas、Matplotlib、Plotly、Seaborn
- 資料來源 Data source
- 讀取資料、查看基本訊息 Import data 、View basic information
- 二、資料清理 Data cleaning
- 三、轉換資料型態 Converting data type
- 四、特徵工程 Feature engineering
- 五、分析 Analyzing
- Baseline、ML
- 測試集 Test set
- 模型 Model
- 損失率 Loss
- 準確率 Accuracy
- 試著用 PyTorch 探索神經網絡 Exploring a neural network with PyTorch
- 損失率 Loss
- 準確率 Accuracy
- [完整程式碼](https://github.com/06Cata/Kaggle_Titanic/blob/main/Kaggle_titanic.ipynb)
- [【資料分析作品集 My data analysis portfolio】](https://hackmd.io/jAchN4s6SOG4KekFGtSFtA?view)
:::
## 一、瞭解資料內容 Checking data content
### 環境準備 Environment setup,使用 Python NumPy、Pandas、Matplotlib、Plotly、Seaborn
```=
import pandas as pd
import numpy as np
import seaborn as sb
```
<br/>
### 資料來源 Data source
I downloaded the raw Kaggle data and keep a copy in my own GitHub repository.


From Google Colab, the files are then downloaded into temporary files and read from there.
There are three files: the training set, the test set, and the answer file for the test set.
```=
import tempfile
import requests

def download_and_save_dataset(file_name):
    url = f'https://github.com/06Cata/Kaggle_Titanic/raw/main/raw_data/{file_name}.csv'
    response = requests.get(url)
    # Save the downloaded data to a temporary file and return its path
    temp_file = tempfile.NamedTemporaryFile(delete=False)
    temp_file.write(response.content)
    temp_file.close()
    return temp_file.name
```
```=
import pandas as pd
train_file_path = download_and_save_dataset('train')
titanic_train = pd.read_csv(train_file_path)
test_file_path = download_and_save_dataset('test')
titanic_test = pd.read_csv(test_file_path)
gender_submission_file_path = download_and_save_dataset('gender_submission')
titanic_gender_submission = pd.read_csv(gender_submission_file_path)
titanic_train
```

```=
titanic_test
```

```=
titanic_gender_submission
```


>- PassengerId: passenger ID
>- Survived: whether the passenger survived; 1 = yes, 0 = no
>- Pclass: ticket class; 1 = first, 2 = second, 3 = third class
>- Name: passenger name
>- Sex: passenger sex, male / female
>- Age: passenger age
>- SibSp: number of siblings / spouses aboard
>- Parch: number of parents / children aboard
>- Ticket: ticket number
>- Fare: fare paid
>- Cabin: cabin number
>- Embarked: port of embarkation; C = Cherbourg, Q = Queenstown, S = Southampton
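
As a quick sanity check on these fields, the survival rate by sex and by class can be previewed directly from the raw training frame (a small sketch, assuming `titanic_train` loaded above):
```=
# Quick preview: survival rate by Sex and by Pclass on the raw training data
print(titanic_train.groupby('Sex')['Survived'].mean().round(2))
print(titanic_train.groupby('Pclass')['Survived'].mean().round(2))
```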
<br/>
### 讀取資料、查看基本訊息 Import data 、View basic information
Basic information and summary statistics of the numeric columns
```=
# Work on a copy named df_train, which the rest of the analysis uses
df_train = titanic_train.copy()
df_train.describe()
```

```=
df_train.isna().sum()
```

```=
df_train.info()
```

```=
# A first look at how each column relates to 'Survived'; only numeric columns without missing values are plotted here
# Pclass
import plotly.graph_objects as go
from scipy.stats import gaussian_kde
# 計算密度估計
pclass_not_survived = df_train[df_train['Survived'] == 0]['Pclass']
pclass_survived = df_train[df_train['Survived'] == 1]['Pclass']
kde_not_survived = gaussian_kde(pclass_not_survived)
kde_survived = gaussian_kde(pclass_survived)
# 設置範圍
pclass_range = np.linspace(0, df_train['Pclass'].max(), 1000)
# 計算估計密度
density_not_survived = kde_not_survived(pclass_range)
density_survived = kde_survived(pclass_range)
#
fig = go.Figure()
# 未生還乘客的 KDE 曲線
fig.add_trace(go.Scatter(
x=pclass_range,
y=density_not_survived,
mode='lines',
name='Not Survived',
fill='tozeroy',
line=dict(color='lightcoral')
))
# 生還乘客的 KDE 曲線
fig.add_trace(go.Scatter(
x=pclass_range,
y=density_survived,
mode='lines',
name='Survived',
fill='tozeroy',
line=dict(color='lightgreen')
))
fig.update_layout(
title='KDE Plot of Pclass by Survival Status',
xaxis_title='Pclass',
yaxis_title='Density',
width=1000,
height=400
)
fig.show()
```

```=
# SibSp
# 計算密度估計
sibsp_not_survived = df_train[df_train['Survived'] == 0]['SibSp']
sibsp_survived = df_train[df_train['Survived'] == 1]['SibSp']
kde_not_survived = gaussian_kde(sibsp_not_survived)
kde_survived = gaussian_kde(sibsp_survived)
# 設置範圍
sibsp_range = np.linspace(0, df_train['SibSp'].max(), 1000)
# 計算估計密度
density_not_survived = kde_not_survived(sibsp_range)
density_survived = kde_survived(sibsp_range)
#
fig = go.Figure()
# 未生還乘客的 KDE 曲線
fig.add_trace(go.Scatter(
x=sibsp_range,
y=density_not_survived,
mode='lines',
name='Not Survived',
fill='tozeroy',
line=dict(color='lightcoral')
))
# 生還乘客的 KDE 曲線
fig.add_trace(go.Scatter(
x=sibsp_range,
y=density_survived,
mode='lines',
name='Survived',
fill='tozeroy',
line=dict(color='lightgreen')
))
fig.update_layout(
title='KDE Plot of SibSp by Survival Status',
xaxis_title='SibSp',
yaxis_title='Density',
width=1000,
height=400
)
fig.show()
```

```=
# Parch
# 計算密度估計
parch_not_survived = df_train[df_train['Survived'] == 0]['Parch']
parch_survived = df_train[df_train['Survived'] == 1]['Parch']
kde_not_survived = gaussian_kde(parch_not_survived)
kde_survived = gaussian_kde(parch_survived)
# 設置範圍
parch_range = np.linspace(0, df_train['Parch'].max(), 1000)
# 計算估計密度
density_not_survived = kde_not_survived(parch_range)
density_survived = kde_survived(parch_range)
#
fig = go.Figure()
# 未生還乘客的 KDE 曲線
fig.add_trace(go.Scatter(
x=parch_range,
y=density_not_survived,
mode='lines',
name='Not Survived',
fill='tozeroy',
line=dict(color='lightcoral')
))
# 生還乘客的 KDE 曲線
fig.add_trace(go.Scatter(
x=parch_range,
y=density_survived,
mode='lines',
name='Survived',
fill='tozeroy',
line=dict(color='lightgreen')
))
fig.update_layout(
title='KDE Plot of Parch by Survival Status',
xaxis_title='Parch',
yaxis_title='Density',
width=1000,
height=400
)
fig.show()
```

```=
# Fare
from scipy.stats import gaussian_kde
# 計算密度估計
fare_not_survived = df_train[df_train['Survived'] == 0]['Fare']
fare_survived = df_train[df_train['Survived'] == 1]['Fare']
kde_not_survived = gaussian_kde(fare_not_survived)
kde_survived = gaussian_kde(fare_survived)
# 設置範圍
fare_range = np.linspace(0, df_train['Fare'].max(), 1000)
# 計算估計密度
density_not_survived = kde_not_survived(fare_range)
density_survived = kde_survived(fare_range)
#
fig = go.Figure()
# 未生還乘客的 KDE 曲線
fig.add_trace(go.Scatter(
x=fare_range,
y=density_not_survived,
mode='lines',
name='Not Survived',
fill='tozeroy',
line=dict(color='lightcoral')
))
# 生還乘客的 KDE 曲線
fig.add_trace(go.Scatter(
x=fare_range,
y=density_survived,
mode='lines',
name='Survived',
fill='tozeroy',
line=dict(color='lightgreen')
))
fig.update_layout(
title='KDE Plot of Fare by Survival Status',
xaxis_title='Fare',
yaxis_title='Density',
width=1000,
height=400
)
fig.show()
```

<br/>
## 二、資料清理 Data cleaning
>Cabin: drop (about 77% missing)
>Age: numeric; fill with the mean per Sex and Pclass group
>Embarked: fill with the mode
```=
print((df_train['Cabin'].isna().sum()/df_train.shape[0]*100).round(2))
```

```=
# Age,用性別、艙等平均補
df_train['Age'].fillna(value=df_train.groupby(['Sex','Pclass'])['Age'].transform('mean'),inplace=True)
df_train.head(2)
```
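
Because the fill value is the group mean, the per-group means are unchanged afterwards; a small verification sketch using `df_train` as defined above:
```=
# Mean Age per (Sex, Pclass) group — the values used to fill missing ages
print(df_train.groupby(['Sex', 'Pclass'])['Age'].mean().round(1))
# Confirm no missing Age values remain
print(df_train['Age'].isna().sum())
```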

```=
# Embarked,用眾數補
mode_embarked = df_train['Embarked'].mode()[0]
df_train['Embarked'].fillna(value=mode_embarked, inplace=True)
df_train.head(2)
```

```=
# 確認一下
df_train.isnull().sum()
```

<br/>
## 三、轉換資料型態 Converting data type
```=
# Sex、Embarked 轉為 One-Hot
df_train['Sex_new'] = df_train['Sex'].copy()
df_train['Embarked_new'] = df_train['Embarked'].copy()
df_train = pd.get_dummies(df_train, columns=['Sex_new', 'Embarked_new'], prefix=['Sex_new', 'Embarked_new'])
df_train['Sex_new_female'] = df_train['Sex_new_female'].astype(int)
df_train['Sex_new_male'] = df_train['Sex_new_male'].astype(int)
df_train['Embarked_new_C'] = df_train['Embarked_new_C'].astype(int)
df_train['Embarked_new_Q'] = df_train['Embarked_new_Q'].astype(int)
df_train['Embarked_new_S'] = df_train['Embarked_new_S'].astype(int)
df_train
```
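
A more compact equivalent, assuming a pandas version whose `get_dummies` accepts a `dtype` argument, produces integer dummy columns in one call instead of casting each column afterwards (demonstrated on a throwaway copy so the frame above is untouched):
```=
# One-step integer one-hot encoding on a copy
demo = df_train[['Sex', 'Embarked']].copy()
demo = pd.get_dummies(demo, columns=['Sex', 'Embarked'], prefix=['Sex_new', 'Embarked_new'], dtype=int)
demo.head()
```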

```=
# Sex、Embarked 轉為 LabelEncoder
from sklearn.preprocessing import LabelEncoder
label_encoder_sex_labeled = LabelEncoder()
df_train['Sex_labeled'] = label_encoder_sex_labeled.fit_transform(df_train['Sex'])
label_encoder_embarked_labeled = LabelEncoder()
df_train['Embarked_labeled'] = label_encoder_embarked_labeled.fit_transform(df_train['Embarked'])
df_train
# !pip install category_encoders
# from category_encoders.target_encoder import TargetEncoder
# target_encoder = TargetEncoder()
# df_train['Embarked'] = target_encoder.fit_transform(df_train['Embarked'])
```

```=
# 'Sex' 列的標籤對應關係
sex_mapping = dict(zip(label_encoder_sex_labeled.classes_, label_encoder_sex_labeled.transform(label_encoder_sex_labeled.classes_)))
print("Sex mapping:", sex_mapping)
# 'Embarked' 列的標籤對應關係
embarked_mapping = dict(zip(label_encoder_embarked_labeled.classes_, label_encoder_embarked_labeled.transform(label_encoder_embarked_labeled.classes_)))
print("Embarked mapping:", embarked_mapping)
```

<br/>
## 四、特徵工程 Feature engineering
```=
# 年齡多一欄,設為年齡組
bins = [0, 21, 45, 65, 100]
labels = ['0-21', '22-45', '46-65', '66-100']
df_train['AgeGroup'] = pd.cut(df_train['Age'], bins=bins, labels=labels, right=False)
df_train['AgeGroup'] = df_train['AgeGroup'].cat.codes # 轉換為數值
df_train
```
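
A quick count per code (a small check on the binning just created) shows how the passengers spread across the four age bands:
```=
# Passengers per age band; codes 0-3 correspond to the labels defined above
df_train['AgeGroup'].value_counts().sort_index()
```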

```=
# 年齡組均分,後面再看與['AgeGroup']哪個較準
#
df_train['AgeBin_3'] = pd.qcut(df_train['Age'],3)
df_train['AgeBin_4'] = pd.qcut(df_train['Age'],4)
df_train['AgeBin_5'] = pd.qcut(df_train['Age'],5)
label = LabelEncoder()
df_train['AgeBin_Code_3']=label.fit_transform(df_train['AgeBin_3'])
df_train['AgeBin_Code_4']=label.fit_transform(df_train['AgeBin_4'])
df_train['AgeBin_Code_5']=label.fit_transform(df_train['AgeBin_5'])
# cross tab
df_3 = pd.crosstab(df_train['AgeBin_Code_3'],df_train['Pclass'])
df_4 = pd.crosstab(df_train['AgeBin_Code_4'],df_train['Pclass'])
df_5 = pd.crosstab(df_train['AgeBin_Code_5'],df_train['Pclass'])
display(df_3)
display(df_4)
display(df_5)
```

```=
# Survival rate for each age bin split, to compare the three binnings
import matplotlib.pyplot as plt
import seaborn as sns

# df_train must already contain 'AgeBin_Code_3', 'AgeBin_Code_4', 'AgeBin_Code_5' and 'Survived'
fig, [ax1, ax2, ax3] = plt.subplots(1, 3, sharey=True)
fig.set_figwidth(18)
for axi in [ax1, ax2, ax3]:
    axi.axhline(0.5, linestyle='dashed', color='black', alpha=0.7)

# Use sns.barplot with a different palette per subplot
sns.barplot(x='AgeBin_Code_3', y='Survived', data=df_train, ax=ax1, palette='Blues')
sns.barplot(x='AgeBin_Code_4', y='Survived', data=df_train, ax=ax2, palette='Greens')
sns.barplot(x='AgeBin_Code_5', y='Survived', data=df_train, ax=ax3, palette='Reds')

# Titles
ax1.set_title('AgeBin_Code_3 vs Survived')
ax2.set_title('AgeBin_Code_4 vs Survived')
ax3.set_title('AgeBin_Code_5 vs Survived')
plt.show()
```

```=
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
# 假設 df_train 已經加載並包含 'AgeBin_Code_3', 'AgeBin_Code_4', 'AgeBin_Code_5' 和 'Survived' 欄位
X1 = df_train[['AgeBin_Code_3']]
X2 = df_train[['AgeBin_Code_4']]
X3 = df_train[['AgeBin_Code_5']]
y = df_train['Survived']
model = RandomForestClassifier()
# 每種分箱方法的交叉驗證分數
scores_3 = cross_val_score(model, X1, y, cv=5, scoring='accuracy')
scores_4 = cross_val_score(model, X2, y, cv=5, scoring='accuracy')
scores_5 = cross_val_score(model, X3, y, cv=5, scoring='accuracy')
print(f"使用 AgeBin_Code_3 的準確率: {scores_3.mean():.4f} ± {scores_3.std():.4f}")
print(f"使用 AgeBin_Code_4 的準確率: {scores_4.mean():.4f} ± {scores_4.std():.4f}")
print(f"使用 AgeBin_Code_5 的準確率: {scores_5.mean():.4f} ± {scores_5.std():.4f}")
# 選擇 AgeBin_Code_4
df_train = df_train.drop(columns=['AgeBin_3', 'AgeBin_4', 'AgeBin_5', 'AgeBin_Code_3', 'AgeBin_Code_5'])
```

```=
# 家屬多設一欄,總親屬人數
df_train['Family_size'] = df_train['SibSp'].astype(int) + df_train['Parch'].astype(int) + 1
df_train
```

```=
# 提取名字中的頭銜
df_train['Title'] = df_train['Name'].str.split(", ", expand=True)[1]
df_train['Title_2'] = df_train['Title'].str.split(". ", expand=True)[0]
df_train
```
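
The two-step split works because every name follows the `Surname, Title. Given name` pattern; an equivalent single regex extract (a sketch only, not used in the rest of the analysis) would be:
```=
# One-step title extraction: capture the text between ", " and the first "."
df_train['Name'].str.extract(r',\s*([^.]+)\.')[0].unique()
```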

```=
df_train['Title_2'].unique()
```
```=
df_train.groupby('Title_2')['Age'].mean()
```

```=
# Cross-tabulate title vs sex, and title vs survival
crosstab_sex = pd.crosstab(df_train['Title_2'], df_train['Sex'])
styled_crosstab_sex = crosstab_sex.T.style.background_gradient(cmap='summer_r')
crosstab_survived = pd.crosstab(df_train['Title_2'], df_train['Survived'])
styled_crosstab_survived = crosstab_survived.T.style.background_gradient(cmap='summer_r')
display(styled_crosstab_sex)
display(styled_crosstab_survived)
```

```=
# 因頭銜過多,提取為 Officer 專業人士、Royalty 特殊地位
# Officer 專業人士:Captain、Colonel、Major、Doctor、Reverend
# Royalty 特殊地位:Jonkheer、Don、Sir、the Countess、Dona、Lady
title_encoding = {
"Capt":"Officer",
"Col":"Officer",
"Major":"Officer",
"Jonkheer":"Royalty",
"Don":"Royalty",
"Sir":"Royalty",
"Dr":"Officer",
"Rev": "Officer",
"the Countess":"Royalty",
"Dona":"Royalty",
"Mme":"Mrs",
"Mlle":"Miss",
"Ms":"Mrs",
"Mr":"Mr",
"Mrs":"Mrs",
"Miss":"Miss",
"Master":"Master",
"Lady":"Royalty"
}
df_train['Title_3'] = df_train['Title_2'].map(title_encoding)
df_train
```

```=
# Cross-tabulate the grouped titles vs sex, and vs survival
crosstab_sex = pd.crosstab(df_train['Title_3'], df_train['Sex'])
styled_crosstab_sex = crosstab_sex.T.style.background_gradient(cmap='summer_r')
crosstab_survived = pd.crosstab(df_train['Title_3'], df_train['Survived'])
styled_crosstab_survived = crosstab_survived.T.style.background_gradient(cmap='summer_r')
display(styled_crosstab_sex)
display(styled_crosstab_survived)
```

```=
# Title_3 轉為 One-Hot
df_train['Title_3_new'] = df_train['Title_3'].copy()
df_train = pd.get_dummies(df_train, columns=['Title_3_new'], prefix=['Title_3_new'])
df_train['Title_3_new_Master'] = df_train['Title_3_new_Master'].astype(int)
df_train['Title_3_new_Miss'] = df_train['Title_3_new_Miss'].astype(int)
df_train['Title_3_new_Mr'] = df_train['Title_3_new_Mr'].astype(int)
df_train['Title_3_new_Mrs'] = df_train['Title_3_new_Mrs'].astype(int)
df_train['Title_3_new_Officer'] = df_train['Title_3_new_Officer'].astype(int)
df_train['Title_3_new_Royalty'] = df_train['Title_3_new_Royalty'].astype(int)
df_train = df_train.drop(columns=['Title', 'Title_2', 'Title_3'])
df_train
```

```=
# Fare
# Making bins
df_train['FareBin_4'] = pd.qcut(df_train['Fare'],4)
df_train['FareBin_5'] = pd.qcut(df_train['Fare'],5)
df_train['FareBin_6'] = pd.qcut(df_train['Fare'],6)
label = LabelEncoder()
df_train['FareBin_Code_4']=label.fit_transform(df_train['FareBin_4'])
df_train['FareBin_Code_5']=label.fit_transform(df_train['FareBin_5'])
df_train['FareBin_Code_6']=label.fit_transform(df_train['FareBin_6'])
# cross tab
df_4 = pd.crosstab(df_train['FareBin_Code_4'],df_train['Pclass'])
df_5 = pd.crosstab(df_train['FareBin_Code_5'],df_train['Pclass'])
df_6 = pd.crosstab(df_train['FareBin_Code_6'],df_train['Pclass'])
display(df_4)
display(df_5)
display(df_6)
```

```=
# Survival rate for each fare bin split, to compare the three binnings
import matplotlib.pyplot as plt
import seaborn as sns

# df_train must already contain 'FareBin_Code_4', 'FareBin_Code_5', 'FareBin_Code_6' and 'Survived'
fig, [ax1, ax2, ax3] = plt.subplots(1, 3, sharey=True)
fig.set_figwidth(18)
for axi in [ax1, ax2, ax3]:
    axi.axhline(0.5, linestyle='dashed', color='black', alpha=0.7)

# Use sns.barplot with a different palette per subplot
sns.barplot(x='FareBin_Code_4', y='Survived', data=df_train, ax=ax1, palette='Blues')
sns.barplot(x='FareBin_Code_5', y='Survived', data=df_train, ax=ax2, palette='Greens')
sns.barplot(x='FareBin_Code_6', y='Survived', data=df_train, ax=ax3, palette='Reds')

# Titles
ax1.set_title('FareBin_Code_4 vs Survived')
ax2.set_title('FareBin_Code_5 vs Survived')
ax3.set_title('FareBin_Code_6 vs Survived')
plt.show()
```

```=
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
# 假設 df_train 已經加載並包含 'FareBin_Code_4'、'FareBin_Code_5'、'FareBin_Code_6' 和 'Survived' 欄位
X1 = df_train[['FareBin_Code_4']]
X2 = df_train[['FareBin_Code_5']]
X3 = df_train[['FareBin_Code_6']]
y = df_train['Survived']
model = RandomForestClassifier()
# 每種分箱方法的交叉驗證分數
scores_4 = cross_val_score(model, X1, y, cv=5, scoring='accuracy')
scores_5 = cross_val_score(model, X2, y, cv=5, scoring='accuracy')
scores_6 = cross_val_score(model, X3, y, cv=5, scoring='accuracy')
print(f"使用 FareBin_Code_4 的準確率: {scores_4.mean():.4f} ± {scores_4.std():.4f}")
print(f"使用 FareBin_Code_5 的準確率: {scores_5.mean():.4f} ± {scores_5.std():.4f}")
print(f"使用 FareBin_Code_6 的準確率: {scores_6.mean():.4f} ± {scores_6.std():.4f}")
# FareBin_Code_6 的模型準確率最高(0.6825)且標準差最小(0.0446)
df_train = df_train.drop(columns=['FareBin_4', 'FareBin_5', 'FareBin_6', 'FareBin_Code_4', 'FareBin_Code_5'])
```

<br/>
## 五、分析 Analyzing
Plot the relationships between the features and survival
```=
# !pip install plotly
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
```
```=
# 0
# Sex_labeled
# female 0, male 1
from scipy.stats import gaussian_kde
# 計算密度估計
sex_not_survived = df_train[df_train['Survived'] == 0]['Sex_labeled']
sex_survived = df_train[df_train['Survived'] == 1]['Sex_labeled']
kde_not_survived = gaussian_kde(sex_not_survived)
kde_survived = gaussian_kde(sex_survived)
# 設置範圍
sex_range = np.linspace(0, df_train['Sex_labeled'].max(), 1000)
# 計算估計密度
density_not_survived = kde_not_survived(sex_range)
density_survived = kde_survived(sex_range)
#
fig = go.Figure()
# 未生還乘客的 KDE 曲線
fig.add_trace(go.Scatter(
x=sex_range,
y=density_not_survived,
mode='lines',
name='Not Survived',
fill='tozeroy',
line=dict(color='lightcoral')
))
# 生還乘客的 KDE 曲線
fig.add_trace(go.Scatter(
x=sex_range,
y=density_survived,
mode='lines',
name='Survived',
fill='tozeroy',
line=dict(color='lightgreen')
))
fig.update_layout(
title='KDE Plot of Sex_labeled by Survival Status,female 0, male 1',
xaxis_title='Sex_labeled',
yaxis_title='Density',
width=1000,
height=400
)
fig.show()
```

```=
# 0
# Embarked_labeled
# C 0, Q 1, S 2
from scipy.stats import gaussian_kde
# 計算密度估計
embarked_not_survived = df_train[df_train['Survived'] == 0]['Embarked_labeled']
embarked_survived = df_train[df_train['Survived'] == 1]['Embarked_labeled']
kde_not_survived = gaussian_kde(embarked_not_survived)
kde_survived = gaussian_kde(embarked_survived)
# 設置範圍
embarked_range = np.linspace(0, df_train['Embarked_labeled'].max(), 1000)
# 計算估計密度
density_not_survived = kde_not_survived(embarked_range)
density_survived = kde_survived(embarked_range)
#
fig = go.Figure()
# 未生還乘客的 KDE 曲線
fig.add_trace(go.Scatter(
x=embarked_range,
y=density_not_survived,
mode='lines',
name='Not Survived',
fill='tozeroy',
line=dict(color='lightcoral')
))
# 生還乘客的 KDE 曲線
fig.add_trace(go.Scatter(
x=embarked_range,
y=density_survived,
mode='lines',
name='Survived',
fill='tozeroy',
line=dict(color='lightgreen')
))
fig.update_layout(
title='KDE Plot of Embarked_labeled by Survival Status, C 0, Q 1, S 2',
xaxis_title='Embarked_labeled',
yaxis_title='Density',
width=1000,
height=400
)
fig.show()
```

```=
# 0
# Age
from scipy.stats import gaussian_kde
# 計算密度估計
age_not_survived = df_train[df_train['Survived'] == 0]['Age']
age_survived = df_train[df_train['Survived'] == 1]['Age']
kde_not_survived = gaussian_kde(age_not_survived)
kde_survived = gaussian_kde(age_survived)
# 設置範圍
age_range = np.linspace(0, df_train['Age'].max(), 1000)
# 計算估計密度
density_not_survived = kde_not_survived(age_range)
density_survived = kde_survived(age_range)
#
fig = go.Figure()
# 未生還乘客的 KDE 曲線
fig.add_trace(go.Scatter(
x=age_range,
y=density_not_survived,
mode='lines',
name='Not Survived',
fill='tozeroy',
line=dict(color='lightcoral')
))
# 生還乘客的 KDE 曲線
fig.add_trace(go.Scatter(
x=age_range,
y=density_survived,
mode='lines',
name='Survived',
fill='tozeroy',
line=dict(color='lightgreen')
))
fig.update_layout(
title='KDE Plot of Age by Survival Status',
xaxis_title='Age',
yaxis_title='Density',
width=1000,
height=400
)
fig.show()
```

```=
# 0
# AgeGroup
from scipy.stats import gaussian_kde
# 計算密度估計
age_group_not_survived = df_train[df_train['Survived'] == 0]['AgeGroup']
age_group_survived = df_train[df_train['Survived'] == 1]['AgeGroup']
kde_not_survived = gaussian_kde(age_group_not_survived)
kde_survived = gaussian_kde(age_group_survived)
# 設置範圍
age_group_range = np.linspace(0, df_train['AgeGroup'].max(), 1000)
# 計算估計密度
density_not_survived = kde_not_survived(age_group_range)
density_survived = kde_survived(age_group_range)
#
fig = go.Figure()
# 未生還乘客的 KDE 曲線
fig.add_trace(go.Scatter(
x=age_group_range,
y=density_not_survived,
mode='lines',
name='Not Survived',
fill='tozeroy',
line=dict(color='lightcoral')
))
# 生還乘客的 KDE 曲線
fig.add_trace(go.Scatter(
x=age_group_range,
y=density_survived,
mode='lines',
name='Survived',
fill='tozeroy',
line=dict(color='lightgreen')
))
fig.update_layout(
title='KDE Plot of AgeGroup by Survival Status',
xaxis_title='AgeGroup',
yaxis_title='Density',
width=1000,
height=400
)
fig.show()
```

```=
# 0
# Family_size
from scipy.stats import gaussian_kde
# 計算密度估計
family_size_not_survived = df_train[df_train['Survived'] == 0]['Family_size']
family_size_survived = df_train[df_train['Survived'] == 1]['Family_size']
kde_not_survived = gaussian_kde(family_size_not_survived)
kde_survived = gaussian_kde(family_size_survived)
# 設置範圍
family_size_range = np.linspace(0, df_train['Family_size'].max(), 1000)
# 計算估計密度
density_not_survived = kde_not_survived(family_size_range)
density_survived = kde_survived(family_size_range)
#
fig = go.Figure()
# 未生還乘客的 KDE 曲線
fig.add_trace(go.Scatter(
x=family_size_range,
y=density_not_survived,
mode='lines',
name='Not Survived',
fill='tozeroy',
line=dict(color='lightcoral')
))
# 生還乘客的 KDE 曲線
fig.add_trace(go.Scatter(
x=family_size_range,
y=density_survived,
mode='lines',
name='Survived',
fill='tozeroy',
line=dict(color='lightgreen')
))
fig.update_layout(
title='KDE Plot of Family size by Survival Status',
xaxis_title='Family size',
yaxis_title='Density',
width=1000,
height=400
)
fig.show()
```

```=
# 1
# Ticket
# Find passengers who share the same ticket number
duplicate_ticket = []
for tk in df_train.Ticket.unique():
    tem = df_train.loc[df_train.Ticket == tk, 'Fare']
    if tem.count() > 1:
        duplicate_ticket.append(
            df_train.loc[df_train.Ticket == tk, ['Name', 'Ticket', 'Fare', 'Cabin', 'Family_size', 'Survived']]
        )
duplicate_ticket = pd.concat(duplicate_ticket)
duplicate_ticket_sorted = duplicate_ticket.sort_values(by='Family_size', ascending=False)
duplicate_ticket_sorted
```

```=
ticket_stats = duplicate_ticket_sorted.groupby('Ticket').agg(
    family_size=('Family_size', 'first'),  # assume family size is identical for all rows sharing a ticket
    survived_avg=('Survived', 'mean')
).reset_index()
ticket_stats.sort_values(by='family_size', ascending=False)
```

```=
plt.figure(figsize=(10, 6))
sns.scatterplot(x='family_size', y='survived_avg', data=ticket_stats)
plt.title('Family Size vs Survived Average')
plt.xlabel('Family Size')
plt.ylabel('Survived Average')
plt.show()
```

Compare whether one-hot and label encoding make any difference
```=
# 1
# df_train_onehot
columns_to_select_onehot = [
'Survived','Sex_new_female','Sex_new_male', 'Embarked_new_C', 'Embarked_new_Q', 'Embarked_new_S'
]
df_train_onehot = df_train[columns_to_select_onehot]
df_train_onehot.head()
```

```=
# df_encoded = df_train_onehot.select_dtypes(include=[np.number])
palette = {0: "red", 1: "green"}
sb.pairplot(df_train_onehot, hue="Survived", palette=palette)
```

```=
# 1
# df_train_label
columns_to_select_label = [
'Survived', 'Sex_labeled', 'Embarked_labeled'
]
df_train_label = df_train[columns_to_select_label]
df_train_label.head()
```

```=
palette = {0: "red", 1: "green"}
sb.pairplot(df_train_label, hue="Survived", palette=palette)
```

```=
# 1
# 選擇onehot
# 選取數值欄位
df_train_encoded = df_train.select_dtypes(include=[np.number])
# 相關係數矩陣
correlation_matrix = df_train_encoded.corr().round(2)
correlation_matrix.head()
```

```=
plt.figure(figsize=(12,9))
sns.heatmap(correlation_matrix,annot=True,cmap='RdBu_r',linewidths=0.2)
fig=plt.gcf()
plt.title("Correlation Heatmap of Titanic")
plt.show()
print(correlation_matrix['Survived'].sort_values(ascending=False))
# Sex: female > male
# Pclass: higher class > lower class
#   Pclass is negatively correlated with survival; since 1 = first class, the better the class, the higher the survival rate
# Fare: higher fare > lower fare
#   Fare is positively correlated with survival: the higher the fare, the higher the survival rate
# Embarked: Cherbourg > Southampton
# Parch is 0.08: passengers with parents or children aboard survived slightly more often
# SibSp is -0.04: passengers with siblings or a spouse aboard survived slightly less often, but the effect is tiny
# Family_size is 0.02: total family size is almost uncorrelated with survival
# AgeBin_Code_4 is -0.03: this age binning has almost no relationship with survival
# Age is -0.07: older passengers were slightly less likely to survive, but the effect is very small
# AgeGroup is -0.07: same pattern, very small effect
# Titles carry useful information about survival, especially in separating sex and age groups
# Title_3_new_Mrs (0.34) and Title_3_new_Miss (0.33) are moderately positively correlated with Survived: married and unmarried women were more likely to survive
# Title_3_new_Master (0.09) is weakly positive: young boys survived slightly more often
# Title_3_new_Officer (-0.03) is essentially uncorrelated with survival
# Title_3_new_Mr (-0.55) is a moderately strong negative correlation: adult men were much less likely to survive
# Overall, sex, fare and class matter most for survival, while age and port of embarkation matter less
```


```=
# 1
# 特徵重要性
from sklearn.ensemble import RandomForestClassifier
X = df_train_encoded.drop(['Survived'], axis=1)
y = df_train_encoded['Survived']
# 建立並擬合模型
model = RandomForestClassifier()
model.fit(X, y)
# 計算特徵重要性
feature_importance = model.feature_importances_
#
feature_importance_df = pd.DataFrame({
'Feature': X.columns,
'Importance': feature_importance
})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)
feature_importance_df
```
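
To make the ranking easier to read, the same importances can be drawn as a horizontal bar chart (a small sketch using the `feature_importance_df` computed above and the Plotly Express import from earlier):
```=
# Horizontal bar chart of the random forest feature importances
fig = px.bar(
    feature_importance_df.sort_values('Importance'),
    x='Importance',
    y='Feature',
    orientation='h',
    title='Random Forest Feature Importance'
)
fig.update_layout(width=800, height=600)
fig.show()
```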

```=
# 2
# Look at the age distribution
## matplotlib
# plt.figure(figsize=(8, 6))
# sns.boxplot(x=df_train['Age'], color='skyblue')
# plt.title('Age Quartiles')
# plt.xlabel('Age')
# plt.show()
fig = px.box(df_train, y='Age', points='all', color_discrete_sequence=['blue'])
#
fig.update_layout(
title='Age Quartiles',
yaxis_title='Age',
xaxis_title='',
width=800,
height=600
)
fig.show()
```

```=
# 2
# 各年齡組存活率
# agegroup_survival = df_train.groupby('AgeGroup')['Survived'].mean()
# #
# plt.figure(figsize=(10, 6))
# agegroup_survival.plot(kind='bar', color='skyblue')
# plt.title('Survival Rate by Age Group')
# plt.xlabel('Age Group')
# plt.ylabel('Survival Rate')
# plt.xticks(rotation=0)
# plt.show()
# #
# print(agegroup_survival)
agegroup_survival = df_train.groupby('AgeGroup')['Survived'].mean().reset_index()
#
fig = px.bar(
agegroup_survival,
x='AgeGroup',
y='Survived',
title='Survival Rate by Age Group',
labels={'Survived': 'Survival Rate', 'AgeGroup': 'Age Group'},
color_discrete_sequence=['skyblue']
)
fig.update_layout(xaxis_tickangle=0)
fig.update_layout(
width=1000,
height=600
)
fig.show()
```

```=
# 2
# 各年齡組存活率
agegroup_survival = df_train.groupby(['AgeGroup', 'Sex'])['Survived'].mean().reset_index()
#
fig = px.bar(
agegroup_survival,
x='AgeGroup',
y='Survived',
title='Survival Rate by AgeGroup and Sex',
color='Sex',
barmode='group',
labels={'Survived': 'Survival Rate', 'AgeGroup': 'AgeGroup'},
color_discrete_sequence=['lightcoral', 'lightblue']
)
fig.update_layout(xaxis_tickangle=0)
fig.update_layout(
width=1000,
height=600
)
fig.show()
```

```=
# 2
# 各年齡組存活人數
agegroup_survival_counts = df_train.groupby(['AgeGroup', 'Survived']).size().unstack()
# 顯示未存活的人數
trace1 = go.Bar(
x=agegroup_survival_counts.index,
y=agegroup_survival_counts[0],
name='Not Survived',
marker_color='lightcoral'
)
# 顯示存活的人數
trace2 = go.Bar(
x=agegroup_survival_counts.index,
y=agegroup_survival_counts[1],
name='Survived',
marker_color='lightgreen'
)
#
fig = go.Figure()
fig.add_trace(trace1)
fig.add_trace(trace2)
fig.update_layout(barmode='group')
fig.update_layout(
title='Passenger Count by Age Group and Survival Status',
xaxis_title='Age Group',
yaxis_title='Count',
legend_title='Survival Status',
width=1000,
height=600
)
fig.show()
#
print(agegroup_survival_counts)
```

```=
# 3
# 各艙等存活率
pclass_survival = df_train.groupby('Pclass')['Survived'].mean().reset_index()
#
fig = px.bar(
pclass_survival,
x='Pclass',
y='Survived',
title='Survival Rate by Pclass',
labels={'Survived': 'Survival Rate', 'Pclass': 'Pclass'},
color_discrete_sequence=['skyblue']
)
fig.update_layout(xaxis_tickangle=0)
fig.update_layout(
width=1000,
height=600
)
fig.show()
```

```=
# 3
# Survival rate by Pclass and Sex
pclass_sex_survival = df_train.groupby(['Pclass', 'Sex'])['Survived'].mean().reset_index()
#
fig = px.bar(
pclass_sex_survival,
x='Pclass',
y='Survived',
title='Survival Rate by Pclass and Sex',
color='Sex',
barmode='group',
labels={'Survived': 'Survival Rate', 'Pclass': 'Pclass'},
color_discrete_sequence=['lightcoral', 'lightblue']
)
fig.update_layout(xaxis_tickangle=0)
fig.update_layout(
width=1000,
height=600
)
fig.show()
```

```=
# 4
# 各港口存活率
embarked_survival = df_train.groupby('Embarked')['Survived'].mean().reset_index()
#
fig = px.bar(
embarked_survival,
x='Embarked',
y='Survived',
title='Survival Rate by Embarked',
labels={'Survived': 'Survival Rate', 'Embarked': 'Embarked'},
color_discrete_sequence=['skyblue']
)
fig.update_layout(xaxis_tickangle=0)
fig.update_layout(
width=1000,
height=600
)
fig.show()
```

```=
# 4
# 各港口存活率
embarked_survival = df_train.groupby(['Embarked', 'Sex'])['Survived'].mean().reset_index()
#
fig = px.bar(
embarked_survival,
x='Embarked',
y='Survived',
title='Survival Rate by Embarked and Sex',
color='Sex',
barmode='group',
labels={'Survived': 'Survival Rate', 'Embarked': 'Embarked'},
color_discrete_sequence=['lightcoral', 'lightblue']
)
fig.update_layout(xaxis_tickangle=0)
fig.update_layout(
width=1000,
height=600
)
fig.show()
```

```=
# 4
# 各港口存活人數
sns.countplot(x='Embarked', hue='Survived', data=df_train)
plt.xlabel('Embarked')
plt.ylabel('Count')
plt.title('Survived count by Embarked')
plt.show()
```

<br/>
## Baseline、ML
```=
# Compare model accuracy (seven models)
# Try several models and keep the most accurate. I tried seven here, and found that label-encoding
# both 'Sex' and 'Embarked' gives higher accuracy than one-hot encoding.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

columns_X = list(set(df_train_encoded.columns) - {'Survived', 'PassengerId', 'SibSp', 'Parch', 'AgeBin_Code_4', 'Sex_new_female', 'Sex_new_male', 'Embarked_new_C', 'Embarked_new_Q', 'Embarked_new_S'})
columns_y = ['Survived']
train_X = df_train_encoded[columns_X]
train_y = df_train_encoded[columns_y]

# Logistic Regression
log = LogisticRegression(random_state=0, max_iter=3000)
scores_log = cross_val_score(log, train_X, train_y.values.ravel(), cv=5, scoring='accuracy').mean()
print(scores_log)
# Decision Tree
decision_tree = DecisionTreeClassifier()
scores_decision_tree = cross_val_score(decision_tree, train_X, train_y.values.ravel(), cv=5, scoring='accuracy').mean()
print(scores_decision_tree)
# Random Forest
rfc = RandomForestClassifier(n_estimators=100)
scores_rfc = cross_val_score(rfc, train_X, train_y.values.ravel(), cv=5, scoring='accuracy').mean()
print(scores_rfc)
# Support Vector Machine
svc = SVC()
scores_svc = cross_val_score(svc, train_X, train_y.values.ravel(), cv=5, scoring='accuracy').mean()
print(scores_svc)
# KNN
knn = KNeighborsClassifier(n_neighbors=3)
scores_knn = cross_val_score(knn, train_X, train_y.values.ravel(), cv=5, scoring='accuracy').mean()
print(scores_knn)
# Gaussian Naive Bayes
gaussian = GaussianNB()
scores_gaussian = cross_val_score(gaussian, train_X, train_y.values.ravel(), cv=5, scoring='accuracy').mean()
print(scores_gaussian)
# Gradient Boosting Classifier
gradient = GradientBoostingClassifier()
scores_gradient = cross_val_score(gradient, train_X, train_y.values.ravel(), cv=5, scoring='accuracy').mean()
print(scores_gradient)
```
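
Collecting the seven cross-validation scores into a single table makes the comparison easier to read (a small sketch, assuming the `scores_*` variables from the block above):
```=
# Summarize the 5-fold CV accuracy of each baseline model
baseline_scores = pd.DataFrame({
    'Model': ['Logistic Regression', 'Decision Tree', 'Random Forest',
              'SVC', 'KNN', 'Gaussian NB', 'Gradient Boosting'],
    'CV Accuracy': [scores_log, scores_decision_tree, scores_rfc,
                    scores_svc, scores_knn, scores_gaussian, scores_gradient]
})
baseline_scores.sort_values(by='CV Accuracy', ascending=False)
```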

<br/>
## 測試集 Test set
Apply the same steps as for the training set.
```=
# 一、瞭解資料內容 Checking data content
# 環境準備,使用 Python NumPy、Pandas、Matplotlib、Plotly、Seaborn
# 資料來源 Data source
# 讀取資料、查看基本訊息 Import data 、View basic information
url = 'https://raw.githubusercontent.com/06Cata/Kaggle_Titanic/main/raw_data/test.csv'
df_test = pd.read_csv(url)
df_test.head(10)
```

```=
print(df_test.info())
print()
print((df_test['Cabin'].isna().sum()/df_test.shape[0]*100).round(2))
print()
print(df_test.isna().sum())
# Cabin 刪除,缺少 77%
# age 數值型,用性別、艙等平均補
# Embarked 用眾數補
# Fare 用同 Pclass 均價補
```

```=
# 二、資料清理 Data cleaning
# Age,數值型,用性別、艙等平均補
df_test['Age'].fillna(value=df_test.groupby(['Sex','Pclass'])['Age'].transform('mean'),inplace=True)
df_test
```

```=
# Embarked,用眾數補
mode_embarked = df_test['Embarked'].mode()[0]
df_test['Embarked'].fillna(value=mode_embarked, inplace=True)
df_test
```

```=
# Fare 用同 Sex, Pclass 均價補
df_test['Fare'].fillna(value=df_test.groupby(['Sex','Pclass'])['Fare'].transform('mean'),inplace=True)
df_test
```

```=
# Sex、Embarked 轉為 One-Hot
df_test['Sex_new'] = df_test['Sex'].copy()
df_test['Embarked_new'] = df_test['Embarked'].copy()
df_test = pd.get_dummies(df_test, columns=['Sex_new', 'Embarked_new'], prefix=['Sex_new', 'Embarked_new'])
df_test['Sex_new_female'] = df_test['Sex_new_female'].astype(int)
df_test['Sex_new_male'] = df_test['Sex_new_male'].astype(int)
df_test['Embarked_new_C'] = df_test['Embarked_new_C'].astype(int)
df_test['Embarked_new_Q'] = df_test['Embarked_new_Q'].astype(int)
df_test['Embarked_new_S'] = df_test['Embarked_new_S'].astype(int)
df_test
```

```=
# Sex、Embarked 轉為 LabelEncoder
from sklearn.preprocessing import LabelEncoder
label_encoder_sex_labeled = LabelEncoder()
df_test['Sex_labeled'] = label_encoder_sex_labeled.fit_transform(df_test['Sex'])
label_encoder_embarked_labeled = LabelEncoder()
df_test['Embarked_labeled'] = label_encoder_embarked_labeled.fit_transform(df_test['Embarked'])
df_test
```

```=
# 四、特徵工程 Feature engineering
# 年齡多一欄,設為年齡組
bins = [0, 21, 45, 65, 100]
labels = ['0-21', '22-45', '46-65', '66-100']
df_test['AgeGroup'] = pd.cut(df_test['Age'], bins=bins, labels=labels, right=False)
df_test['AgeGroup'] = df_test['AgeGroup'].cat.codes # 轉換為數值
df_test
```

```=
# 年齡組均分
#
df_test['AgeBin_4'] = pd.qcut(df_test['Age'],4)
label = LabelEncoder()
df_test['AgeBin_Code_4']=label.fit_transform(df_test['AgeBin_4'])
# cross tab
df_4 = pd.crosstab(df_test['AgeBin_Code_4'],df_test['Pclass'])
display(df_4)
df_test = df_test.drop(columns=['AgeBin_4'])
```

```=
# 家屬多設一欄,總親屬人數
df_test['Family_size'] = df_test['SibSp'].astype(int) + df_test['Parch'].astype(int) + 1
df_test
```

```=
# 提取名字中的頭銜
df_test['Title'] = df_test['Name'].str.split(", ", expand=True)[1]
df_test['Title_2'] = df_test['Title'].str.split(". ", expand=True)[0]
# Officer 專業人士:Captain、Colonel、Major、Doctor、Reverend
# Royalty 特殊地位:Jonkheer、Don、Sir、the Countess、Dona、Lady
title_encoding = {
"Capt":"Officer",
"Col":"Officer",
"Major":"Officer",
"Jonkheer":"Royalty",
"Don":"Royalty",
"Sir":"Royalty",
"Dr":"Officer",
"Rev": "Officer",
"the Countess":"Royalty",
"Dona":"Royalty",
"Mme":"Mrs",
"Mlle":"Miss",
"Ms":"Mrs",
"Mr":"Mr",
"Mrs":"Mrs",
"Miss":"Miss",
"Master":"Master",
"Lady":"Royalty"
}
df_test['Title_3'] = df_test['Title_2'].map(title_encoding)
df_test
```

```=
# Title_3轉為One-Hot
df_test['Title_3_new'] = df_test['Title_3'].copy()
df_test = pd.get_dummies(df_test, columns=['Title_3_new'], prefix=['Title_3_new'])
df_test['Title_3_new_Master'] = df_test['Title_3_new_Master'].astype(int)
df_test['Title_3_new_Miss'] = df_test['Title_3_new_Miss'].astype(int)
df_test['Title_3_new_Mr'] = df_test['Title_3_new_Mr'].astype(int)
df_test['Title_3_new_Mrs'] = df_test['Title_3_new_Mrs'].astype(int)
df_test['Title_3_new_Officer'] = df_test['Title_3_new_Officer'].astype(int)
df_test['Title_3_new_Royalty'] = df_test['Title_3_new_Royalty'].astype(int)
df_test = df_test.drop(columns=['Title','Title_2','Title_3'])
df_test
```

```=
# 票價
# Making bins
df_test['FareBin_6'] = pd.qcut(df_test['Fare'],6)
label = LabelEncoder()
df_test['FareBin_Code_6']=label.fit_transform(df_test['FareBin_6'])
# cross tab
df_6 = pd.crosstab(df_test['FareBin_Code_6'],df_test['Pclass'])
display(df_6)
```

```=
df_test_encoded = df_test.select_dtypes(include=[np.number])
```
<br/>
## 模型 Model
```=
print(df_train_encoded.columns)
print(df_test_encoded.columns)
```

```=
# 選擇要放進模型的欄位,一定要int
# 前面查看準確度,選擇準確度最高, columns - {'Survived', 'PassengerId', 'SibSp', 'Parch', 'AgeBin_Code_4', 'Sex_new_female', 'Sex_new_male', 'Embarked_new_C', 'Embarked_new_Q','Embarked_new_S'}
columns_X = list(set(df_train_encoded.columns) - {'Survived', 'PassengerId', 'SibSp', 'Parch', 'AgeBin_Code_4', 'Sex_new_female', 'Sex_new_male', 'Embarked_new_C', 'Embarked_new_Q','Embarked_new_S'})
columns_y = ['Survived']
train_X = df_train_encoded[columns_X]
train_y = df_train_encoded[columns_y]
```
```=
# Use the same feature columns, in the same order, as the training set
test_X = df_test_encoded[columns_X]
```
```=
# Random_Forest_Classifier
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(train_X, train_y.values.ravel())
#
test_y_list = rfc.predict(test_X)
df_test_encoded['Survived_pred'] = test_y_list
df_test_encoded.head()
```
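
Since this is a Kaggle competition, the same predictions can also be written out in the submission format (`PassengerId`, `Survived`); a small sketch assuming `df_test_encoded` still carries the `PassengerId` column:
```=
# Write the random forest predictions in Kaggle's submission format
submission = df_test_encoded[['PassengerId']].copy()
submission['Survived'] = test_y_list
submission.to_csv('submission.csv', index=False)
submission.head()
```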

```=
test_y = df_test_encoded[['Survived_pred']]
```
```=
print(train_X.shape)
print(train_y.shape)
print(test_X.shape)
print(test_y.shape)
```

```=
# Reference answers for the test set (Kaggle's gender_submission.csv sample submission, used here for scoring)
url = 'https://raw.githubusercontent.com/06Cata/Kaggle_Titanic/main/raw_data/gender_submission.csv'
df_test_result = pd.read_csv(url)
df_test_result.head(10)
```

```=
df_test_total = pd.merge(df_test_encoded, df_test_result, on='PassengerId')
df_test_total
```

### 損失率 Loss
```=
# Compute log loss; a fitted random forest gives a single fixed train/test loss (no epochs)
from sklearn.metrics import log_loss
train_proba = rfc.predict_proba(train_X)
train_loss = log_loss(train_y.values.ravel(), train_proba)
test_proba = rfc.predict_proba(test_X)
# Score the test probabilities against the reference answers merged in above
test_loss = log_loss(df_test_total['Survived'], test_proba)
print(f'Train Loss: {train_loss:.4f}')
print(f'Test Loss: {test_loss:.4f}')
# Plot the losses as flat lines to mimic an epoch curve
loss_list = [train_loss] * 10       # simulated training loss over several "epochs"
test_loss_list = [test_loss] * 10   # simulated test loss over several "epochs"
plt.plot(loss_list, linewidth=3)
plt.plot(test_loss_list, linewidth=3)
plt.legend(("Training Loss", "Validation Loss"))
plt.xlabel("Epoch")
plt.ylabel("Log Loss")
plt.show()
```

### 準確率 Accuracy
```=
# 準確率
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(df_test_total['Survived'], df_test_total['Survived_pred'])
print(f"Accuracy: {accuracy}")
```
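
Beyond a single accuracy figure, a confusion matrix and per-class report show where the predictions differ from the reference answers (a small sketch using scikit-learn's metrics on the same two columns):
```=
# Confusion matrix and per-class precision/recall for the random forest predictions
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(df_test_total['Survived'], df_test_total['Survived_pred']))
print(classification_report(df_test_total['Survived'], df_test_total['Survived_pred']))
```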

<br/>
## 試著用 PyTorch 探索神經網絡 Exploring a neural network with PyTorch
```=
# TensorFlow's to_categorical one-hot encodes the target (kept for reference; the CrossEntropyLoss used below takes raw class indices, so these arrays are not reused)
# PyTorch is used to build the neural network model
import tensorflow as tf
import torch
import torch.nn.functional as F
from sklearn.metrics import accuracy_score
train_y_onehot = tf.keras.utils.to_categorical(train_y, num_classes=2) # 兩個類別,0 和 1
test_y_onehot = tf.keras.utils.to_categorical(test_y, num_classes=2)
print(train_y_onehot.shape)
print(test_y_onehot.shape)
```

```=
# Define the model architecture
class Model(torch.nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(Model, self).__init__()
        self.hidden1 = torch.nn.Linear(input_size, hidden_size)   # linear layer: input_size -> hidden_size
        self.hidden2 = torch.nn.Linear(hidden_size, hidden_size)  # linear layer: hidden_size -> hidden_size
        self.predict = torch.nn.Linear(hidden_size, output_size)  # linear layer: hidden_size -> output_size (class scores)

    def forward(self, x):
        output1 = F.relu(self.hidden1(x))
        output2 = F.relu(self.hidden2(output1))
        output = self.predict(output2)
        return output
```
```=
# Initialize the model and optimizer
model = Model(test_X.shape[1], 32, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # momentum speeds up training
loss_func = torch.nn.CrossEntropyLoss()  # loss for multi-class classification

# Convert the data to torch tensors
train_X_data = torch.tensor(train_X.values, dtype=torch.float32)
train_y_data = torch.tensor(train_y.values.ravel(), dtype=torch.long)  # ravel() flattens to 1-D
test_X_data = torch.tensor(test_X.values, dtype=torch.float32)
test_y_data = torch.tensor(test_y.values.ravel(), dtype=torch.long)    # note: these are the random forest predictions, used as the test targets here

batch_size = 64
num_epochs = 150
num_batches = len(train_X) // batch_size
loss_list = []
test_loss_list = []

for epoch in range(num_epochs):
    for i in range(num_batches):
        start = i * batch_size
        end = start + batch_size
        prediction = model(train_X_data[start:end])
        loss = loss_func(prediction, train_y_data[start:end])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Training-set loss for this epoch
    loss = loss_func(model(train_X_data), train_y_data)
    loss_list.append(loss.item())
    # Test-set loss for this epoch
    test_loss = loss_func(model(test_X_data), test_y_data)
    test_loss_list.append(test_loss.item())
    # Print the training and test loss for each epoch
    print(f'Epoch [{epoch+1}/{num_epochs}], Train Loss: {loss.item():.4f}, Test Loss: {test_loss.item():.4f}')

# After training, evaluate accuracy
model.eval()
with torch.no_grad():
    train_predictions = torch.argmax(model(train_X_data), dim=1).numpy()
    test_predictions = torch.argmax(model(test_X_data), dim=1).numpy()
train_accuracy = accuracy_score(train_y.values.ravel(), train_predictions)
test_accuracy = accuracy_score(test_y_data.numpy(), test_predictions)
print(f'Train Accuracy: {train_accuracy:.4f}, Test Accuracy: {test_accuracy:.4f}')
```
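
The manual index slicing above can also be written with `TensorDataset` and `DataLoader`, which shuffle the batches each epoch; a minimal sketch assuming the tensors, model, optimizer and loss defined above (it continues training the same model, purely for illustration):
```=
# Equivalent mini-batch loop using DataLoader (shuffles each epoch)
from torch.utils.data import TensorDataset, DataLoader

train_ds = TensorDataset(train_X_data, train_y_data)
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)

model.train()  # switch back to training mode after the evaluation above
for epoch in range(5):  # short demonstration run
    for batch_X, batch_y in train_loader:
        prediction = model(batch_X)
        loss = loss_func(prediction, batch_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```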

```=
# Convert the test data to a torch tensor
x_test_data = torch.tensor(test_X.values, dtype=torch.float32)
# Predict with the trained model
with torch.no_grad():
    y_pred = model(x_test_data)
    y_pred = y_pred.argmax(1).numpy()  # class scores -> predicted class index
df_test_total['Survived_pred_pytorch'] = y_pred
df_test_total
```

### 損失率 Loss
```=
# Plot the training and test loss per epoch
plt.plot(loss_list, linewidth=3)
plt.plot(test_loss_list, linewidth=3)
plt.legend(("Training Loss", "Validation Loss"))
plt.xlabel("Epoch")
plt.ylabel("Cross-Entropy Loss")
plt.show()
```

### 準確率 Accuracy
```=
# Accuracy of the PyTorch predictions against the reference answers
accuracy_score(df_test_total['Survived'], df_test_total['Survived_pred_pytorch'])
```
