---
title: Project07
tags: teach:MF
---
# Santander Customer Transaction Prediction
- Motivations and Why
Kaggle competition: Santander Customer Transaction Prediction
(https://www.kaggle.com/c/santander-customer-transaction-prediction)
(1) Motivation: I found this competition on Kaggle and consider it a fairly practical, market-driven problem. The task is to work out which variables have predictive power for customer transactions. All variables are anonymised, so they cannot be judged subjectively; instead the signal has to be teased out through EDA before fitting prediction models.
(2) Keywords: Customer Transaction, lightgbm
- [data](https://www.kaggle.com/c/santander-customer-transaction-prediction/data)
### Variables
| File | Contents |
| ---- | ---- |
| train.csv | `ID_code`, 200 anonymised explanatory variables (var_0–var_199), and the binary response `target` (0/1) |
| test.csv | `ID_code` and the same 200 anonymised explanatory variables, without `target` |
### 1. EDA
- Import packages
```python=
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import lightgbm as lgb
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score, auc, confusion_matrix, classification_report
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from IPython.display import display

# Display settings for full-width (CJK) characters in pandas output
pd.set_option('display.unicode.ambiguous_as_wide', True)
pd.set_option('display.unicode.east_asian_width', True)
warnings.filterwarnings('ignore')
```
- First, load the data into pandas DataFrames and check how many rows each file has and what the data look like.
```python=
# Read the csv files into DataFrames
test_df = pd.read_csv('test.csv')
train_df = pd.read_csv('train.csv')
# Check the number of rows/columns and preview the data
print(train_df.shape, test_df.shape)
display(test_df.head())
display(train_df.head())
```
- Check for missing values: count the missing entries per column and their percentage, and use concat to combine the two summaries into one table. Neither the test nor the train data turns out to contain missing values that need handling.
```python=
def missing_data(data):
    # Count of missing values and their percentage per column
    total = data.isnull().sum()
    percent = (data.isnull().sum() / data.isnull().count() * 100)
    tt = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    # Record each column's dtype as well
    types = []
    for col in data.columns:
        dtype = str(data[col].dtype)
        types.append(dtype)
    tt['Types'] = types
    return np.transpose(tt)

# Check both the test and the train data
print(missing_data(test_df))
print(missing_data(train_df))
# No missing values; look at the summary statistics instead
train_df.describe()
```

- Check whether the test and train data are related: if a clear positive or negative trend appeared between them, the two datasets would not be suitable for the later modelling. The scatter of train versus test values for each variable looks random, so the data appear reasonable. X-axis: train data; Y-axis: test data.
```python=
def plot_feature_scatter(df1, df2, features):
    # Scatter plot of df1 vs df2 for each feature, arranged in a 4x4 grid
    i = 0
    sns.set_style('whitegrid')
    fig, ax = plt.subplots(4, 4, figsize=(14, 14))
    for feature in features:
        i += 1
        plt.subplot(4, 4, i)
        plt.scatter(df1[feature], df2[feature], marker='+')
        plt.xlabel(feature, fontsize=9)
    plt.show()

features = ['var_0', 'var_1', 'var_2', 'var_3', 'var_4', 'var_5', 'var_6', 'var_7',
            'var_8', 'var_9', 'var_10', 'var_11', 'var_12', 'var_13', 'var_14', 'var_15']
# Subsample every 20th row to keep the plots light
plot_feature_scatter(train_df[::20], test_df[::20], features)
```

- Our goal is to predict whether `target` is 0 or 1; what `target` actually represents is not disclosed. Counting how many rows have target = 1 and target = 0 shows that the large majority are 0, so I assume the interest is in identifying the target = 1 cases.
```python=
# About 10% of the training rows have target = 1
sns.countplot(train_df['target'], palette='Set3')
```
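
To put a number on the imbalance rather than reading it off the bar chart, a minimal check (using the same `train_df` loaded above):
```python=
# Exact class counts and proportions of the target
print(train_df['target'].value_counts())
print(train_df['target'].value_counts(normalize=True))  # roughly 0.9 vs 0.1
```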

- Next, with so many variables, how do we tell which are useful and which carry little information? For each variable we can plot the estimated density of the train data separately for target = 0 and target = 1. If a variable has no explanatory power, the two curves should look like the same roughly normal distribution; variables whose class-conditional distributions differ noticeably are candidates for useful features.
```python=
def plot_feature_distribution(df1, df2, label1, label2, features):
    # Overlay the density of each feature for the two classes in a 10x10 grid
    i = 0
    sns.set_style('whitegrid')
    fig, ax = plt.subplots(10, 10, figsize=(18, 22))
    for feature in features:
        i += 1
        plt.subplot(10, 10, i)
        sns.distplot(df1[feature], hist=False, label=label1)
        sns.distplot(df2[feature], hist=False, label=label2)
        plt.xlabel(feature, fontsize=9)
        locs, labels = plt.xticks()
        plt.tick_params(axis='x', which='major', labelsize=6, pad=-6)
        plt.tick_params(axis='y', which='major', labelsize=6)
    plt.show()

# Split the training data by class and plot the first and second 100 features
t0 = train_df.loc[train_df['target'] == 0]
t1 = train_df.loc[train_df['target'] == 1]
features = train_df.columns.values[2:102]
plot_feature_distribution(t0, t1, '0', '1', features)
features = train_df.columns.values[102:202]
plot_feature_distribution(t0, t1, '0', '1', features)
```


- The pairwise correlations between the variables are negligible; there is no meaningful correlation between features.
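
This can be verified numerically; below is a minimal sketch (assuming `train_df` is loaded as above) that computes the absolute pairwise Pearson correlations among the 200 anonymised features and reports the strongest one:
```python=
# Absolute pairwise correlations between the anonymised features
feat_cols = [c for c in train_df.columns if c not in ['ID_code', 'target']]
corr = train_df[feat_cols].corr().abs()
# Zero out the diagonal (self-correlation) before taking the maximum
np.fill_diagonal(corr.values, 0)
print('Largest |correlation| between any two features:', corr.values.max())
```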


### 2. Benchmark method: logistic model
$P(Y=1 \mid X=x)=\dfrac{e^{\beta^\top x}}{1+e^{\beta^\top x}}$
$Y = \text{target}$
$X = \text{the other variables}$
$\text{test split} = 0.3$
- First split off the `ID_code` and `target` columns; the remaining variables are split into train and test sets (30% test), and a logistic regression is fitted as the benchmark.
```python=
# Drop the ID (and target) columns to obtain the feature matrices
train_log = train_df.drop(columns=['target', 'ID_code'])
test_log = test_df.drop(columns=['ID_code'])
Target = train_df['target']

# Hold out 30% of the training data for evaluation
X_train, X_test, Y_train, Y_test = train_test_split(train_log, Target, test_size=0.3, random_state=2019)

logist = LogisticRegression()
logist.fit(X_train, Y_train)
logist_pred = logist.predict_proba(X_test)[:, 1]

def performance(Y_test, pred_proba):
    # Confusion matrix at a 0.5 threshold; AUC from the predicted probabilities
    Y_predict = [0 if i < 0.5 else 1 for i in pred_proba]
    print('Confusion Matrix:')
    print(confusion_matrix(Y_test, Y_predict))
    fpr, tpr, thresholds = roc_curve(Y_test, pred_proba, pos_label=1)
    print('AUC:')
    print(auc(fpr, tpr))

performance(Y_test, logist_pred)
```
#### Results

### 3. Decision tree
```python=
# Decision tree with balanced class weights to cope with the ~10% positive rate
tree_clf = DecisionTreeClassifier(class_weight='balanced', random_state=2019,
                                  max_features=0.7, min_samples_leaf=80)
tree_clf.fit(X_train, Y_train)
tree_preds = tree_clf.predict_proba(X_test)[:, 1]
performance(Y_test, tree_preds)
```
#### Results

### 4. LightGBM model
Train a LightGBM model and use stratified K-fold cross-validation to check the model's stability.
LightGBM is a gradient-boosting framework built on decision tree learners, developed along the same lines as XGBoost; its main advantages are faster training and lower memory usage.
```python=
features = [c for c in train_df.columns if c not in ['ID_code', 'target']]
target = train_df['target']

# LightGBM parameters
param = {
    'bagging_freq': 5,
    'bagging_fraction': 0.4,
    'boost_from_average': 'false',
    'boost': 'gbdt',
    'feature_fraction': 0.05,
    'learning_rate': 0.01,
    'max_depth': -1,
    'metric': 'auc',
    'min_data_in_leaf': 80,
    'min_sum_hessian_in_leaf': 10.0,
    'num_leaves': 13,
    'num_threads': 8,
    'tree_learner': 'serial',
    'objective': 'binary',
    'verbosity': 1
}

# Run the model with 10-fold stratified cross-validation
folds = StratifiedKFold(n_splits=10, shuffle=False, random_state=None)
oof = np.zeros(len(train_df))
predictions = np.zeros(len(test_df))
feature_importance_df = pd.DataFrame()
for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_df.values, target.values)):
    print("Fold {}".format(fold_))
    trn_data = lgb.Dataset(train_df.iloc[trn_idx][features], label=target.iloc[trn_idx])
    val_data = lgb.Dataset(train_df.iloc[val_idx][features], label=target.iloc[val_idx])
    num_round = 1000000
    clf = lgb.train(param, trn_data, num_round, valid_sets=[trn_data, val_data],
                    verbose_eval=1000, early_stopping_rounds=3000)
    # Out-of-fold predictions for the validation indices
    oof[val_idx] = clf.predict(train_df.iloc[val_idx][features], num_iteration=clf.best_iteration)
    # Collect feature importances for this fold
    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    # Average the test-set predictions over the folds
    predictions += clf.predict(test_df[features], num_iteration=clf.best_iteration) / folds.n_splits

print("CV score: {:<8.5f}".format(roc_auc_score(target, oof)))

# Plot the mean feature importance across folds
cols = (feature_importance_df[["Feature", "importance"]]
        .groupby("Feature")
        .mean()
        .sort_values(by="importance", ascending=False)[:150].index)
best_features = feature_importance_df.loc[feature_importance_df.Feature.isin(cols)]
plt.figure(figsize=(14, 28))
sns.barplot(x="importance", y="Feature", data=best_features.sort_values(by="importance", ascending=False))
plt.title('Features importance (averaged/folds)')
plt.tight_layout()
plt.savefig('FI.png')
```
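
To turn the fold-averaged test predictions into a Kaggle submission, a minimal sketch (assuming the usual two-column `ID_code` / `target` submission format for this competition):
```python=
# Write the averaged test-set predictions to a submission file
submission = pd.DataFrame({'ID_code': test_df['ID_code'], 'target': predictions})
submission.to_csv('submission.csv', index=False)
```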


### 5. Analysis and Conclusion
- 1. This is a typical classification problem in which the variables carry no names and no domain knowledge can be used, so we have to rely on EDA. For the benchmark model I also tried a logistic regression using only the ten variables whose target = 1 and target = 0 distributions looked most different, chosen by eye; it was clearly less accurate than the regression on all variables. Even with the distribution plots, purely intuitive feature selection is unreliable (a rough sketch of this comparison follows the list).
- 2. The decision tree's predictive performance is about the same as the benchmark method.
- 3. LightGBM classifies significantly better than the two models above, showing a clear advantage for this kind of feature-based classification.
- 4. The feature importances are positively related to the distributional differences discussed earlier, but the eye cannot pick out all of the influential features. A drawback of LightGBM is that it overfits easily, so with a small sample it can still make sense to select variables from their distributions.
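
As an illustration of the comparison in point 1, here is a minimal sketch that ranks features by the standardised gap between their class means (an assumed proxy for "most different-looking distributions", not the subjective selection actually used) and refits the benchmark logistic regression on the top ten; it reuses `X_train`, `X_test`, `Y_train`, `Y_test`, and `logist_pred` from section 2.
```python=
# Rank features by |mean(target=1) - mean(target=0)| / std as a stand-in
# for the visual selection described in point 1
feat_cols = [c for c in train_df.columns if c not in ['ID_code', 'target']]
gap = (train_df.loc[train_df['target'] == 1, feat_cols].mean()
       - train_df.loc[train_df['target'] == 0, feat_cols].mean()).abs() / train_df[feat_cols].std()
top10 = gap.sort_values(ascending=False).index[:10].tolist()

# Benchmark logistic regression restricted to the ten selected features
logist10 = LogisticRegression(max_iter=1000)
logist10.fit(X_train[top10], Y_train)
print('AUC, top-10 features :', roc_auc_score(Y_test, logist10.predict_proba(X_test[top10])[:, 1]))
print('AUC, all 200 features:', roc_auc_score(Y_test, logist_pred))
```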