# 201912 E.SUN Data Competition Notes
# Feature Engineering
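* Number of times the customer (bacno) used this card in the last three days
* Cumulative number of times the bacno has used this card
* Difference in transaction amount, by (id / card number)
* Difference in transaction sequence number, by (id / card number)
* Share of transaction volume, by (id / card number)
* Recent-spending weight, by (id / card number)
* Time difference (in days) from the most recent transaction with amount 0
* Number of transactions by the same id / card number within the same hour
* Weekday vs. holiday flag
* For each card number, before this transaction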
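* Mean and standard deviation (std) of the account's spending at the same merchant / country / currency / time / mcc, plus the ratio of each transaction amount to that mean
* Cumulative number of transactions by the account (bacno) at the same merchant, country, currency, mcc!!!
* Card number (cano): total number of transactions, total amount, and average transaction amount within the same day!!!
  and the cumulative number of transactions at the same merchant, country, currency, mcc
* Ratio of the card's average spending behavior over the past three hours, three days, and the same day to the account's spending behavior
* Time difference between this transaction and each of the three transactions before and after it on the same card (time data)
* Time difference to each of the three transactions before and after on the same card at the same merchant, country (X)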
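* Frequency-encode merchant, country, currency, mcc, etc., then compute group mean and std; take the ratio of the card's same-day mean to the account's overall mean
* One-hot encode categorical features such as time period, transaction category, transaction type, payment type, and status code, then compute per-card group means of the one-hot values
* Group by most features and compute the mean / std of conam (the same-day and weekly average transaction amount for each user's card)
* Mean of the other features grouped by bacno
* Fraud risk ratio: the proportion of fraudulent transactions for every country
* Transaction behavior in a given day or hour divided by the account's overall behavior (e.g. the card's number of transactions at the same merchant that day / the account's average daily number of transactions at the same merchant)
* Standard deviation of transaction times within the day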
* the time difference between the transaction and a transaction with zero amount (conam), if one exists
* the minimum/maximum amount (conam) of transactions on the card (cano) during a day (locdt)
* the difference in days (locdt) between the first and last transaction with the same card (cano) and merchant (mchno)
* the n-th transaction with the same card (cano) and merchant (mchno)
* the difference in days (locdt) between the transaction and the last transaction with the same card (cano)
* the difference in days (locdt) between card A (canoA) and card B (canoB) of the same user (bacno)
* special features (whitelist/blacklist) of the merchant (mchno) for the same user (bacno)
* special features: blacklist of transaction amounts (conam) for the same card (cano)
* special features: the difference in days (locdt) between the transaction and the first fraudulent transaction (if one exists)
* number of transactions for each cano in 7 days!!!
* total transaction amount (conam) for each cano in 7 days!!!
* number of consumption countries (stocn) for each cano in 2 days
* number of scity/acqic/stocn/csmcu/etymd/hcefg/contp for each cano
* correlation between a record and other records at a similar time
* number of transactions for each day
* number of transactions - mean(number of transactions for each day)
* number of transactions for each bacno/acqic/cano
* the average of conam for each bacno
* conam - the average of conam for each bacno!!!
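A few of the per-card and per-account aggregates in the list above can be sketched in pandas. This is only a minimal sketch, assuming a DataFrame with the columns `bacno`, `cano`, `mchno`, `locdt` and `conam`; the function name `add_basic_features` and the output column names are hypothetical, not taken from the original pipeline. Note that the account-level mean here is computed over all rows passed in, so in practice it would need to respect the train/validation split.

```python
import pandas as pd


def add_basic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add a handful of illustrative group-by features (hypothetical names)."""
    # Process rows in day order; a stable sort keeps the input order within a day.
    out = df.sort_values("locdt", kind="stable").copy()

    # Card (cano) activity within the same day (locdt): count, total and mean of conam.
    per_card_day = out.groupby(["cano", "locdt"])["conam"]
    out["cano_day_cnt"] = per_card_day.transform("count")
    out["cano_day_sum"] = per_card_day.transform("sum")
    out["cano_day_mean"] = per_card_day.transform("mean")

    # conam minus the account (bacno) average conam.
    bacno_mean = out.groupby("bacno")["conam"].transform("mean")
    out["conam_minus_bacno_mean"] = out["conam"] - bacno_mean

    # Days since the previous transaction on the same card (NaN for the first one).
    out["days_since_prev_cano"] = out.groupby("cano")["locdt"].diff()

    # n-th transaction with the same card and merchant (1-based).
    out["nth_cano_mchno"] = out.groupby(["cano", "mchno"]).cumcount() + 1

    # Return the rows sorted by their original index.
    return out.sort_index()
```

Calling `add_basic_features(df)` returns a copy with the extra columns; the same `groupby(...).transform(...)` pattern extends to the 7-day and per-merchant variants if a rolling window or a different grouping key is substituted.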
# Sampling Methods
* Splitting folds by time is important, to avoid training an overfit model
* groupby bacno
* group kfold
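The two splitting ideas above could look roughly like this. It is a sketch under assumptions: the data has a day column `locdt` and an account column `bacno`, the helper names `time_ordered_folds` and `group_folds` are hypothetical, and scikit-learn's `GroupKFold` is used for the grouped split. How the two were actually combined in the competition is not stated in these notes.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold


def time_ordered_folds(df: pd.DataFrame, n_splits: int = 3, day_col: str = "locdt"):
    """Yield (train_index, valid_index) where validation days come after training days."""
    days = np.sort(df[day_col].unique())
    # Cut the distinct days into n_splits + 1 consecutive blocks.
    blocks = np.array_split(days, n_splits + 1)
    for k in range(1, n_splits + 1):
        train_days = np.concatenate(blocks[:k])
        valid_days = blocks[k]
        train_mask = df[day_col].isin(train_days).to_numpy()
        valid_mask = df[day_col].isin(valid_days).to_numpy()
        yield df.index[train_mask], df.index[valid_mask]


def group_folds(df: pd.DataFrame, n_splits: int = 3, group_col: str = "bacno"):
    """Yield (train_index, valid_index) so that no account appears on both sides."""
    splitter = GroupKFold(n_splits=n_splits)
    for train_pos, valid_pos in splitter.split(df, groups=df[group_col]):
        yield df.index[train_pos], df.index[valid_pos]
```

The time-ordered variant guards against learning from the future, while the grouped variant keeps all transactions of one `bacno` in the same fold so that account-level features do not leak across the split.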
# Modeling Techniques
* XGBoost / CatBoost / LightGBM
* Used AP
* Treat categorical values that appear in the training set but not in the testing set as NA
* binary encoding
* pseudo labeling
* Train an autoencoder on non-fraud data; if a record cannot be decoded properly (input - output > threshold), it is determined to be fraud
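Two of the preprocessing items above, masking categories that never occur in the test set and binary encoding, can be sketched as follows. This is an illustration only: the function names `mask_unseen_categories` and `binary_encode` are hypothetical, and the bit layout (least significant bit first, code 0 reserved for missing values) is one possible convention, not necessarily the one used in the competition.

```python
import numpy as np
import pandas as pd


def mask_unseen_categories(train: pd.DataFrame, test: pd.DataFrame, cols) -> pd.DataFrame:
    """Return a copy of train where values absent from test are replaced by NA."""
    out = train.copy()
    for col in cols:
        seen_in_test = set(test[col].dropna().unique())
        # Series.where keeps values where the condition holds and sets the rest to NaN.
        out[col] = out[col].where(out[col].isin(seen_in_test))
    return out


def binary_encode(series: pd.Series, prefix: str) -> pd.DataFrame:
    """Encode a categorical series as the binary digits of its integer code."""
    codes, _ = pd.factorize(series)  # missing values get code -1
    codes = codes + 1                # shift so that 0 means "missing"
    n_bits = max(int(codes.max()).bit_length(), 1)
    # Column i holds bit i of the code (least significant bit first).
    bits = (codes[:, None] >> np.arange(n_bits)) & 1
    columns = [f"{prefix}_bit{i}" for i in range(n_bits)]
    return pd.DataFrame(bits, index=series.index, columns=columns)
```

Compared with one-hot encoding, the binary form needs only about log2 of the number of categories in columns, which helps for high-cardinality fields such as merchant codes.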
# Open Questions
* Data leakage (change of cano / stscd: consecutive and unique 2 / ecfg: consecutive and unique 1)

* binary encoding
* entropy base regularization
* boosting(avoid overfit)
```python
import numpy as np

# df_test is assumed to already hold the sub-model scores (sub_* columns)
# and the special features listed at the bottom.
threshold = 0.36
# Use the base model by default
df_test['fraud_ind'] = np.where(df_test['sub_base_model']> threshold, 1, 0)
# Merchant previously seen as fraudulent: the fraud-merchant model decides
df_test['fraud_ind'] = np.where((df_test['mchno_in_fraud_mchno_list']==1) & (df_test['sub_fraud_mchno_model'] > threshold), 1, df_test['fraud_ind'])
df_test['fraud_ind'] = np.where((df_test['mchno_in_fraud_mchno_list']==1) & (df_test['sub_fraud_mchno_model'] <= threshold), 0, df_test['fraud_ind'])
# At least one day after the card's first fraud: the first-fraud model can only raise the flag
df_test['fraud_ind'] = np.where((df_test['mchno_in_fraud_mchno_list']==0) & (df_test['diff_with_first_fraud_locdt'] >= 1) & (df_test['sub_first_fraud_model'] > threshold), 1, df_test['fraud_ind'])
# Merchant previously seen as normal: the normal-merchant model decides
df_test['fraud_ind'] = np.where((df_test['mchno_in_fraud_mchno_list']==0) & (df_test['mchno_in_normal_mchno_list']>0) & (df_test['sub_normal_mchno_model'] > threshold), 1, df_test['fraud_ind'])
df_test['fraud_ind'] = np.where((df_test['mchno_in_fraud_mchno_list']==0) & (df_test['mchno_in_normal_mchno_list']>0) & (df_test['sub_normal_mchno_model'] <= threshold), 0, df_test['fraud_ind'])
# Amount previously seen as abnormal: the fraud-amount model decides
df_test['fraud_ind'] = np.where((df_test['mchno_in_fraud_mchno_list']==0) & (df_test['conam_in_fraud_conam_list']==1) & (df_test['sub_fraud_conam_model'] > threshold), 1, df_test['fraud_ind'])
df_test['fraud_ind'] = np.where((df_test['mchno_in_fraud_mchno_list']==0) & (df_test['conam_in_fraud_conam_list']==1) & (df_test['sub_fraud_conam_model'] <= threshold), 0, df_test['fraud_ind'])
df_test[['txkey','fraud_ind']].to_csv('sub_{}.csv'.format(threshold),index = False)
special_feautures = [
'mchno_in_normal_mchno_list',# this merchant appeared in past transactions and was normal
'mchno_in_fraud_mchno_list',# this merchant appeared in past transactions and was fraudulent
'conam_in_fraud_conam_list',# this amount appeared in past transactions and was abnormal
'diff_with_first_fraud_locdt'# days (locdt) between this transaction and the card's first transaction flagged as fraud
]
```
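The notes do not show how the merchant history flags were built. One possible reading, sketched here under assumptions, is to look only at earlier transactions of the same user (`bacno`) at the same merchant (`mchno`). The sketch assumes a labeled frame with the columns `bacno`, `mchno`, `locdt` and a 0/1 label `fraud_ind`; the function name `add_merchant_history_flags` is hypothetical, and how labels were made available for test-period rows is not covered.

```python
import pandas as pd


def add_merchant_history_flags(df: pd.DataFrame) -> pd.DataFrame:
    """Flag merchants the same user has used before, split by past label."""
    # Process rows in day order; a stable sort keeps the input order within a day.
    out = df.sort_values("locdt", kind="stable").copy()
    by_user_merchant = out.groupby(["bacno", "mchno"])

    # Fraud labels on earlier rows of the same (bacno, mchno) pair, current row excluded.
    prior_fraud = by_user_merchant["fraud_ind"].cumsum() - out["fraud_ind"]
    # Earlier rows of the pair minus the fraudulent ones gives the earlier normal rows.
    prior_normal = by_user_merchant.cumcount() - prior_fraud

    out["mchno_in_fraud_mchno_list"] = (prior_fraud > 0).astype(int)
    out["mchno_in_normal_mchno_list"] = prior_normal
    # Return the rows sorted by their original index.
    return out.sort_index()
```

Here `mchno_in_normal_mchno_list` is kept as a count rather than a 0/1 flag, which matches the `> 0` comparison used in the rule cascade.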