# 201912 E.SUN (玉山) Data Competition Notes

# Feature Engineering

* Number of times the account (bacno) used this card in the last three days
* Cumulative number of times bacno has used this card
* Difference in transaction amount, by account / card
* Difference in transaction index (n-th transaction), by account / card
* Share of transaction volume, by account / card
* Recency-weighted spending, by account / card
* Days since the most recent zero-amount transaction
* Number of transactions by the same account / card within the same hour
* Weekday vs. weekend flag
* Per-card statistics computed over transactions before this one
* Account's average spend (mean) and standard deviation (std) at the same merchant / country / currency / time / mcc, and the ratio of each transaction amount to that average
* Account's (bacno) cumulative transaction count at the same merchant, country, currency, and mcc (!!!)
* Card's (cano) total transaction count, total amount, and average amount within the same day (!!!); cumulative transaction count at the same merchant, country, currency, and mcc
* Ratio of the card's average spending behavior over the past three hours / three days / same day to the account's spending behavior
* Time difference between this transaction and each of the card's preceding three transactions (time data)
* Time difference between the card's three transactions before and after at the same merchant / country (X)
* Frequency-encode merchant, country, currency, mcc, etc., then compute group mean and std; ratio of the card's same-day average to the account's overall average
* One-hot encode categorical features such as time bucket, transaction category, transaction type, payment type, and status code, then compute the card-level group mean of the one-hot values
* Group by many features and compute mean / std of conam (e.g. each card's same-day average vs. weekly average transaction amount)
* Mean of other features grouped by bacno
* Fraud risk ratio: the fraud rate of every country
* Behavior in a given day / hour relative to the account's overall behavior (e.g. the card's same-day transaction count at a merchant / the account's average daily transaction count at that merchant)
* Standard deviation of transaction times within a day
* The time difference between the transaction and a zero-amount (conam) transaction, if one exists
* The minimum / maximum transaction amount (conam) of the card (cano) during a day (locdt)
* The days (locdt) difference between the first and last transaction with the same card (cano) and merchant (mchno)
* The n-th transaction with the same card (cano) and merchant (mchno)
* The days (locdt) difference between the transaction and the last transaction with the same card (cano)
* The days (locdt) difference between cardA (canoA) and cardB (canoB) of the same user (bacno)
* Special features (whitelist / blacklist) of the merchant (mchno) for the same user (bacno)
* Special feature: blacklist of transaction amounts (conam) for the same card (cano)
* Special feature: the days (locdt) difference between the transaction and the first fraudulent transaction (if one exists)
* Number of transactions for each cano in 7 days (!!!)
* Total transaction amount (conam) for each cano in 7 days (!!!)
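Several of the groupby features above (per-group mean/std of conam, the ratio of each amount to its group mean, and cumulative counts) can be sketched with pandas. The sample frame below is hypothetical toy data; only the column names mirror the competition fields (bacno, cano, mchno, conam):

```python
import pandas as pd

# Hypothetical toy transactions: bacno = account, cano = card,
# mchno = merchant, conam = transaction amount.
df = pd.DataFrame({
    'bacno': [1, 1, 1, 2, 2],
    'cano':  [10, 10, 11, 20, 20],
    'mchno': ['A', 'A', 'B', 'A', 'C'],
    'conam': [100.0, 300.0, 50.0, 80.0, 120.0],
})

# Per-(account, merchant) mean and std of spend, broadcast back to each row.
grp = df.groupby(['bacno', 'mchno'])['conam']
df['bacno_mchno_conam_mean'] = grp.transform('mean')
df['bacno_mchno_conam_std'] = grp.transform('std')

# Ratio of each transaction amount to its group's average spend.
df['conam_to_group_mean_ratio'] = df['conam'] / df['bacno_mchno_conam_mean']

# Cumulative count of the account's prior transactions at the same merchant.
df['bacno_mchno_cum_count'] = df.groupby(['bacno', 'mchno']).cumcount()
```

The same `groupby(...).transform(...)` pattern covers the card-level and day-level variants by swapping the grouping keys.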
* Number of consumption countries (stocn) for each cano in 2 days
* Number of distinct scity / acqic / stocn / csmcu / etymd / hcefg / contp values for each cano
* Correlation between a transaction and transactions at similar times
* Number of transactions for each day
* Number of transactions minus the mean transaction count per day
* Number of transactions for each bacno / acqic / cano
* The average conam for each bacno
* conam minus the average conam for each bacno (!!!)

# Sampling

* Splitting folds by time is important; it avoids training an overfit model
* Group by bacno
* GroupKFold

# Modeling Techniques

* xgboost / catboost / lgbm
* Used AP
* Treat categorical values that appear in the training set but not in the test set as NA
* Binary encoding
* Pseudo labeling
* Autoencoder trained on non-fraud data only; if a sample cannot be decoded properly (|input − output| > threshold), flag it as fraud

# Questions

* Leaked data (card (cano) changes / stscd runs that are consecutive and uniquely 2 / ecfg runs that are consecutive and uniquely 1) ![](https://i.imgur.com/z6TaRMg.jpg)
* Binary encoding
* Entropy-based regularization
* Boosting (avoiding overfitting)

# References

1. ![](https://i.imgur.com/KrgKZDM.png)
2. ![](https://i.imgur.com/CzDUwQF.png)
3. ![](https://i.imgur.com/zunf7EA.jpg)
4. ![](https://i.imgur.com/QxgX1Sz.jpg)
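The sampling ideas above (group by bacno, group k-fold) can be sketched with scikit-learn's `GroupKFold`, which guarantees the same account never appears on both the training and validation side of a fold. The arrays below are hypothetical toy data:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy features and labels; groups holds the bacno of each transaction.
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 0, 1, 0])
groups = np.array([1, 1, 2, 2, 3, 3])

gkf = GroupKFold(n_splits=3)
for train_idx, val_idx in gkf.split(X, y, groups):
    # The same bacno never lands on both sides of a split.
    assert set(groups[train_idx]).isdisjoint(set(groups[val_idx]))
```

For the time-based variant mentioned above, one option is to sort by locdt and use expanding-window splits (scikit-learn's `TimeSeriesSplit` follows that pattern), so the model never validates on days earlier than its training data.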
Post-processing rules for the final submission:

```python
import numpy as np

threshold = 0.36

# Use the base model's prediction by default.
df_test['fraud_ind'] = np.where(df_test['sub_base_model'] > threshold, 1, 0)

# Merchant seen in past fraudulent transactions: defer to the fraud-merchant model.
df_test['fraud_ind'] = np.where((df_test['mchno_in_fraud_mchno_list'] == 1) & (df_test['sub_fraud_mchno_model'] > threshold), 1, df_test['fraud_ind'])
df_test['fraud_ind'] = np.where((df_test['mchno_in_fraud_mchno_list'] == 1) & (df_test['sub_fraud_mchno_model'] <= threshold), 0, df_test['fraud_ind'])

# Merchant not on the fraud list but the card has a prior fraud day: first-fraud model.
df_test['fraud_ind'] = np.where((df_test['mchno_in_fraud_mchno_list'] == 0) & (df_test['diff_with_first_fraud_locdt'] >= 1) & (df_test['sub_first_fraud_model'] > threshold), 1, df_test['fraud_ind'])

# Merchant seen only in normal transactions: defer to the normal-merchant model.
df_test['fraud_ind'] = np.where((df_test['mchno_in_fraud_mchno_list'] == 0) & (df_test['mchno_in_normal_mchno_list'] > 0) & (df_test['sub_normal_mchno_model'] > threshold), 1, df_test['fraud_ind'])
df_test['fraud_ind'] = np.where((df_test['mchno_in_fraud_mchno_list'] == 0) & (df_test['mchno_in_normal_mchno_list'] > 0) & (df_test['sub_normal_mchno_model'] <= threshold), 0, df_test['fraud_ind'])

# Amount seen in past fraudulent transactions: defer to the fraud-amount model.
df_test['fraud_ind'] = np.where((df_test['mchno_in_fraud_mchno_list'] == 0) & (df_test['conam_in_fraud_conam_list'] == 1) & (df_test['sub_fraud_conam_model'] > threshold), 1, df_test['fraud_ind'])
df_test['fraud_ind'] = np.where((df_test['mchno_in_fraud_mchno_list'] == 0) & (df_test['conam_in_fraud_conam_list'] == 1) & (df_test['sub_fraud_conam_model'] <= threshold), 0, df_test['fraud_ind'])

df_test[['txkey', 'fraud_ind']].to_csv('sub_{}.csv'.format(threshold), index=False)

special_features = [
    'mchno_in_normal_mchno_list',   # this merchant appeared in past transactions and was normal
    'mchno_in_fraud_mchno_list',    # this merchant appeared in past transactions and was fraudulent
    'conam_in_fraud_conam_list',    # this amount appeared in past transactions and was fraudulent
    'diff_with_first_fraud_locdt',  # transaction days since this card was first judged fraudulent
]
```
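The special features consumed by the rules above are assumed to be built from labeled history. A minimal sketch of constructing the merchant fraud/normal lists, on hypothetical toy frames whose column names mirror the snippet:

```python
import pandas as pd

# Hypothetical labeled history: past transactions with known fraud_ind.
df_hist = pd.DataFrame({
    'mchno': ['A', 'A', 'B', 'C'],
    'fraud_ind': [1, 0, 0, 0],
})

# A merchant that ever appeared in a fraudulent transaction goes on the
# fraud list; the normal list holds merchants seen only in normal ones.
fraud_mchno = set(df_hist.loc[df_hist['fraud_ind'] == 1, 'mchno'])
normal_mchno = set(df_hist.loc[df_hist['fraud_ind'] == 0, 'mchno']) - fraud_mchno

# Flag test transactions against the two lists.
df_test = pd.DataFrame({'mchno': ['A', 'B', 'D']})
df_test['mchno_in_fraud_mchno_list'] = df_test['mchno'].isin(fraud_mchno).astype(int)
df_test['mchno_in_normal_mchno_list'] = df_test['mchno'].isin(normal_mchno).astype(int)
```

The conam blacklist (`conam_in_fraud_conam_list`) follows the same pattern with amounts in place of merchant ids.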