# Subscription Prediction

> Main compiler: Cynthia
> Motivation: gain valuable insights for subscription prediction
> Research note:

[2020-03-19]

* prediction of transactions (excluding ISP and auto-subscription users)
    * XGBoost model
    * data preprocessing: shift variables according to their correlation
    * period A (12 weeks)
        * ![](https://i.imgur.com/PumaXLA.png)
    * period B (12 weeks)
        * ![](https://i.imgur.com/kfatw3c.png)
* Pearson correlation with transactions (excluding ISP and auto-subscription users) (2010~2019)
    * ![](https://i.imgur.com/noimlCQ.png)
* Pearson correlation with transactions (excluding ISP and auto-subscription users) (2017~2019)
    * ![](https://i.imgur.com/RUed8YK.png)
* Pearson correlation with transactions (excluding ISP and auto-subscription users) (2019~)
    * ![](https://i.imgur.com/7bM6RKP.png)

[2020-02-27]

* total transactions form a seasonal time series
* split by user type: old / new user
    * discovery: old -> seasonal, new -> irregular
* split into autorenew / non-autorenew
    * discovery: autorenew -> seasonal, non-autorenew or first transaction of autorenew -> irregular
    * autorenew:
        * ![](https://gitlab.kkinternal.com/researchcenter/rdc-image/raw/master/ds/cynthiayang/trans_renew.png)
    * non-autorenew or first transaction of autorenew:
        * ![](https://gitlab.kkinternal.com/researchcenter/rdc-image/raw/master/ds/cynthiayang/trans_nonrenew.png)

[2020-02-20]

* motivation:
    * predict the amount of conversion in advance (e.g. a couple of months ahead)
    * find the insights correlated with the amount
* dataset
    * internal
        * downloads, registration, churn, transaction, user count, stream count
    * external
        * downloads of competing products
        * ptt / dcard discussion
        * commercial cost
        * app store ratings
        * fan page posts and likes
* process (internal only)
    * data exploration
        * discovery: feature importance changes a lot between 2017, 2018 and 2019
    * model
        * XGBoost
        * VAR (vector autoregression)
        * rmse: about 10,000 (conversion: 40,000)
    * difficulties
        * the current features may not be representative enough for prediction
        * monthly data (36x80) is too little and may be biased
* restart with weekly data
    * transactions prediction
        * model: AR (lags: 24 weeks)
        * result: rmse: 41586.133
        * ![](https://gitlab.kkinternal.com/researchcenter/rdc-image/raw/master/ds/cynthiayang/trans_prediction.png)
    * difference prediction
        * ![](https://gitlab.kkinternal.com/researchcenter/rdc-image/raw/master/ds/cynthiayang/trans_prediction_diff.png)
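
The weekly AR model (lags: 24 weeks) and its rmse can be illustrated with a minimal sketch. The note does not say which library or train/test split was used, so everything below is an assumption: the AR coefficients are fit by ordinary least squares with NumPy, the function names (`fit_ar`, `forecast`, `rmse`) are hypothetical, the series is synthetic, and the 12-week holdout is chosen only for illustration.

```python
import numpy as np


def make_lag_matrix(series, n_lags):
    """Build the regression design: each row is [1, y[t-1], ..., y[t-n_lags]]."""
    y = np.asarray(series, dtype=float)
    rows = [y[t - n_lags:t][::-1] for t in range(n_lags, len(y))]
    X = np.column_stack([np.ones(len(rows)), np.array(rows)])
    return X, y[n_lags:]


def fit_ar(series, n_lags):
    """Fit AR(n_lags) by least squares; returns [intercept, phi_1, ..., phi_p]."""
    X, target = make_lag_matrix(series, n_lags)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef


def forecast(history, coef, steps):
    """Recursive multi-step forecast: each prediction is fed back as a lag."""
    n_lags = len(coef) - 1
    window = [float(v) for v in np.asarray(history, dtype=float)[-n_lags:]]
    preds = []
    for _ in range(steps):
        lags = np.array(window[::-1])  # most recent value first
        nxt = float(coef[0] + coef[1:] @ lags)
        preds.append(nxt)
        window = window[1:] + [nxt]
    return np.array(preds)


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Synthetic weekly series (NOT the real transaction data): yearly seasonality plus noise.
rng = np.random.default_rng(0)
weeks = np.arange(200)
series = 40000 + 8000 * np.sin(2 * np.pi * weeks / 52) + rng.normal(0, 1500, size=200)

# Hold out the last 12 weeks (an assumed split) and fit on the rest.
train, test = series[:-12], series[-12:]
coef = fit_ar(train, n_lags=24)
preds = forecast(train, coef, steps=len(test))
print("holdout rmse:", rmse(test, preds))
```

Because the rmse is in the same unit as the series, it is easiest to read next to the typical weekly level, as the note does when it quotes the rmse alongside the conversion amount.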