# Churn Modeling - Experiment Journal 📝
###### tags: `projects`, `yotta`
[TOC]
**note:** commands below assume `alias py='poetry run python'`
Global params set:
- beta: `2`
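All `fbeta` numbers below use beta = 2, which weighs recall roughly twice as much as precision: missing a churner is costlier than a false alarm. A minimal sketch of such a scorer with plain scikit-learn (the project's actual wiring is assumed, not shown):
```python
# F-beta scorer with beta=2 (recall-leaning), usable by GridSearchCV,
# cross_val_score, etc. Mirrors the journal's metric.
from sklearn.metrics import fbeta_score, make_scorer

fbeta2_scorer = make_scorer(fbeta_score, beta=2)
```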
## 0.0 Absolute baseline: random classifier
params set:
- strategy: `'uniform'`
```bash
py -m churn.app.train data/train/ --seed 0 --model DummyClassifier --name 00_random
```
```bash
INFO [train]: fbeta score on train: 0.3966
INFO [train]: fbeta score on validation: 0.3757
INFO [train]: classification_report (validation):
precision recall f1-score support
False 0.80 0.49 0.61 1701
True 0.19 0.50 0.28 414
accuracy 0.49 2115
macro avg 0.50 0.49 0.44 2115
weighted avg 0.68 0.49 0.54 2115
```
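For intuition, a self-contained sketch of what this baseline amounts to, on synthetic data with a similar class balance (numbers will differ slightly from the run above):
```python
# Coin-flip baseline: recall ~0.5, precision ~ prevalence (~0.2),
# so F2 ~ 5*0.2*0.5 / (4*0.2 + 0.5) ~ 0.38 -- the score to beat.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import fbeta_score

rng = np.random.default_rng(0)
X = rng.normal(size=(2115, 5))        # placeholder features
y = rng.random(2115) < 0.2            # ~20% positives, like the validation split
clf = DummyClassifier(strategy="uniform", random_state=0).fit(X, y)
print(fbeta_score(y, clf.predict(X), beta=2))
```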
## 1.0 Baseline: LogisticRegression
Pipeline:
- `SimpleImputer` (num:mean | cat:unknown)
- `StandardScaler` (num)
- `OneHotEncoder` (cat)
- `LogisticRegression`
Features added:
- `'uw_year'`
- `'uw_month'`
Features dropped:
- `'uw_date'`
- `'name'`
params set: (defaults)
- max_iter: `300`
- penalty: `'l2'`
```bash
py -m churn.app.train data/train/ --seed 0 --model LogisticRegression --name 10_baseline
```
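A sketch of this pipeline with plain scikit-learn objects; the column lists are hypothetical placeholders (the real ones live in the project's feature config):
```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUM_COLS = ["age", "balance", "nb_products", "uw_year", "uw_month"]  # assumed
CAT_COLS = ["country", "gender"]                                     # assumed

num_pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="unknown"),
    OneHotEncoder(handle_unknown="ignore"),
)
preproc = ColumnTransformer([("num", num_pipe, NUM_COLS), ("cat", cat_pipe, CAT_COLS)])
model = Pipeline([
    ("preproc", preproc),
    ("clf", LogisticRegression(max_iter=300, penalty="l2")),
])
```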
#### without outlier imputation
```bash
INFO [train]: fbeta score on train: 0.2288
INFO [train]: fbeta score on validation: 0.2256
INFO [train]: classification_report (validation):
precision recall f1-score support
False 0.83 0.97 0.89 1701
True 0.58 0.20 0.29 414
accuracy 0.82 2115
macro avg 0.71 0.58 0.59 2115
weighted avg 0.78 0.82 0.78 2115
```
#### with outlier imputation
Pipeline:
- `MaxAbsImputer()` (num: `[age, nb_products]`)
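`MaxAbsImputer` is project code; a plausible sketch of a transformer in the same spirit (clip extreme values to a fitted bound), not the actual implementation:
```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MaxAbsImputer(BaseEstimator, TransformerMixin):
    """Hypothetical reconstruction: clip |x| above a fitted quantile bound."""

    def __init__(self, quantile=0.99):
        self.quantile = quantile

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.max_abs_ = np.nanquantile(np.abs(X), self.quantile, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.clip(X, -self.max_abs_, self.max_abs_)
```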
```bash
INFO [train]: fbeta score on train: 0.2479
INFO [train]: fbeta score on validation: 0.2369
INFO [train]: classification_report (validation):
precision recall f1-score support
False 0.83 0.96 0.89 1701
True 0.54 0.21 0.30 414
accuracy 0.81 2115
macro avg 0.69 0.58 0.60 2115
weighted avg 0.78 0.81 0.77 2115
```
### 1.1 Baseline + Poly
Pipeline:
- `PolynomialFeatures(N)`
```bash
py -m churn.app.train data/train/ --seed 0 --model LogisticRegression --name 11_baseline_poly
```
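The expansion sits in the numeric branch of the baseline sketch; degree `N` is what the subsections below vary:
```python
# Numeric branch with the polynomial expansion added (shown for N=2);
# scaling after expansion keeps the higher-order terms comparable.
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

num_pipe = make_pipeline(
    SimpleImputer(strategy="mean"),
    PolynomialFeatures(2, include_bias=False),
    StandardScaler(),
)
```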
#### PolynomialFeatures(2)
```bash
INFO [train]: fbeta score on train: 0.4884
INFO [train]: fbeta score on validation: 0.4846
INFO [train]: classification_report (validation):
precision recall f1-score support
False 0.88 0.95 0.91 1701
True 0.71 0.45 0.55 414
accuracy 0.86 2115
macro avg 0.79 0.70 0.73 2115
weighted avg 0.84 0.86 0.84 2115
```
#### PolynomialFeatures(3)
```bash
INFO [train]: fbeta score on train: 0.5251
INFO [train]: fbeta score on validation: 0.5162
INFO [train]: classification_report (validation):
precision recall f1-score support
False 0.88 0.95 0.91 1701
True 0.69 0.49 0.57 414
accuracy 0.86 2115
macro avg 0.79 0.72 0.74 2115
weighted avg 0.85 0.86 0.85 2115
```
#### PolynomialFeatures(4)
params set:
- max_iter: `1500`
```bash
INFO [train]: fbeta score on train: 0.5432
INFO [train]: fbeta score on validation: 0.5256
INFO [train]: classification_report (validation):
precision recall f1-score support
False 0.89 0.95 0.92 1701
True 0.70 0.50 0.58 414
accuracy 0.86 2115
macro avg 0.79 0.72 0.75 2115
weighted avg 0.85 0.86 0.85 2115
```
#### PolynomialFeatures(5)
params set:
- max_iter: `2000`
```bash
INFO [train]: fbeta score on train: 0.5585
INFO [train]: fbeta score on validation: 0.5238
INFO [train]: classification_report (validation):
precision recall f1-score support
False 0.88 0.94 0.91 1701
True 0.68 0.50 0.57 414
accuracy 0.86 2115
macro avg 0.78 0.72 0.74 2115
weighted avg 0.84 0.86 0.85 2115
```
### 1.2 Baseline + Poly(4) + domain
```bash
py -m churn.app.train data/train/ --seed 0 --model LogisticRegression --name 12_baseline_poly_domain
```
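A sketch of the two domain features tried below, assuming pandas columns named `balance` and `salary` (the actual column names are the project's):
```python
import numpy as np
import pandas as pd

def add_domain_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["balance_is_null"] = (out["balance"] == 0).astype(int)      # empty-account flag
    out["inverse_salary"] = 1.0 / out["salary"].replace(0, np.nan)  # avoid div-by-zero
    return out
```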
#### `'balance_is_null'` (num)
```bash
INFO [train]: fbeta score on train: 0.5447
INFO [train]: fbeta score on validation: 0.5274
INFO [train]: classification_report (validation):
precision recall f1-score support
False 0.89 0.95 0.92 1701
True 0.69 0.50 0.58 414
accuracy 0.86 2115
macro avg 0.79 0.72 0.75 2115
weighted avg 0.85 0.86 0.85 2115
```
#### `'inverse_salary'` (num)
```bash
INFO [train]: fbeta score on train: 0.5463
INFO [train]: fbeta score on validation: 0.5133
INFO [train]: classification_report (validation):
precision recall f1-score support
False 0.88 0.95 0.91 1701
True 0.68 0.48 0.57 414
accuracy 0.86 2115
macro avg 0.78 0.71 0.74 2115
weighted avg 0.84 0.86 0.85 2115
```
### 1.3 Baseline + Poly(4) + Domain + PCA
```bash
py -m churn.app.train data/train/ --seed 0 --model LogisticRegression --name 13_baseline_poly_domain_pca
```
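The two orderings compared below, sketched on the numeric branch only (`n_components` left at its default since the project's setting is not logged):
```python
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# PCA -> Poly: expand the decorrelated components.
pca_then_poly = make_pipeline(
    SimpleImputer(strategy="mean"), StandardScaler(),
    PCA(), PolynomialFeatures(4, include_bias=False),
)
# Poly -> PCA: compress the expanded feature space.
poly_then_pca = make_pipeline(
    SimpleImputer(strategy="mean"),
    PolynomialFeatures(4, include_bias=False), StandardScaler(),
    PCA(),
)
```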
#### PCA -> Poly
Pipeline:
- `PCA` (num)
- `PolynomialFeatures(4)` (num)
```bash
INFO [train]: fbeta score on train: 0.5439
INFO [train]: fbeta score on validation: 0.5274
INFO [train]: classification_report (validation):
precision recall f1-score support
False 0.89 0.95 0.92 1701
True 0.69 0.50 0.58 414
accuracy 0.86 2115
macro avg 0.79 0.72 0.75 2115
weighted avg 0.85 0.86 0.85 2115
```
#### Poly -> PCA
Pipeline:
- `PolynomialFeatures(4)` (num)
- `PCA` (num)
```bash
INFO [train]: fbeta score on train: 0.5446
INFO [train]: fbeta score on validation: 0.5297
INFO [train]: classification_report (validation):
precision recall f1-score support
False 0.89 0.95 0.92 1701
True 0.69 0.50 0.58 414
accuracy 0.86 2115
macro avg 0.79 0.72 0.75 2115
weighted avg 0.85 0.86 0.85 2115
```
### 1.4 Baseline + Poly + Domain + PCA + Balancing
params set:
- class_weight: `'balanced'`
```bash
py -m churn.app.train data/train/ --seed 0 --model LogisticRegression --name 14_baseline_poly_domain_pca_balancing
```
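What `class_weight='balanced'` does: each class gets weight `n_samples / (n_classes * count_c)`, so the ~20% churners count roughly 2.5x as much as the ~80% stayers. A quick check with the validation-split proportions:
```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 1701 + [1] * 414)
print(compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y))
# -> [0.6217... 2.5543...]
```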
```bash
INFO [train]: fbeta score on train: 0.7009
INFO [train]: fbeta score on validation: 0.6721
INFO [train]: classification_report (validation):
precision recall f1-score support
False 0.93 0.78 0.85 1701
True 0.46 0.76 0.57 414
accuracy 0.78 2115
macro avg 0.70 0.77 0.71 2115
weighted avg 0.84 0.78 0.80 2115
```
### 1.5 Baseline + Poly + Domain + PCA + Balancing + GridSearch(cv=5)
params set:
- cv: `5`
- max_iter: `1500`
- penalty: `'l2'`
- class_weight: `'balanced'`
- C:
- `1.e-3`
- `1.e-2`
- `1.e-1`
- `1.e-0`
- `1.e+1`
```bash
py -m churn.app.train data/train/ --seed 0 --model LogisticRegression --name 15_baseline_poly_domain_pca_balancing_gridsearch
```
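A sketch of the search, assuming the full section-1.4 pipeline is bound to `model` with its classifier step named `clf` (naming is a guess), and the `fbeta2_scorer` from the top of this journal:
```python
import numpy as np
from sklearn.model_selection import GridSearchCV

param_grid = {
    "clf__C": np.logspace(-3, 1, 5),  # 1e-3 .. 1e+1
    "clf__class_weight": ["balanced"],
    "clf__max_iter": [1500],
    "clf__penalty": ["l2"],
}
search = GridSearchCV(model, param_grid, scoring=fbeta2_scorer, cv=5)
# search.fit(X_train, y_train); search.best_params_  # -> C: 10.0, ...
```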
```bash
INFO [train]: best parameters found:
INFO [train]: C: 10.0
INFO [train]: class_weight: balanced
INFO [train]: max_iter: 1500
INFO [train]: penalty: l2
INFO [train]: fbeta score on train: 0.7008
INFO [train]: fbeta score on validation: 0.6745
INFO [train]: classification_report (validation):
precision recall f1-score support
False 0.93 0.79 0.85 1701
True 0.46 0.76 0.58 414
accuracy 0.78 2115
macro avg 0.70 0.77 0.71 2115
weighted avg 0.84 0.78 0.80 2115
```
### 1.6 Baseline + Poly + Domain + PCA + Balancing + GridSearch(cv=5) + ElasticNet
params set:
- cv: `5`
- max_iter: `2000`
- class_weight: `'balanced'`
- solver: `'saga'`
- penalty: `'elasticnet'`
- l1_ratio:
- `0`
- `.25`
- `.5`
- `.75`
- `1`
- C:
- `1.e-3`
- `1.e-2`
- `1.e-1`
- `1.e-0`
- `1.e+1`
```bash
py -m churn.app.train data/train/ --seed 0 --model LogisticRegression --name 16_baseline_poly_domain_pca_balancing_gridsearch_elasticnet
```
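The widened grid, sketched under the same naming assumptions; `saga` is the only scikit-learn solver that supports `penalty='elasticnet'`:
```python
import numpy as np

param_grid = {
    "clf__C": np.logspace(-3, 1, 5),
    "clf__l1_ratio": [0, 0.25, 0.5, 0.75, 1],  # 0 = pure l2, 1 = pure l1
    "clf__penalty": ["elasticnet"],
    "clf__solver": ["saga"],
    "clf__max_iter": [2000],
    "clf__class_weight": ["balanced"],
}
```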
:warning: `ConvergenceWarning` raised (`saga` did not converge within `max_iter=2000`)
```bash
INFO [train]: best parameters found:
INFO [train]: C: 1.0
INFO [train]: class_weight: balanced
INFO [train]: l1_ratio: 0.5
INFO [train]: max_iter: 2000
INFO [train]: penalty: elasticnet
INFO [train]: solver: saga
INFO [train]: fbeta score on train: 0.6957
INFO [train]: fbeta score on validation: 0.6602
INFO [train]: classification_report (validation):
precision recall f1-score support
False 0.93 0.79 0.85 1701
True 0.46 0.74 0.57 414
accuracy 0.78 2115
macro avg 0.69 0.76 0.71 2115
weighted avg 0.83 0.78 0.80 2115
```
### 1.7 Baseline + Poly + Domain + PCA + Balancing + Optuna(cv=5)
#### large search
params set:
- n_iter: `100`
- cv: `5`
- max_iter: `1000`
- solver: `'lbfgs'`
- penalty: `'l2'`
- class_weight: `'balanced'`
- C: [`1.e-4`, `1.e+4`] (log-uniform)
- tol: [`1.e-6`, `1.e-1`] (log-uniform)
```bash
py -m churn.app.train data/train/ --seed 0 --model LogisticRegression --optimizer optuna --name 17_baseline_poly_domain_pca_balancing_optuna
```
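A sketch of how the Optuna study might drive this search, assuming the section-1.4 `model` pipeline, the `fbeta2_scorer`, and in-memory `X_train`/`y_train` (the project's `--optimizer optuna` wrapper is not shown):
```python
import optuna
from sklearn.model_selection import cross_val_score

def objective(trial: optuna.Trial) -> float:
    model.set_params(
        clf__C=trial.suggest_float("C", 1e-4, 1e4, log=True),
        clf__tol=trial.suggest_float("tol", 1e-6, 1e-1, log=True),
    )
    return cross_val_score(model, X_train, y_train, scoring=fbeta2_scorer, cv=5).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
```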
```bash
INFO [train]: best parameters found:
INFO [train]: max_iter: 1000
INFO [train]: penalty: l2
INFO [train]: solver: lbfgs
INFO [train]: class_weight: balanced
INFO [train]: C: 8.205190044532984
INFO [train]: tol: 0.0038608177813462497
INFO [train]: fbeta score on train: 0.7009
INFO [train]: fbeta score on validation: 0.6727
INFO [train]: classification_report (validation):
precision recall f1-score support
False 0.93 0.79 0.85 1701
True 0.46 0.76 0.58 414
accuracy 0.78 2115
macro avg 0.70 0.77 0.71 2115
weighted avg 0.84 0.78 0.80 2115
```
#### thin search
params set:
- C: [`10`, `10.25`] (uniform)
- tol: [`.9e-4`, `.925e-4`] (uniform)
```bash
INFO [train]: fbeta score on train: 0.7002
INFO [train]: fbeta score on validation: 0.6745
INFO [train]: classification_report (validation):
precision recall f1-score support
False 0.93 0.79 0.85 1701
True 0.46 0.76 0.58 414
accuracy 0.78 2115
macro avg 0.70 0.77 0.71 2115
weighted avg 0.84 0.78 0.80 2115
```
### 1.8 Baseline + Poly + Domain + PCA + Balancing + Optuna(cv=5) + ElasticNet
```bash
py -m churn.app.train data/train/ --seed 0 --model LogisticRegression --optimizer optuna --name 18_baseline_poly_domain_pca_balancing_optuna_elasticnet
```
params set:
- cv: `5`
- max_iter: `1000`
- solver: `'lbfgs'`
- penalty: `'l2'`
- class_weight: `'balanced'`
- C: [`10`, `10.25`] (uniform)
- tol: [`.9e-4`, `.925e-4`] (uniform)
:warning: `ConvergenceWarning` raised (solver did not converge)
:black_square_for_stop: Killed early: iterations were slow with no score improvement.
## 2.0 SVC
### 2.1 SVC Baseline: Linear + Domain + PCA + Balancing + Optuna(cv=5)
Pipeline:
- `PolynomialFeatures(4)` removed
params set:
- n_iter: `30`
- cv: `5`
- kernel: `linear`
- class_weight: `'balanced'`
- C: [`1.e-4`, `1.e+4`] (log-uniform)
```bash
py -m churn.app.train data/train/ --seed 0 --model SVC --optimizer optuna --name 20_svc_linear
```
:black_square_for_stop: Killed early: iterations were slow with no score improvement.
### 2.2 SVC Baseline + Poly(4)
Pipeline:
- `PolynomialFeatures(4)` added
params set:
- n_iter: `30`
- cv: `5`
- kernel: `poly`
- degree: [`1`, `8`] (log-uniform)
- class_weight: `'balanced'`
- C: [`1.e-4`, `1.e+4`] (log-uniform)
- gamma:
```bash
py -m churn.app.train data/train/ --seed 0 --model SVC --optimizer optuna --name 21_svc_poly
```
:black_square_for_stop: Manually pruned: iterations were slow with no score improvement. Polynomial features don't help SVC; the `poly` kernel already models polynomial interactions implicitly.
### 2.3 SVC Poly + Numerical features only
Pipeline:
- `PolynomialFeatures(4)` removed
- `num_pipeline` features only
params set:
- n_iter: `30`
- cv: `5`
- kernel: `poly`
- degree: [`1`, `8`] (uniform)
- class_weight: `'balanced'`
- C: [`1.e-4`, `1.e+4`] (log-uniform)
- gamma: [`1.e-4`, `1.e+1`] (log-uniform)
```bash
py -m churn.app.train data/train/ --seed 0 --model SVC --optimizer optuna --name 22_svc_poly_num
```
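The estimator behind this search, sketched as a plain `SVC` built from an Optuna trial (helper name hypothetical):
```python
from sklearn.svm import SVC

def build_svc(trial):
    return SVC(
        kernel="poly",
        class_weight="balanced",
        C=trial.suggest_float("C", 1e-4, 1e4, log=True),
        degree=trial.suggest_int("degree", 1, 8),
        gamma=trial.suggest_float("gamma", 1e-4, 1e1, log=True),
    )
```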
```bash
INFO [train]: best parameters found:
INFO [train]: kernel: poly
INFO [train]: class_weight: balanced
INFO [train]: C: 0.00016704116949323001
INFO [train]: degree: 4
INFO [train]: gamma: 6.720500489672588
INFO [train]: fbeta score on train: 0.5406
INFO [train]: fbeta score on validation: 0.5086
INFO [train]: classification_report (validation):
precision recall f1-score support
False 0.88 0.81 0.84 1701
True 0.41 0.54 0.47 414
accuracy 0.76 2115
macro avg 0.64 0.68 0.66 2115
weighted avg 0.79 0.76 0.77 2115
```
## 3.0 RandomForest Baseline: Linear + Domain + PCA + Balancing + Optuna(cv=5)
```bash
py -m churn.app.train data/train/ --seed 0 --model RandomForestClassifier --optimizer optuna --name 30_randomforest
```
#### large search
Pipeline:
- `PolynomialFeatures(4)` removed
params set:
- n_iter: `30`
- cv: `5`
- criterion: `gini`
- class_weight: `'balanced'`
- max_features: `'auto'`
- max_depth: [`3`, `10`] (uniform)
- min_samples_leaf: [`2`, `5`] (uniform)
- min_samples_split: [`2`, `5`] (uniform)
- n_estimators: [`50`, `250`] (uniform)
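How this space might map onto Optuna suggestions (helper name hypothetical; integer ranges via `suggest_int`):
```python
from sklearn.ensemble import RandomForestClassifier

def build_rf(trial):
    return RandomForestClassifier(
        criterion="gini",
        class_weight="balanced",
        max_features="auto",  # as logged; pre-1.1 sklearn alias for 'sqrt' on classifiers
        max_depth=trial.suggest_int("max_depth", 3, 10),
        min_samples_leaf=trial.suggest_int("min_samples_leaf", 2, 5),
        min_samples_split=trial.suggest_int("min_samples_split", 2, 5),
        n_estimators=trial.suggest_int("n_estimators", 50, 250),
        random_state=0,
    )
```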
```bash
INFO [train]: best parameters found:
INFO [train]: bootstrap: l2
INFO [train]: criterion: gini
INFO [train]: class_weight: balanced
INFO [train]: max_depth: 6
INFO [train]: max_features: auto
INFO [train]: min_samples_leaf: 4
INFO [train]: min_samples_split: 4
INFO [train]: n_estimators: 116
INFO [train]: fbeta score on train: 0.7193
INFO [train]: fbeta score on validation: 0.6610
INFO [train]: classification_report (validation):
precision recall f1-score support
False 0.93 0.78 0.85 1701
True 0.45 0.75 0.56 414
accuracy 0.77 2115
macro avg 0.69 0.76 0.70 2115
weighted avg 0.83 0.77 0.79 2115
```
#### thin search
Pipeline:
- `PolynomialFeatures(4)` removed
params set:
- n_iter: `30`
- cv: `5`
- criterion: `gini`
- class_weight: `'balanced'`
- max_features: `'auto'`
- max_depth: [`2`, `6`] (uniform)
- min_samples_leaf: [`3`, `4`] (uniform)
- min_samples_split: [`3`, `4`] (uniform)
- n_estimators: [`50`, `500`] (uniform)
```bash
INFO [train]: best parameters found:
INFO [train]: bootstrap: l2
INFO [train]: criterion: gini
INFO [train]: class_weight: balanced
INFO [train]: max_depth: 6
INFO [train]: max_features: auto
INFO [train]: min_samples_leaf: 4
INFO [train]: min_samples_split: 4
INFO [train]: n_estimators: 122
INFO [train]: fbeta score on train: 0.7179
INFO [train]: fbeta score on validation: 0.6618
INFO [train]: classification_report (validation):
precision recall f1-score support
False 0.93 0.78 0.85 1701
True 0.45 0.75 0.56 414
accuracy 0.77 2115
macro avg 0.69 0.76 0.71 2115
weighted avg 0.83 0.77 0.79 2115
```
#### large search bis
Pipeline:
- `PolynomialFeatures(4)` removed
params set:
- n_iter: `50`
- cv: `5`
- criterion: `gini`
- class_weight: `'balanced'`
- max_features: `'auto'`
- max_depth: [`2`, `8`] (uniform)
- min_samples_leaf: [`3`, `5`] (uniform)
- min_samples_split: [`3`, `5`] (uniform)
- n_estimators: [`50`, `500`] (uniform)
```bash
INFO [train]: best parameters found:
INFO [train]: bootstrap: l2
INFO [train]: criterion: gini
INFO [train]: class_weight: balanced
INFO [train]: max_depth: 6
INFO [train]: max_features: auto
INFO [train]: min_samples_leaf: 5
INFO [train]: min_samples_split: 4
INFO [train]: n_estimators: 313
INFO [train]: fbeta score on train: 0.7199
INFO [train]: fbeta score on validation: 0.6652
INFO [train]: classification_report (validation):
precision recall f1-score support
False 0.93 0.78 0.85 1701
True 0.45 0.75 0.57 414
accuracy 0.77 2115
macro avg 0.69 0.77 0.71 2115
weighted avg 0.84 0.77 0.79 2115
```
### 3.1 RandomForest + Poly(4) + Domain + PCA + Balancing + Optuna(cv=5)
Pipeline:
- `PolynomialFeatures(4)` added
params set:
- n_iter: `50`
- cv: `5`
- criterion: `gini`
- class_weight: `'balanced'`
- max_features: `'auto'`
- max_depth: [`3`, `8`] (uniform)
- min_samples_leaf: [`3`, `5`] (uniform)
- min_samples_split: [`3`, `5`] (uniform)
- n_estimators: [`50`, `500`] (uniform)
```bash
py -m churn.app.train data/train/ --seed 0 --model RandomForestClassifier --optimizer optuna --name 31_randomforest_poly
```
:black_square_for_stop: Manually pruned: iterations were slow with no score improvement. Polynomial features don't help trees; splits already capture non-linear interactions.
### 3.2 RandomForest + Linear + Domain + Balancing + Optuna(cv=5)
Pipeline:
- `PolynomialFeatures(4)` removed
- `PCA` removed
params set:
- n_iter: `50`
- cv: `5`
- criterion: `gini`
- class_weight: `'balanced'`
- max_features: `'auto'`
- max_depth: [`3`, `8`] (uniform)
- min_samples_leaf: [`3`, `5`] (uniform)
- min_samples_split: [`3`, `5`] (uniform)
- n_estimators: [`50`, `500`] (uniform)
```bash
py -m churn.app.train data/train/ --seed 0 --model RandomForestClassifier --optimizer optuna --name 32_randomforest_poly_nopca
```
```bash
INFO [train]: fbeta score on train: 0.6979
INFO [train]: fbeta score on validation: 0.6314
INFO [train]: classification_report (validation):
precision recall f1-score support
False 0.91 0.82 0.87 1701
True 0.48 0.68 0.57 414
accuracy 0.80 2115
macro avg 0.70 0.75 0.72 2115
weighted avg 0.83 0.80 0.81 2115
```
## 4.0 GradientBoosting Baseline: Linear + Domain + PCA + Balancing + Optuna(cv=5)
```bash
py -m churn.app.train data/train/ --seed 0 --model GradientBoostingClassifier --optimizer optuna --name 40_gradientboosting
```
#### large search
Pipeline:
- `PolynomialFeatures(4)` removed
params set:
- n_iter: `30`
- cv: `5`
- criterion: `friedman_mse`
- learning_rate: [`1.e-4`, `1.e+1`] (log-uniform)
- max_features: `'auto'`
- max_depth: [`2`, `8`] (uniform)
- min_samples_leaf: [`3`, `5`] (uniform)
- min_samples_split: [`3`, `5`] (uniform)
- subsample: [`.3`, `.7`] (uniform)
- n_estimators: [`50`, `300`] (uniform)
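The corresponding estimator sketch; only `learning_rate` is sampled log-uniformly, per the list above (helper name hypothetical):
```python
from sklearn.ensemble import GradientBoostingClassifier

def build_gb(trial):
    return GradientBoostingClassifier(
        criterion="friedman_mse",
        max_features="auto",  # as listed above; deprecated alias in newer sklearn
        learning_rate=trial.suggest_float("learning_rate", 1e-4, 1e1, log=True),
        max_depth=trial.suggest_int("max_depth", 2, 8),
        min_samples_leaf=trial.suggest_int("min_samples_leaf", 3, 5),
        min_samples_split=trial.suggest_int("min_samples_split", 3, 5),
        subsample=trial.suggest_float("subsample", 0.3, 0.7),
        n_estimators=trial.suggest_int("n_estimators", 50, 300),
        random_state=0,
    )
```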
## 5.0 XGBoost