---
title: Logistic regression model
description: My machine learning journey starts with the logistic regression model
tags: data-science, analytical-model, data-analytic, model, propensity-model, logistic-regression, machine-learning, sklearn, smote
---
# Logistic Regression Model
Related post:
- [Propensity model 101](/daZbO7I_Q1adbPL8XopVFw)
**Questions to answer:**
- What is it?
- What is it used for?
- What are its underlying assumptions?
- How to check multicollinearity between independent variables?
## Logistic Regression in Python
(Step by step)
**References:**
- Medium: [Building a logistic regression in python step by step](https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8)
(More on my Evernote)
- Descriptively check the relationship between the independent and dependent variables by visualization (a countplot for categorical frequencies, or a scatter plot/histogram for continuous variables) to get an overview of which variables might have large explanatory power for the target feature.
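Example (a minimal sketch; the DataFrame `data`, the 0/1 target column `y`, and the feature names `education` and `age` are assumptions borrowed from the bank-marketing dataset used in the referenced tutorial):
```python
import matplotlib.pyplot as plt
import seaborn as sns

# Frequency of each category of a categorical feature, split by the target
sns.countplot(x='education', hue='y', data=data)
plt.show()

# Distribution of a continuous feature for each target class
data[data['y'] == 0]['age'].hist(alpha=0.5, label='y = 0')
data[data['y'] == 1]['age'].hist(alpha=0.5, label='y = 1')
plt.xlabel('age')
plt.legend()
plt.show()
```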
- Create dummies variables for each categorical feature
Example:
```python
import pandas as pd

cat_vars = ['job', 'marital', 'education', 'default', 'housing', 'loan',
            'contact', 'month', 'day_of_week', 'poutcome']

# One-hot encode each categorical feature and join the dummy columns onto the data
for var in cat_vars:
    cat_list = pd.get_dummies(data[var], prefix=var)
    data = data.join(cat_list)

# Keep every column except the original (now encoded) categorical ones
data_vars = data.columns.values.tolist()
to_keep = [i for i in data_vars if i not in cat_vars]
```
- Check whether the dataset is imbalanced
Which leads us to:
### Imbalanced dataset
**Definition**: one category of the target feature dominates the other categories in frequency in your dataset.
**How to solve this problem**: two techniques
- Over-sampling method: SMOTE aka Synthetic Minority Over-sampling Technique
+ SMOTE creates synthetic samples (not duplicate samples) of the minority class, making the minority class equal in size to the majority class
+ SMOTE selects a similar record and alters it one column at a time by a random amount within the difference to the neighbouring records (original paper on SMOTE, published in 2002, [here](https://jair.org/index.php/jair/article/view/10302))
- Under-sampling method: NearMiss
+ Reduces the majority class to the same size as the minority class
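A minimal sketch of both techniques using the `imbalanced-learn` package (an assumption here; `X` and `y` stand for a feature matrix and binary target that are not defined in this note):
```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss

# How imbalanced is the target before resampling?
print('Original:', Counter(y))

# Over-sampling: synthesize new minority-class samples until the classes are balanced
X_smote, y_smote = SMOTE(random_state=0).fit_resample(X, y)
print('After SMOTE:', Counter(y_smote))

# Under-sampling: keep only the majority-class samples closest to the minority class
X_nm, y_nm = NearMiss().fit_resample(X, y)
print('After NearMiss:', Counter(y_nm))
```
In practice, resampling is usually applied only to the training split (after `train_test_split`), so the test set keeps the real class distribution.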
## Other questions
- What `random_state` is in `train_test_split(X, y, test_size=0.3, random_state=0)` (see the sketch after this list)
- Accuracy score, expressed as `logreg.score(X_test, y_test)` where `logreg = LogisticRegression()`
- Confusion matrix
- Compute precision, recall, F-measure, and support ([Source](http://scikit-learn.org/stable/index.html))
- ROC curve ([receiver operating characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic))
- Dotted line represents the ROC curve of a purely random classifier
- A good classifier stays as far away from that line as possible (toward the top-left corner)
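A rough sketch that ties the items above together with scikit-learn; `X` and `y` are assumed to be the prepared feature matrix and binary target:
```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# random_state fixes the shuffling seed so the same split is reproduced on every run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Accuracy on the held-out test set
print('Accuracy:', logreg.score(X_test, y_test))

# Confusion matrix plus precision, recall, F1 and support per class
y_pred = logreg.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# ROC curve: TPR against FPR; the dashed diagonal is the purely random classifier
y_prob = logreg.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_prob)
plt.plot(fpr, tpr, label='AUC = %.3f' % roc_auc_score(y_test, y_prob))
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
```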
Some important metrics for interpreting a classification model (a scikit-learn sketch follows this list):
- Classification accuracy or **accuracy score**: the ratio of the number of correct predictions to the total number of input samples
+ Accuracy = Number of correct predictions/Total number of predictions made
+ Accuracy score works well only if there are an equal number of samples in each class (i.e. the dataset is balanced)
+ Common pitfall with accuracy score: it easily gives a false sense of high performance on imbalanced data
- Logarithmic loss or **log loss**: works by penalizing false classifications, especially confident wrong predictions --> works well for multi-class classification
- Confusion matrix
- Area under the curve (AUC): used for binary classification problems
+ True Positive Rate (Sensitivity): TP/(TP+FN)
+ False Positive Rate (1 - Specificity): FP/(FP+TN)
+ AUC has a range of [0, 1], the greater the value, the better is the performance of our model
- F1 Score: is used to measure the test's accuracy, is the harmonic mean between precision and recall, with range of [0, 1]
+ F1 Score tells you both how precise your classifier is and how robust it is
+ The greater the F1 Score, the better is the performance of our model
+ F1 = 2 * 1/(1/precision + 1/recall)
- Recall: is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive)
+ Recall = True Positives/(True Positives + False Negatives)
- Precision: is the number of correct positive results divided by the number of positive results predicted by the classifier
+ Precision = True Positives/(True Positives + False Positives)
- Mean absolute error (MAE): the average of the absolute differences between the original values and the predicted values --> how far the predictions are from the actual output
- Mean squared error (MSE): similar to mean absolute error, but the differences are squared before averaging, so larger errors are penalized more
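Most of these metrics are available directly in `sklearn.metrics`. A minimal sketch, reusing the fitted `logreg` and the test split assumed in the earlier example:
```python
from sklearn.metrics import (accuracy_score, log_loss, precision_score, recall_score,
                             f1_score, mean_absolute_error, mean_squared_error)

y_pred = logreg.predict(X_test)        # hard class predictions
y_prob = logreg.predict_proba(X_test)  # predicted probabilities (needed for log loss)

print('Accuracy :', accuracy_score(y_test, y_pred))
print('Log loss :', log_loss(y_test, y_prob))         # penalizes confident wrong predictions
print('Precision:', precision_score(y_test, y_pred))  # TP / (TP + FP)
print('Recall   :', recall_score(y_test, y_pred))     # TP / (TP + FN)
print('F1 score :', f1_score(y_test, y_pred))         # harmonic mean of precision and recall

# Error metrics, more commonly used for regression problems
print('MAE:', mean_absolute_error(y_test, y_pred))
print('MSE:', mean_squared_error(y_test, y_pred))
```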
References:
- [Metrics to evaluate your machine learning algorithm](https://towardsdatascience.com/metrics-to-evaluate-your-machine-learning-algorithm-f10ba6e38234)
- [Accuracy, Precision, Recall and F1 score interpretation of performance measures](https://blog.exsilio.com/all/accuracy-precision-recall-f1-score-interpretation-of-performance-measures/)
- [Module sklearn.metrics](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics)
- [How to score probability predictions in Python](https://machinelearningmastery.com/how-to-score-probability-predictions-in-python/)
- [How to Make predictions with scikit-learn](https://machinelearningmastery.com/make-predictions-scikit-learn/)