---
title: Logistic regression model
description: My machine learning journey starts with the logistic regression model
tags: data-science, analytical-model, data-analytic, model, propensity-model, logistic-regression, machine-learning, sklearn, smote
---

# Logistic Regression Model

Related post:

- [Propensity model 101](/daZbO7I_Q1adbPL8XopVFw)

**Questions to answer:**

- What is it?
- What is it used for?
- What are its underlying assumptions?
- How to check multicollinearity between independent variables?

## Logistic Regression in Python (Step by step)

**References:**

- Medium: [Building a logistic regression in python step by step](https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8)
- (More on my Evernote)

**Steps:**

- Descriptively check the relationship between the independent variables and the dependent variable to get an overview of which variables might have large explanatory power for the target feature. Do this with visualizations: a countplot to check frequencies for categorical variables, or a scatter plot/histogram for continuous variables.
- Create dummy variables for each categorical feature. Example:

```python
import pandas as pd

# `data` is assumed to be a pandas DataFrame loaded earlier
# (e.g. the bank-marketing dataset used in the referenced article)
cat_vars = ['job', 'marital', 'education', 'default', 'housing', 'loan',
            'contact', 'month', 'day_of_week', 'poutcome']

# One-hot encode each categorical feature and join the dummy columns back on
for var in cat_vars:
    dummies = pd.get_dummies(data[var], prefix=var)
    data = data.join(dummies)

# Drop the original categorical columns, keeping the dummies and everything else
data_vars = data.columns.values.tolist()
to_keep = [col for col in data_vars if col not in cat_vars]
data_final = data[to_keep]
```

- Check whether the dataset is imbalanced.

Which leads us to:

### Imbalanced dataset

**Definition**: One category of the target feature dominates the other categories in frequency.

**How to solve this problem**: two techniques (a code sketch follows this list)

- Over-sampling method: SMOTE, aka Synthetic Minority Over-sampling Technique
    + SMOTE creates synthetic samples (not duplicate samples) of the minority class, making the minority class as frequent as the majority class
    + SMOTE selects a similar record and alters that record one column at a time by a random amount within the difference to the neighbouring records (original paper on SMOTE, published in 2002, [here](https://jair.org/index.php/jair/article/view/10302))
- Under-sampling method: NearMiss
    + Makes the majority class equal in size to the minority class
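A minimal sketch of both resampling techniques, assuming the `imbalanced-learn` package (`imblearn`) is installed and that `X_train`/`y_train` come from a train/test split like the one sketched at the end of this note; resampling should be applied to the training set only, never the test set:

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss

# Over-sampling: synthesize new minority-class rows until the classes are balanced
smote = SMOTE(random_state=0)
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

# Under-sampling: keep only the majority-class rows closest to the minority class
near_miss = NearMiss()
X_train_nm, y_train_nm = near_miss.fit_resample(X_train, y_train)
```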
## Other questions

- What `random_state` is in `train_test_split(X, y, test_size=0.3, random_state=0)`
- The model's accuracy score, expressed as `logreg.score(X_test, y_test)` where `logreg = LogisticRegression()`
- Confusion matrix
- Compute precision, recall, F-measure, and support ([Source](http://scikit-learn.org/stable/index.html))
- ROC curve ([receiver operating characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic))
    + The dotted line represents the ROC curve of a purely random classifier
    + A good classifier stays as far away from that line as possible (toward the top-left corner)

Some important metrics for interpreting the model (an end-to-end sketch covering several of them follows the reference list below):

- Classification accuracy or **accuracy score**: the ratio of the number of correct predictions to the total number of input samples
    + Accuracy = Number of correct predictions / Total number of predictions made
    + The accuracy score works well only if there is an equal number of samples in each class (i.e. the sample is balanced)
    + Common mistake with the accuracy score: it easily gives a false sense of achieving high accuracy
- Logarithmic loss or **log loss**: works by penalizing false classifications --> works well for multi-class classification
- Confusion matrix
- Area under the curve (AUC): used for binary classification problems
    + True Positive Rate (Sensitivity): TP/(TP+FN)
    + False Positive Rate (1 - Specificity): FP/(FP+TN)
    + AUC has a range of [0, 1]; the greater the value, the better the performance of the model
- F1 score: measures the test's accuracy; it is the harmonic mean of precision and recall, with a range of [0, 1]
    + The F1 score reflects how precise your classifier is as well as how robust it is
    + The greater the F1 score, the better the performance of the model
    + F1 = 2 * 1/(1/precision + 1/recall)
- Recall: the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive)
    + Recall = True Positives/(True Positives + False Negatives)
- Precision: the number of correct positive results divided by the number of positive results predicted by the classifier
    + Precision = True Positives/(True Positives + False Positives)
- Mean absolute error (MAE): the average of the absolute differences between the original values and the predicted values --> how far the predictions are from the actual output
- Mean squared error (MSE): similar to mean absolute error, but the differences are squared

References:

- [Metrics to evaluate your machine learning algorithm](https://towardsdatascience.com/metrics-to-evaluate-your-machine-learning-algorithm-f10ba6e38234)
- [Accuracy, Precision, Recall and F1 score: interpretation of performance measures](https://blog.exsilio.com/all/accuracy-precision-recall-f1-score-interpretation-of-performance-measures/)
- [Module sklearn.metrics](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics)
- [How to score probability predictions in Python](https://machinelearningmastery.com/how-to-score-probability-predictions-in-python/)
- [How to make predictions with scikit-learn](https://machinelearningmastery.com/make-predictions-scikit-learn/)
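To tie the pieces above together, here is a minimal end-to-end sketch, assuming `X` and `y` are the prepared feature matrix and binary target from the preprocessing steps (variable names are illustrative):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

# random_state fixes the shuffling seed so the split (and the scores) are reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# Accuracy on the held-out set (can be misleading on imbalanced data, see above)
print('Accuracy:', logreg.score(X_test, y_test))

y_pred = logreg.predict(X_test)
print(confusion_matrix(y_test, y_pred))        # rows = actual classes, columns = predicted classes
print(classification_report(y_test, y_pred))   # precision, recall, F1 score, and support per class

# AUC is computed from predicted probabilities of the positive class, not hard labels
y_prob = logreg.predict_proba(X_test)[:, 1]
print('ROC AUC:', roc_auc_score(y_test, y_prob))
```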