Recap

Basics of scikit-learn

  1. Estimator: The core of scikit-learn's interface. Any object that can estimate some parameters based on a dataset is called an estimator. For example, an estimator might be a classifier or a regressor. The estimation itself is performed by the fit() method.

  2. Predictor: A type of estimator that, once trained, is able to make predictions on new, unseen data. Predictors implement a predict() method. Most classifiers and regressors in scikit-learn are predictors.

  3. Transformer: A type of estimator that can change or "transform" the data. A transformer might clean or preprocess the data in some way. It has a transform() method and often also a fit() method if the transformation needs to learn from the data (e.g., StandardScaler).

  4. Model: In scikit-learn, once an estimator (like a classifier or regressor) has been trained using the fit() method, it becomes a model. The score() method can then be used to evaluate how well the model performs on a given test dataset.

  5. Pipeline: A way to streamline routine machine-learning workflows. A Pipeline bundles together a sequence of data processing steps and a final modeling step, where each step is represented by a (name, estimator) tuple. Every step except the last must be a transformer; the last step can be an estimator of any type (transformer, predictor, etc.), as in the sketch below.
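
A minimal sketch tying these pieces together (the toy dataset below is made up purely for illustration): a StandardScaler transformer and a LogisticRegression predictor chained in a Pipeline.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy data, for illustration only
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each step is a (name, estimator) tuple; all but the last must be transformers
pipe = Pipeline([
    ("scaler", StandardScaler()),   # transformer: fit() + transform()
    ("clf", LogisticRegression()),  # predictor: fit() + predict()
])

pipe.fit(X_train, y_train)          # fits the scaler, then the classifier
print(pipe.predict(X_test)[:5])     # predictions on new, unseen data
print(pipe.score(X_test, y_test))   # the fitted pipeline behaves as a model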

| Term | Definition | Key Methods | Typical Uses |
|---|---|---|---|
| Estimator | Any object that can estimate parameters based on a dataset. | `fit()` | Classification, regression |
| Predictor | An estimator that makes predictions on new, unseen data. | `fit()`, `predict()` | Making predictions on new data |
| Transformer | An object that can transform a dataset; often used for data preprocessing or feature engineering. | `fit()`, `transform()` | Data preprocessing, feature engineering |
| Model | An estimator that has been trained on data and can be evaluated on how well it performs. | `fit()`, `predict()`, `score()` | Performance evaluation, testing, validation |
| Pipeline | Bundles a sequence of data processing steps and a final modeling step. | `fit()`, `transform()`, `predict()` | Sequencing multiple steps in modeling |

See here or here for more in-depth discussion.

Linear regression

The table below compares linear regression in scikit-learn and statsmodels, including how each library handles interaction terms, polynomial features, and categorical variables.

| Feature/Aspect | scikit-learn | statsmodels |
|---|---|---|
| Library Import | `from sklearn.linear_model import LinearRegression` | `import statsmodels.api as sm`; `from statsmodels.formula.api import ols` (for formula-based interface) |
| Model Creation | `model = LinearRegression()` | `model = sm.OLS(y, X)` or `model = ols("y ~ X", data)` (formula-based interface) |
| Fitting Model | `model.fit(X, y)` | `results = model.fit()` |
| Prediction | `predictions = model.predict(X_new)` | `predictions = results.predict(X_new)` |
| Coefficients & Intercept | `model.coef_`, `model.intercept_` | `results.params` |
| Model Summary | Not provided directly | `print(results.summary())` |
| Residuals | Computed as `y - predictions` | `results.resid` |
| R-squared Value | `model.score(X, y)` | `results.rsquared` |
| Hypothesis Testing | Not provided directly | Available in `results.summary()` |
| Model Diagnostics | Basic metrics; advanced diagnostics need manual computation | Extensive diagnostics provided |
| Intercept Handling | `LinearRegression` fits an intercept by default | Manually add a constant with `sm.add_constant(X)` |
| Order of X and y | X first, then y | y first, then X |
| Formula-based Approach | Not supported | Supported: `ols("y ~ X1 + X2", data).fit()` |
| Formula Intercept | N/A | Intercept included automatically; use `- 1` in the formula to exclude it |
| Interaction Terms | Created manually (e.g., `X['interaction'] = X['col1'] * X['col2']`) | Supported in formula: `ols("y ~ X1*X2", data).fit()` |
| Polynomial Features | Use `from sklearn.preprocessing import PolynomialFeatures` | Create manually or use `I()` in the formula: `ols("y ~ I(X1**2)", data).fit()` |
| Categorical Variables | One-hot encode with `from sklearn.preprocessing import OneHotEncoder` | Handled automatically in formula: `ols("y ~ C(category_col)", data).fit()` |
| Main Focus | Prediction | Statistical inference |

The table shows how to handle interaction terms, polynomial features, and categorical variables in both libraries.
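
As an illustrative sketch only (the data frame and the column names x1, x2, cat, y are hypothetical), the same regression with interaction, polynomial, and categorical terms might look like this in both libraries:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures

# Hypothetical data, for illustration only
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x1": rng.normal(size=100),
    "x2": rng.normal(size=100),
    "cat": rng.choice(["a", "b", "c"], size=100),
})
df["y"] = 1 + 2 * df["x1"] - df["x2"] + rng.normal(size=100)

# statsmodels: the formula handles interaction, polynomial, and categorical terms
res = smf.ols("y ~ x1*x2 + I(x1**2) + C(cat)", data=df).fit()
print(res.summary())

# scikit-learn: the same terms are built explicitly before fitting
pre = ColumnTransformer([
    ("poly", PolynomialFeatures(degree=2, include_bias=False), ["x1", "x2"]),  # squares + interaction
    ("onehot", OneHotEncoder(drop="first"), ["cat"]),                          # dummy-coded categories
])
model = Pipeline([("pre", pre), ("ols", LinearRegression())])
model.fit(df[["x1", "x2", "cat"]], df["y"])
print(model.score(df[["x1", "x2", "cat"]], df["y"]))  # R^2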

Logistic regression

| Feature/Aspect | scikit-learn (`LogisticRegression`) | statsmodels (GLM with Binomial family) |
|---|---|---|
| Library Import | `from sklearn.linear_model import LogisticRegression` | `import statsmodels.api as sm`; `from statsmodels.formula.api import glm` (for formula-based interface) |
| Model Creation | `model = LogisticRegression(C=1e9)` (to approximate no regularization) | `model = sm.GLM(y, X, family=sm.families.Binomial())` or `model = glm("y ~ X", data, family=sm.families.Binomial())` (formula-based interface) |
| Fitting Model | `model.fit(X, y)` | `results = model.fit()` |
| Prediction (probabilities) | `probs = model.predict_proba(X_new)[:, 1]` | `probs = results.predict(X_new)` |
| Class Prediction | `classes = model.predict(X_new)` | `classes = (probs > 0.5).astype(int)` |
| Coefficients & Intercept | `model.coef_`, `model.intercept_` | `results.params` |
| Model Summary | Not provided directly | `print(results.summary())` |
| Deviance, AIC, BIC | Not provided directly; must be computed manually | Available in `results.summary()` |
| Intercept Handling | Fits an intercept by default | Manually add a constant with `sm.add_constant(X)` |
| Order of X and y | X first, then y | y first, then X |
| Formula-based Approach | Not supported | Supported: `glm("y ~ X1 + X2", data, family=sm.families.Binomial()).fit()` |
| Interaction Terms | Created manually | Formula: automatic with `glm("y ~ X1*X2", data, family=sm.families.Binomial()).fit()`; non-formula: manual creation needed |
| Formula Intercept | N/A | Intercept included automatically; use `- 1` in the formula to exclude it |
| Polynomial Features | Use `from sklearn.preprocessing import PolynomialFeatures` | Formula: `glm("y ~ I(X1**2)", data, family=sm.families.Binomial()).fit()`; non-formula: manual creation needed |
| Categorical Variables | One-hot encode with `from sklearn.preprocessing import OneHotEncoder` | Formula: automatic with `glm("y ~ C(category_col)", data, family=sm.families.Binomial()).fit()`; non-formula: manual preprocessing needed |
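
A minimal sketch of the two interfaces side by side (the toy data and the 0.5 threshold are assumptions made for illustration):

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LogisticRegression

# Toy data, for illustration only
rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = (df["x1"] + 0.5 * df["x2"] + rng.normal(size=200) > 0).astype(int)

# scikit-learn: a very large C approximates an unregularized fit
clf = LogisticRegression(C=1e9).fit(df[["x1", "x2"]], df["y"])
probs_sk = clf.predict_proba(df[["x1", "x2"]])[:, 1]
classes_sk = clf.predict(df[["x1", "x2"]])

# statsmodels: GLM with a Binomial family via the formula interface
res = smf.glm("y ~ x1 + x2", data=df, family=sm.families.Binomial()).fit()
probs_sm = res.predict(df)
classes_sm = (probs_sm > 0.5).astype(int)

print(clf.coef_, clf.intercept_)   # scikit-learn coefficients
print(res.params)                  # statsmodels coefficients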

Diagnostic plots and formula terms for handling nonideal relationships
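
One common workflow, sketched under made-up assumptions (a single predictor x with a quadratic effect): plot residuals against fitted values, and if they show curvature, add a nonideal term such as I(x**2) through the formula.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Synthetic data with a quadratic effect, for illustration only
rng = np.random.default_rng(2)
df = pd.DataFrame({"x": rng.uniform(-3, 3, size=200)})
df["y"] = 1 + 2 * df["x"] + 1.5 * df["x"] ** 2 + rng.normal(size=200)

# A purely linear fit leaves curvature in the residual plot
lin = smf.ols("y ~ x", data=df).fit()
plt.scatter(lin.fittedvalues, lin.resid)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Adding the quadratic term through the formula removes the pattern
quad = smf.ols("y ~ x + I(x**2)", data=df).fit()
print(quad.summary())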

Regularization

| Feature/Aspect | Logistic Regression | Lasso (L1) / Ridge (L2) |
|---|---|---|
| Purpose | Classification | Regression |
| Library/Module | `from sklearn.linear_model import LogisticRegression` | `from sklearn.linear_model import Lasso, Ridge` |
| Regularization | Supports L1, L2, and ElasticNet (combination of L1 & L2) | Lasso: L1 regularization; Ridge: L2 regularization |
| Parameter for Regularization | `C` (inverse of regularization strength) | `alpha` (regularization strength) |
| Loss Function | Cross-entropy loss plus the chosen penalty | Mean squared error plus the penalty (L1 for Lasso, L2 for Ridge) |
| Feature Selection | The L1 penalty can drive some coefficients to exactly zero | Lasso can drive some coefficients to exactly zero |
| Hyperparameter Tuning | Mostly `C` and `penalty` (`l1`, `l2`, `elasticnet`) | Lasso: mainly `alpha`; Ridge: `alpha` |
| Solver Options | Supports multiple solvers such as `liblinear`, `saga`, etc. | Lasso: generally coordinate descent; Ridge: closed-form solution or iterative solvers |
| Use Case | Binary/multi-class classification problems | Regression problems where you want to incorporate regularization |
| Cross-Validation | Use `LogisticRegressionCV`; parameters include `Cs` (list of regularization strengths to try), `cv` (number of folds), and `penalty` | Use `LassoCV` for Lasso and `RidgeCV` for Ridge; both accept `alphas` (list of regularization strengths to try) and `cv` (number of folds) |
| Common Pitfalls for Newbies | 1. Not scaling features can heavily impact performance. 2. Over-relying on default hyperparameters. 3. Misinterpreting the `C` parameter (smaller `C` means stronger regularization). | 1. Not scaling features, which can greatly affect the regularization's effectiveness. 2. Ignoring multicollinearity, which Ridge can handle but Lasso may struggle with. 3. Not checking whether important features are removed entirely by Lasso. |
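
A brief sketch of the cross-validated variants with feature scaling in a pipeline (the toy datasets and the alpha/C grids are arbitrary choices for illustration):

import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LassoCV, LogisticRegressionCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, for illustration only
Xr, yr = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
Xc, yc = make_classification(n_samples=200, n_features=10, random_state=0)

# Scale first so the penalty treats all coefficients comparably
lasso = make_pipeline(StandardScaler(), LassoCV(alphas=np.logspace(-3, 1, 20), cv=5))
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 20), cv=5))
logit = make_pipeline(StandardScaler(), LogisticRegressionCV(Cs=10, cv=5, penalty="l2"))

lasso.fit(Xr, yr)
ridge.fit(Xr, yr)
logit.fit(Xc, yc)

# Regularization strengths chosen by cross-validation
print(lasso[-1].alpha_, ridge[-1].alpha_, logit[-1].C_)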

Cross validation

scikit-learn provides several functions and classes for cross-validation.

1. cross_val_score

Purpose: Computes the score for each CV split. It is a quick utility function for getting a feel for a model's performance, but it supports only a single scoring metric.

Usage:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# Create a linear regression model
model = LinearRegression()

# Use cross_val_score to get R^2 scores from 5-fold CV
scores = cross_val_score(model, X, y, cv=5)

2. cross_validate

Purpose: This is an extension of cross_val_score. It allows you to specify multiple metrics for evaluation and can also return train scores, fit times, and score times.

Usage:

from sklearn.model_selection import cross_validate
from sklearn.linear_model import LinearRegression

# Create a linear regression model
model = LinearRegression()

# Use cross_validate to get training and testing scores using 5-fold CV
results = cross_validate(model, X, y, cv=5, return_train_score=True, scoring=['r2', 'neg_mean_squared_error'])

3. Cross-validation Iterators

Purpose: These are classes that generate indices to split data into train/test sets. They can be used with any custom loop or framework, giving you more flexibility over the cross-validation process.

Usage:

from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

# Create a 5-fold cross-validation iterator
kf = KFold(n_splits=5, shuffle=True, random_state=2023)

# Create a linear regression model
model = LinearRegression()

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    print(score)

Summary:

  • cross_val_score is a quick and simple function to get model scores over multiple CV splits.
  • cross_validate is a more comprehensive function that can provide multiple metrics, train scores, and time metrics.
  • Cross-validation iterators like KFold give you flexibility and control over the CV process, allowing you to integrate it with any custom framework or loop.
  • You can use different metrics for scoring; see the sketch below.
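
For example, a sketch of switching the scoring metric, using either a built-in scorer name or a custom scorer built with make_scorer (the toy data and metric choices are arbitrary):

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import cross_val_score

# Toy data, for illustration only
X, y = make_regression(n_samples=100, n_features=3, noise=1.0, random_state=0)
model = LinearRegression()

# Built-in scorer selected by name
mse_scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")

# Custom scorer built from any metric function
mae_scorer = make_scorer(mean_absolute_error, greater_is_better=False)
mae_scores = cross_val_score(model, X, y, cv=5, scoring=mae_scorer)

print(mse_scores.mean(), mae_scores.mean())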

Lectures

Basics

Libraries

Pandas and matplotlib

Sklearn

Statsmodels

Overview

As data scientists, we know that computers are great at aiding in repetitive tasks.

  • We have a vast range of tools at our fingertips that make us more productive and let us solve more complex problems in any computer-related work.
  • Yet many of us use only a tiny fraction of those tools. In this mini-course, I will do my best to help you become familiar with the kinds of tools that may be useful in your research.

source: https://www.goodreads.com/book/show/29437996-copying-and-pasting-from-stack-overflow

source: https://twitter.com/DataChaz/status/1642892653124624390/photo/1

Search tips