sklearn
Estimator: The core of scikit-learn's interface. Any object that can estimate some parameters based on a dataset is called an estimator; for example, an estimator might be a classifier or a regressor. The estimation itself is performed by the fit() method.

Predictor: A type of estimator that, once trained, is able to make predictions on new, unseen data. Predictors implement a predict() method. Most classifiers and regressors in scikit-learn are predictors.

Transformer: A type of estimator that can change or "transform" the data, for example by cleaning or preprocessing it. It has a transform() method and usually also a fit() method, since the transformation often needs to learn from the data (e.g., StandardScaler).

Model: In scikit-learn, once an estimator (such as a classifier or regressor) has been trained using the fit() method, it becomes a model. The score() method can then be used to evaluate how well the model performs on a given test dataset.

Pipeline: A way to streamline many of the routine steps in machine learning. A Pipeline bundles together a sequence of data processing steps and a final modeling step, where each step is a (name, transformer/predictor object) tuple. Every intermediate step must be a transformer; the last step can be of any type (transformer or predictor).
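A minimal sketch tying these terms together: StandardScaler is a transformer, LogisticRegression is a predictor, and a Pipeline bundles them into a single trained model (the iris dataset here is just an illustrative choice).

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),    # transformer: fit() + transform()
    ("clf", LogisticRegression()),   # predictor: fit() + predict()
])

pipe.fit(X, y)               # after fit(), the pipeline is a trained model
print(pipe.predict(X[:5]))   # predictions (here on the training data, for brevity)
print(pipe.score(X, y))      # score() evaluates the trained model
```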
Term | Definition | Key Methods | Typical Uses |
---|---|---|---|
Estimator | Any object that can estimate parameters based on a dataset. | fit() | Classification, Regression |
Predictor | An estimator that makes predictions on new, unseen data. | fit(), predict() | Making predictions on new data |
Transformer | An object that can transform a dataset. Often used for data preprocessing or feature engineering. | fit(), transform() | Data preprocessing, Feature engineering |
Model | A specific instance of an estimator that has been trained on data. Often used to evaluate how well it performs. | fit(), predict(), score() | Performance evaluation, Testing, Validation |
Pipeline | Streamlines processes by bundling together a sequence of data processing steps and modeling. | fit(), transform(), predict() | Sequencing multiple steps in modeling |
See here or here for more in-depth discussion.
Let's expand the comparison to include the handling of interaction terms, polynomial features, and categorical variables in both scikit-learn and statsmodels.
Feature/Aspect | scikit-learn | statsmodels |
---|---|---|
Library Import | from sklearn.linear_model import LinearRegression | import statsmodels.api as sm; from statsmodels.formula.api import ols (for formula-based interface) |
Model Creation | model = LinearRegression() | model = sm.OLS(y, X) or model = ols("y ~ X", data) (for formula-based interface) |
Fitting Model | model.fit(X, y) | results = model.fit() |
Prediction | predictions = model.predict(X_new) | predictions = results.predict(X_new) |
Coefficients & Intercept | model.coef_, model.intercept_ | results.params |
Model Summary | Not provided directly | print(results.summary()) |
Residuals | Computed as y - predictions | results.resid |
R-squared Value | model.score(X, y) | results.rsquared |
Hypothesis Testing | Not provided directly | Available in results.summary() |
Model Diagnostics | Basic metrics; advanced diagnostics need manual computation. | Extensive diagnostics provided |
Intercept Handling | LinearRegression fits an intercept by default. | Manually add a constant with sm.add_constant(X) |
Order of X and y | X first, then y | y first, then X |
Formula-based Approach | Not supported | Supported: ols("y ~ X1 + X2", data).fit() |
Formula Intercept | N/A | Automatically includes an intercept; use - 1 in the formula to exclude it. |
Interaction Terms | Need manual creation (e.g., X['interaction'] = X['col1'] * X['col2']) | Supported in formula: ols("y ~ X1*X2", data).fit() |
Polynomial Features | Use from sklearn.preprocessing import PolynomialFeatures | Manually create or use I() in formula: ols("y ~ I(X1**2)", data).fit() |
Categorical Variables | Need one-hot encoding using from sklearn.preprocessing import OneHotEncoder | Automatic handling in formula: ols("y ~ C(category_col)", data).fit() |
Main Focus | Prediction | Statistical inference |
This table includes the additional functionality and provides a clearer perspective on how to handle these features in both libraries.
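As a minimal side-by-side sketch of the two workflows (the DataFrame df with columns y, X1, X2, and cat is made up purely for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sklearn.linear_model import LinearRegression

# Made-up data purely for illustration
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "X1": rng.normal(size=100),
    "X2": rng.normal(size=100),
    "cat": rng.choice(["a", "b"], size=100),
})
df["y"] = 1.0 + 2.0 * df["X1"] - 0.5 * df["X2"] + rng.normal(size=100)

# scikit-learn: X first, then y; the intercept is fitted by default
skl = LinearRegression().fit(df[["X1", "X2"]], df["y"])
print(skl.intercept_, skl.coef_)

# statsmodels array interface: y first, then X; add the constant manually
X_const = sm.add_constant(df[["X1", "X2"]])
results = sm.OLS(df["y"], X_const).fit()
print(results.params)

# statsmodels formula interface: interaction, polynomial, and categorical terms
results_f = ols("y ~ X1 * X2 + I(X1**2) + C(cat)", data=df).fit()
print(results_f.summary())
```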
Feature/Aspect | scikit-learn (LogisticRegression) | statsmodels (GLM with Binomial family) |
---|---|---|
Library Import | from sklearn.linear_model import LogisticRegression | import statsmodels.api as sm; from statsmodels.formula.api import glm (for formula-based interface) |
Model Creation | model = LogisticRegression(C=1e9) (to approximate no regularization) | model = sm.GLM(y, X, family=sm.families.Binomial()) or model = glm("y ~ X", data, family=sm.families.Binomial()) (for formula-based interface) |
Fitting Model | model.fit(X, y) | results = model.fit() |
Prediction (probabilities) | probs = model.predict_proba(X_new)[:,1] | probs = results.predict(X_new) |
Class Prediction | classes = model.predict(X_new) | classes = (probs > 0.5).astype(int) |
Coefficients & Intercept | model.coef_, model.intercept_ | results.params |
Model Summary | Not provided directly | print(results.summary()) |
Deviance, AIC, BIC | Not provided directly; must be computed manually | Available in results.summary() |
Intercept Handling | Fits an intercept by default. | Manually add a constant using sm.add_constant(X) |
Order of X and y | X first, then y | y first, then X |
Formula-based Approach | Not supported | Supported: glm("y ~ X1 + X2", data, family=sm.families.Binomial()).fit() |
Interaction Terms | Need manual creation | Formula-based: automatic with glm("y ~ X1*X2", data, family=sm.families.Binomial()).fit(); for the non-formula interface, manual creation is needed. |
Formula Intercept | N/A | Automatically includes an intercept; use - 1 in the formula to exclude it. |
Polynomial Features | Use from sklearn.preprocessing import PolynomialFeatures | Formula-based: supported with glm("y ~ I(X1**2)", data, family=sm.families.Binomial()).fit(); for the non-formula interface, manual creation is needed. |
Categorical Variables | Need one-hot encoding using from sklearn.preprocessing import OneHotEncoder | Formula-based: automatic with glm("y ~ C(category_col)", data, family=sm.families.Binomial()).fit(); for the non-formula interface, manual preprocessing is needed. |
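A parallel sketch for logistic regression, again on made-up data, comparing LogisticRegression with a GLM using a Binomial family:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import glm
from sklearn.linear_model import LogisticRegression

# Made-up binary-outcome data purely for illustration
rng = np.random.default_rng(1)
df = pd.DataFrame({"X1": rng.normal(size=200), "X2": rng.normal(size=200)})
p = 1 / (1 + np.exp(-(0.5 + 1.5 * df["X1"] - df["X2"])))
df["y"] = rng.binomial(1, p)

# scikit-learn: a very large C approximates an unregularized fit
clf = LogisticRegression(C=1e9).fit(df[["X1", "X2"]], df["y"])
probs_skl = clf.predict_proba(df[["X1", "X2"]])[:, 1]

# statsmodels GLM (array interface): add the constant manually
X_const = sm.add_constant(df[["X1", "X2"]])
results = sm.GLM(df["y"], X_const, family=sm.families.Binomial()).fit()
probs_sm = results.predict(X_const)
classes_sm = (probs_sm > 0.5).astype(int)

# statsmodels formula interface
results_f = glm("y ~ X1 + X2", data=df, family=sm.families.Binomial()).fit()
print(results_f.summary())
```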
Feature/Aspect | Logistic Regression | Lasso (L1) / Ridge (L2) |
---|---|---|
Purpose | Classification | Regression |
Library/Module | from sklearn.linear_model import LogisticRegression | from sklearn.linear_model import Lasso, Ridge |
Regularization | Supports L1, L2, and ElasticNet (combination of L1 & L2) | Lasso: L1 regularization; Ridge: L2 regularization |
Parameter for Regularization | C (inverse of regularization strength) | alpha (regularization strength) |
Loss Function | Cross-entropy loss (plus the penalty term) | Both minimize mean squared error; Lasso adds an L1 penalty, Ridge an L2 penalty |
Feature Selection | L1 can lead to feature selection | Lasso can lead to feature selection |
Hyperparameter Tuning | Mostly C and penalty (l1, l2, elasticnet) | For Lasso: mainly alpha; for Ridge: alpha |
Solver Options | Supports multiple solvers such as liblinear, saga, etc. | Lasso generally uses coordinate descent; Ridge uses a closed-form solution or iterative solvers |
Use Case | When you have a binary/multi-class classification problem | When you have a regression problem and want to incorporate regularization |
Cross-Validation | Use LogisticRegressionCV for cross-validation. Parameters include Cs (list of regularization strengths to try), cv (number of cross-validation folds), and penalty. | Use LassoCV for Lasso and RidgeCV for Ridge. Both allow you to specify alphas (list of regularization strengths to try) and cv (number of cross-validation folds). |
Common Pitfalls for Newbies | 1. Not scaling features can heavily impact performance. 2. Over-relying on default hyperparameters. 3. Misinterpreting the C parameter (smaller C means stronger regularization). | 1. Not scaling features, which can greatly affect the regularization's effectiveness. 2. Ignoring multicollinearity, which Ridge can handle but Lasso may struggle with. 3. Not checking whether important features were removed entirely by Lasso. |
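A short sketch of these estimators and their cross-validated counterparts on synthetic data (the alpha and Cs grids are arbitrary illustrative choices); note the scaling step, which matters for regularization:

```python
import numpy as np
from sklearn.datasets import make_regression, make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso, Ridge, LassoCV, RidgeCV, LogisticRegressionCV

# Synthetic regression data; scaling matters for both Lasso and Ridge
Xr, yr = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
Xr = StandardScaler().fit_transform(Xr)

lasso = Lasso(alpha=0.1).fit(Xr, yr)   # larger alpha => stronger penalty
ridge = Ridge(alpha=1.0).fit(Xr, yr)
print("coefficients zeroed by Lasso:", np.sum(lasso.coef_ == 0))

# Cross-validated choice of alpha
lasso_cv = LassoCV(alphas=[0.01, 0.1, 1.0], cv=5).fit(Xr, yr)
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0], cv=5).fit(Xr, yr)

# Synthetic classification data; smaller C => stronger regularization
Xc, yc = make_classification(n_samples=200, n_features=10, random_state=0)
Xc = StandardScaler().fit_transform(Xc)
log_cv = LogisticRegressionCV(Cs=[0.01, 0.1, 1.0, 10.0], cv=5, penalty="l2").fit(Xc, yc)

print(lasso_cv.alpha_, ridge_cv.alpha_, log_cv.C_)
```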
In scikit-learn, there are functions and classes that deal with cross-validation.

cross_val_score
Purpose: Computes the score for each CV split. It is a quick utility function for getting a feel for a model's performance and only allows a single metric.
Usage:

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# Create a linear regression model
model = LinearRegression()

# Use cross_val_score to get R^2 scores from 5-fold CV
scores = cross_val_score(model, X, y, cv=5)
```
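The return value is a plain array with one score per fold, which you can summarize yourself; a different (single) metric can be requested through the scoring argument:

```python
print(scores)                        # one R^2 value per fold
print(scores.mean(), scores.std())   # summarize across folds

# A different single metric via `scoring`
mse_scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
```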
cross_validate
Purpose: This is an extension of cross_val_score. It allows you to specify multiple metrics for evaluation and can also return train scores, fit times, and score times.

Usage:
```python
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LinearRegression

# Create a linear regression model
model = LinearRegression()

# Use cross_validate to get training and testing scores using 5-fold CV
results = cross_validate(model, X, y, cv=5, return_train_score=True,
                         scoring=['r2', 'neg_mean_squared_error'])
```
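The returned results object is a dictionary of per-fold arrays, keyed by the requested metrics plus timing information:

```python
print(results["test_r2"])                      # test-set R^2 per fold
print(results["test_neg_mean_squared_error"])  # test-set (negated) MSE per fold
print(results["train_r2"])                     # train scores (return_train_score=True)
print(results["fit_time"], results["score_time"])
```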
KFold and other splitter classes
Purpose: These are classes that generate indices to split data into train/test sets. They can be used with any custom loop or framework, giving you more flexibility over the cross-validation process.

Usage:
```python
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

# Create a 5-fold cross-validation iterator
kf = KFold(n_splits=5, shuffle=True, random_state=2023)

# Create a linear regression model
model = LinearRegression()

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    print(score)
```
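To get a single summary comparable to cross_val_score, you can collect the per-fold scores and average them; a small sketch, assuming X and y are NumPy arrays as above:

```python
import numpy as np

fold_scores = []
for train_index, test_index in kf.split(X):
    model.fit(X[train_index], y[train_index])
    fold_scores.append(model.score(X[test_index], y[test_index]))
print(np.mean(fold_scores), np.std(fold_scores))
```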
Summary:

- cross_val_score is a quick and simple function to get model scores over multiple CV splits.
- cross_validate is a more comprehensive function that can provide multiple metrics, train scores, and time metrics.
- KFold and the other splitter classes give you flexibility and control over the CV process, allowing you to integrate it with any custom framework or loop.

As data scientists, we know that computers are great at aiding in repetitive tasks.

source: https://www.goodreads.com/book/show/29437996-copying-and-pasting-from-stack-overflow
source: https://twitter.com/DataChaz/status/1642892653124624390/photo/1