sklearn
Estimator: The core of scikit-learn's interface. Any object that can estimate some parameters based on a dataset is called an estimator; for example, an estimator might be a classifier or a regressor. The estimation itself is performed by the fit() method.

Predictor: A type of estimator that, once trained, is able to make predictions on new, unseen data. Predictors implement a predict() method. Most classifiers and regressors in scikit-learn are predictors.

Transformer: A type of estimator that can change or "transform" the data, for example by cleaning or preprocessing it. It has a transform() method and usually also a fit() method, since the transformation often needs to learn from the data (e.g., StandardScaler).

Model: In scikit-learn, once an estimator (such as a classifier or regressor) has been trained using the fit() method, it becomes a model. The score() method can then be used to evaluate how well the model performs on a given test dataset.

Pipeline: A way to streamline many of the routine steps in machine learning. A Pipeline bundles together a sequence of data processing steps and a final modeling step, where each step is a (name, transformer/predictor object) tuple. Every intermediate step must be a transformer; the last step can be of any type (transformer or predictor).
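A minimal sketch tying these terms together: StandardScaler is a transformer, LogisticRegression is a predictor, and a Pipeline bundles them into a single trained model (the iris dataset here is just an illustrative choice).

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),    # transformer: fit() + transform()
    ("clf", LogisticRegression()),   # predictor: fit() + predict()
])

pipe.fit(X, y)               # after fit(), the pipeline is a trained model
print(pipe.predict(X[:5]))   # predictions (here on the training data, for brevity)
print(pipe.score(X, y))      # score() evaluates the trained model
```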
Term | Definition | Key Methods | Typical Uses |
---|---|---|---|
Estimator | Any object that can estimate parameters based on a dataset. | fit() | Classification, Regression |
Predictor | An estimator that makes predictions on new, unseen data. | fit(), predict() | Making predictions on new data |
Transformer | An object that can transform a dataset. Often used for data preprocessing or feature engineering. | fit(), transform() | Data preprocessing, Feature engineering |
Model | A specific instance of an estimator that has been trained on data. Often used to evaluate how well it performs. | fit(), predict(), score() | Performance evaluation, Testing, Validation |
Pipeline | Streamlines processes by bundling together a sequence of data processing steps and modeling. | fit(), transform(), predict() | Sequencing multiple steps in modeling |
See here or here for more in-depth discussion.
Let's expand the comparison to include the handling of interaction terms, polynomial features, and categorical variables in both scikit-learn and statsmodels.
Feature/Aspect | scikit-learn | statsmodels |
---|---|---|
Library Import | from sklearn.linear_model import LinearRegression | import statsmodels.api as sm; from statsmodels.formula.api import ols (for formula-based interface) |
Model Creation | model = LinearRegression() | model = sm.OLS(y, X) or model = ols("y ~ X", data) (for formula-based interface) |
Fitting Model | model.fit(X, y) | results = model.fit() |
Prediction | predictions = model.predict(X_new) | predictions = results.predict(X_new) |
Coefficients & Intercept | model.coef_, model.intercept_ | results.params |
Model Summary | Not provided directly | print(results.summary()) |
Residuals | Computed as y - predictions | results.resid |
R-squared Value | model.score(X, y) | results.rsquared |
Hypothesis Testing | Not provided directly | Available in results.summary() |
Model Diagnostics | Basic metrics; advanced diagnostics need manual computation. | Extensive diagnostics provided |
Intercept Handling | LinearRegression fits an intercept by default. | Manually add a constant with sm.add_constant(X) |
Order of X and y | X first, then y | y first, then X |
Formula-based Approach | Not supported | Supported: ols("y ~ X1 + X2", data).fit() |
Formula Intercept | N/A | Automatically includes an intercept; use - 1 in the formula to exclude it. |
Interaction Terms | Need manual creation (e.g., X['interaction'] = X['col1'] * X['col2']) | Supported in formula: ols("y ~ X1*X2", data).fit() |
Polynomial Features | Use from sklearn.preprocessing import PolynomialFeatures | Manually create or use I() in formula: ols("y ~ I(X1**2)", data).fit() |
Categorical Variables | Need one-hot encoding using from sklearn.preprocessing import OneHotEncoder | Automatic handling in formula: ols("y ~ C(category_col)", data).fit() |
Main Focus | Prediction | Statistical inference |
This table includes the additional functionality and provides a clearer perspective on how to handle these features in both libraries.
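As a minimal side-by-side sketch of the two workflows (the DataFrame df with columns y, X1, X2, and cat is made up purely for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sklearn.linear_model import LinearRegression

# Made-up data purely for illustration
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "X1": rng.normal(size=100),
    "X2": rng.normal(size=100),
    "cat": rng.choice(["a", "b"], size=100),
})
df["y"] = 1.0 + 2.0 * df["X1"] - 0.5 * df["X2"] + rng.normal(size=100)

# scikit-learn: X first, then y; the intercept is fitted by default
skl = LinearRegression().fit(df[["X1", "X2"]], df["y"])
print(skl.intercept_, skl.coef_)

# statsmodels array interface: y first, then X; add the constant manually
X_const = sm.add_constant(df[["X1", "X2"]])
results = sm.OLS(df["y"], X_const).fit()
print(results.params)

# statsmodels formula interface: interaction, polynomial, and categorical terms
results_f = ols("y ~ X1 * X2 + I(X1**2) + C(cat)", data=df).fit()
print(results_f.summary())
```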
Feature/Aspect | scikit-learn (LogisticRegression) | statsmodels (GLM with Binomial family) |
---|---|---|
Library Import | from sklearn.linear_model import LogisticRegression | import statsmodels.api as sm; from statsmodels.formula.api import glm (for formula-based interface) |
Model Creation | model = LogisticRegression(C=1e9) (to approximate no regularization) | model = sm.GLM(y, X, family=sm.families.Binomial()) or model = glm("y ~ X", data, family=sm.families.Binomial()) (for formula-based interface) |
Fitting Model | model.fit(X, y) | results = model.fit() |
Prediction (probabilities) | probs = model.predict_proba(X_new)[:,1] | probs = results.predict(X_new) |
Class Prediction | classes = model.predict(X_new) | classes = (probs > 0.5).astype(int) |
Coefficients & Intercept | model.coef_, model.intercept_ | results.params |
Model Summary | Not provided directly | print(results.summary()) |
Deviance, AIC, BIC | Not provided directly; must be computed manually | Available in results.summary() |
Intercept Handling | Fits an intercept by default. | Manually add a constant using sm.add_constant(X) |
Order of X and y | X first, then y | y first, then X |
Formula-based Approach | Not supported | Supported: glm("y ~ X1 + X2", data, family=sm.families.Binomial()).fit() |
Interaction Terms | Need manual creation | Formula-based: automatic with glm("y ~ X1*X2", data, family=sm.families.Binomial()).fit(); for the non-formula interface, manual creation is needed. |
Formula Intercept | N/A | Automatically includes an intercept; use - 1 in the formula to exclude it. |
Polynomial Features | Use from sklearn.preprocessing import PolynomialFeatures | Formula-based: supported with glm("y ~ I(X1**2)", data, family=sm.families.Binomial()).fit(); for the non-formula interface, manual creation is needed. |
Categorical Variables | Need one-hot encoding using from sklearn.preprocessing import OneHotEncoder | Formula-based: automatic with glm("y ~ C(category_col)", data, family=sm.families.Binomial()).fit(); for the non-formula interface, manual preprocessing is needed. |
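A parallel sketch for logistic regression, again on made-up data, comparing LogisticRegression with a GLM using a Binomial family:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import glm
from sklearn.linear_model import LogisticRegression

# Made-up binary-outcome data purely for illustration
rng = np.random.default_rng(1)
df = pd.DataFrame({"X1": rng.normal(size=200), "X2": rng.normal(size=200)})
p = 1 / (1 + np.exp(-(0.5 + 1.5 * df["X1"] - df["X2"])))
df["y"] = rng.binomial(1, p)

# scikit-learn: a very large C approximates an unregularized fit
clf = LogisticRegression(C=1e9).fit(df[["X1", "X2"]], df["y"])
probs_skl = clf.predict_proba(df[["X1", "X2"]])[:, 1]

# statsmodels GLM (array interface): add the constant manually
X_const = sm.add_constant(df[["X1", "X2"]])
results = sm.GLM(df["y"], X_const, family=sm.families.Binomial()).fit()
probs_sm = results.predict(X_const)
classes_sm = (probs_sm > 0.5).astype(int)

# statsmodels formula interface
results_f = glm("y ~ X1 + X2", data=df, family=sm.families.Binomial()).fit()
print(results_f.summary())
```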
Feature/Aspect | Logistic Regression | Lasso (L1) / Ridge (L2) |
---|---|---|
Purpose | Classification | Regression |
Library/Module | from sklearn.linear_model import LogisticRegression | from sklearn.linear_model import Lasso, Ridge |
Regularization | Supports L1, L2, and ElasticNet (combination of L1 & L2) | Lasso: L1 regularization; Ridge: L2 regularization |
Parameter for Regularization | C (inverse of regularization strength) | alpha (regularization strength) |
Loss Function | Cross-entropy loss (plus the penalty term) | Both minimize mean squared error; Lasso adds an L1 penalty, Ridge an L2 penalty |
Feature Selection | L1 can lead to feature selection | Lasso can lead to feature selection |
Hyperparameter Tuning | Mostly C and penalty (l1, l2, elasticnet) | For Lasso: mainly alpha; for Ridge: alpha |
Solver Options | Supports multiple solvers such as liblinear, saga, etc. | Lasso generally uses coordinate descent; Ridge uses a closed-form solution or iterative solvers |
Use Case | When you have a binary/multi-class classification problem | When you have a regression problem and want to incorporate regularization |
Cross-Validation | Use LogisticRegressionCV for cross-validation. Parameters include Cs (list of regularization strengths to try), cv (number of cross-validation folds), and penalty. | Use LassoCV for Lasso and RidgeCV for Ridge. Both allow you to specify alphas (list of regularization strengths to try) and cv (number of cross-validation folds). |
Common Pitfalls for Newbies | 1. Not scaling features can heavily impact performance. 2. Over-relying on default hyperparameters. 3. Misinterpreting the C parameter (smaller C means stronger regularization). | 1. Not scaling features, which can greatly affect the regularization's effectiveness. 2. Ignoring multicollinearity, which Ridge can handle but Lasso may struggle with. 3. Not checking whether important features were removed entirely by Lasso. |
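A short sketch of these estimators and their cross-validated counterparts on synthetic data (the alpha and Cs grids are arbitrary illustrative choices); note the scaling step, which matters for regularization:

```python
import numpy as np
from sklearn.datasets import make_regression, make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso, Ridge, LassoCV, RidgeCV, LogisticRegressionCV

# Synthetic regression data; scaling matters for both Lasso and Ridge
Xr, yr = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
Xr = StandardScaler().fit_transform(Xr)

lasso = Lasso(alpha=0.1).fit(Xr, yr)   # larger alpha => stronger penalty
ridge = Ridge(alpha=1.0).fit(Xr, yr)
print("coefficients zeroed by Lasso:", np.sum(lasso.coef_ == 0))

# Cross-validated choice of alpha
lasso_cv = LassoCV(alphas=[0.01, 0.1, 1.0], cv=5).fit(Xr, yr)
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0], cv=5).fit(Xr, yr)

# Synthetic classification data; smaller C => stronger regularization
Xc, yc = make_classification(n_samples=200, n_features=10, random_state=0)
Xc = StandardScaler().fit_transform(Xc)
log_cv = LogisticRegressionCV(Cs=[0.01, 0.1, 1.0, 10.0], cv=5, penalty="l2").fit(Xc, yc)

print(lasso_cv.alpha_, ridge_cv.alpha_, log_cv.C_)
```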
In scikit-learn, there are functions and classes that deal with cross-validation.

cross_val_score
Purpose: Computes the score for each CV split. It is a quick utility function for getting a feel for a model's performance and only allows a single metric.
Usage:

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# Create a linear regression model
model = LinearRegression()

# Use cross_val_score to get R^2 scores from 5-fold CV
scores = cross_val_score(model, X, y, cv=5)
```
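The return value is a plain array with one score per fold, which you can summarize yourself; a different (single) metric can be requested through the scoring argument:

```python
print(scores)                        # one R^2 value per fold
print(scores.mean(), scores.std())   # summarize across folds

# A different single metric via `scoring`
mse_scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
```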
cross_validate
Purpose: This is an extension of cross_val_score. It allows you to specify multiple metrics for evaluation and can also return train scores, fit times, and score times.

Usage:
```python
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LinearRegression

# Create a linear regression model
model = LinearRegression()

# Use cross_validate to get training and testing scores using 5-fold CV
results = cross_validate(model, X, y, cv=5, return_train_score=True,
                         scoring=['r2', 'neg_mean_squared_error'])
```
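The returned results object is a dictionary of per-fold arrays, keyed by the requested metrics plus timing information:

```python
print(results["test_r2"])                      # test-set R^2 per fold
print(results["test_neg_mean_squared_error"])  # test-set (negated) MSE per fold
print(results["train_r2"])                     # train scores (return_train_score=True)
print(results["fit_time"], results["score_time"])
```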
KFold and other splitter classes
Purpose: These are classes that generate indices to split data into train/test sets. They can be used with any custom loop or framework, giving you more flexibility over the cross-validation process.

Usage:
```python
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

# Create a 5-fold cross-validation iterator
kf = KFold(n_splits=5, shuffle=True, random_state=2023)

# Create a linear regression model
model = LinearRegression()

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    print(score)
```
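To get a single summary comparable to cross_val_score, you can collect the per-fold scores and average them; a small sketch, assuming X and y are NumPy arrays as above:

```python
import numpy as np

fold_scores = []
for train_index, test_index in kf.split(X):
    model.fit(X[train_index], y[train_index])
    fold_scores.append(model.score(X[test_index], y[test_index]))
print(np.mean(fold_scores), np.std(fold_scores))
```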
Summary:

- cross_val_score is a quick and simple function to get model scores over multiple CV splits.
- cross_validate is a more comprehensive function that can provide multiple metrics, train scores, and time metrics.
- KFold and the other splitter classes give you flexibility and control over the CV process, allowing you to integrate it with any custom framework or loop.

As data scientists, we know that computers are great at aiding in repetitive tasks.

source: https://www.goodreads.com/book/show/29437996-copying-and-pasting-from-stack-overflow
source: https://twitter.com/DataChaz/status/1642892653124624390/photo/1