Recap
===

## [Basics of `sklearn`](https://scikit-learn.org/stable/developers/develop.html#different-objects)

1. **Estimator**: The core of scikit-learn's interface. Any object that can estimate some parameters based on a dataset is called an estimator. For example, an estimator might be a classifier or a regressor. The estimation itself is performed by the `fit()` method.
2. **Predictor**: A type of estimator that, once trained, can make predictions on new, unseen data. Predictors implement a `predict()` method. Most classifiers and regressors in scikit-learn are predictors.
3. **Transformer**: A type of estimator that can change or "transform" the data. A transformer might clean or preprocess the data in some way. It has a `transform()` method and often also a `fit()` method if the transformation needs to learn from the data (e.g., `StandardScaler`).
4. **Model**: In scikit-learn, once an estimator (like a classifier or regressor) has been trained using the `fit()` method, it becomes a model. The `score()` method can then be used to evaluate how well the model performs on a given test dataset.
5. [**Pipeline**](https://scikit-learn.org/stable/modules/compose.html#pipeline-chaining-estimators): A way to streamline many of the routine processes in machine learning. A `Pipeline` bundles together a sequence of data processing steps and modeling, where each step is represented by a (`name`, `transform`/`predict` object) tuple. Every step in the pipeline must be a transformer, except for the last step, which can be of any type (transformer, predictor). A minimal example follows the table below.

| Term        | Definition | Key Methods | Typical Uses |
|-------------|------------|-------------|--------------|
| Estimator   | Any object that can estimate parameters based on a dataset. | `fit()` | Classification, Regression |
| Predictor   | An estimator that makes predictions on new, unseen data. | `fit()`, `predict()` | Making predictions on new data |
| Transformer | An object that can transform a dataset. Often used for data preprocessing or feature engineering. | `fit()`, `transform()` | Data preprocessing, Feature engineering |
| Model       | A specific instance of an estimator that has been trained on data. Often used to evaluate how well it performs. | `fit()`, `predict()`, `score()` | Performance evaluation, Testing, Validation |
| Pipeline    | Streamlines processes by bundling together a sequence of data processing steps and modeling. | `fit()`, `transform()`, `predict()` | Sequencing multiple steps in modeling |

See [here](https://scikit-learn.org/stable/getting_started.html) or [here](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.02-Introducing-Scikit-Learn.ipynb) for a more in-depth discussion.
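To make these roles concrete, here is a minimal sketch (on synthetic data, purely for illustration) that chains a transformer and a predictor in a `Pipeline`, then uses the fitted model's `predict()` and `score()`:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic data just for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each step is a (name, estimator) tuple; all but the last must be transformers
pipe = Pipeline([
    ("scaler", StandardScaler()),    # transformer: fit() + transform()
    ("clf", LogisticRegression()),   # predictor: fit() + predict()
])

pipe.fit(X_train, y_train)           # fit() turns the pipeline into a trained model
print(pipe.predict(X_test)[:5])      # predict() on new, unseen data
print(pipe.score(X_test, y_test))    # score() evaluates the trained model
```

Because the scaler is fitted only on the training split, bundling it in the pipeline also guards against data leakage when the pipeline is used with cross-validation.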
## Linear regression

| Feature/Aspect | scikit-learn | statsmodels |
|----------------|--------------|-------------|
| **Library Import** | `from sklearn.linear_model import LinearRegression` | `import statsmodels.api as sm` <br> `from statsmodels.formula.api import ols` (for the formula-based interface) |
| **Model Creation** | `model = LinearRegression()` | `model = sm.OLS(y, X)` <br> Or `model = ols("y ~ X", data)` (for the formula-based interface) |
| **Fitting Model** | `model.fit(X, y)` | `results = model.fit()` |
| **Prediction** | `predictions = model.predict(X_new)` | `predictions = results.predict(X_new)` |
| **Coefficients & Intercept** | `model.coef_`, `model.intercept_` | `results.params` |
| **Model Summary** | Not provided directly | `print(results.summary())` |
| **Residuals** | Computed as `y - predictions` | `results.resid` |
| **R-squared Value** | `model.score(X, y)` | `results.rsquared` |
| **Hypothesis Testing** | Not provided directly | Available in `results.summary()` |
| **Model Diagnostics** | Basic metrics; advanced diagnostics need manual computation. | Extensive diagnostics provided |
| **Intercept Handling** | `LinearRegression` fits an intercept by default. | Manually add a constant with `sm.add_constant(X)` |
| **Order of X and y** | `X` first, then `y` | `y` first, then `X` |
| **Formula-based Approach** | Not supported | Supported: `ols("y ~ X1 + X2", data).fit()` |
| **Formula Intercept** | N/A | Automatically includes an intercept. Use `- 1` in the formula to exclude it. |
| **Interaction Terms** | Need manual creation (e.g., `X['interaction'] = X['col1'] * X['col2']`) | Supported in formula: `ols("y ~ X1*X2", data).fit()` |
| **Polynomial Features** | Use `from sklearn.preprocessing import PolynomialFeatures` | Manually create, or use `I()` in formula: `ols("y ~ I(X1**2)", data).fit()` |
| **Categorical Variables** | Need one-hot encoding using `from sklearn.preprocessing import OneHotEncoder` | Automatic handling in formula: `ols("y ~ C(category_col)", data).fit()` |
| **Main Focus** | Prediction | Statistical inference |
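The following side-by-side sketch fits the same regression with both libraries; the toy data frame and column names (`x1`, `x2`, `y`) are made up for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sklearn.linear_model import LinearRegression

# Toy data just for illustration
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 1.0 + 2.0 * df["x1"] - 0.5 * df["x2"] + rng.normal(scale=0.1, size=100)

# scikit-learn: X first, then y; intercept fitted by default
sk_model = LinearRegression().fit(df[["x1", "x2"]], df["y"])
print(sk_model.intercept_, sk_model.coef_)

# statsmodels (array interface): y first, then X; add the constant manually
X = sm.add_constant(df[["x1", "x2"]])
results = sm.OLS(df["y"], X).fit()
print(results.params)

# statsmodels (formula interface): intercept is automatic
formula_results = ols("y ~ x1 + x2", data=df).fit()
print(formula_results.summary())
```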
## Logistic regression

| Feature/Aspect | scikit-learn (`LogisticRegression`) | statsmodels (`GLM` with `Binomial` family) |
|----------------|-------------------------------------|--------------------------------------------|
| **Library Import** | `from sklearn.linear_model import LogisticRegression` | `import statsmodels.api as sm` <br> `from statsmodels.formula.api import glm` (for the formula-based interface) |
| **Model Creation** | `model = LogisticRegression(C=1e9)` (to approximate no regularization) | `model = sm.GLM(y, X, family=sm.families.Binomial())` <br> Or `model = glm("y ~ X", data, family=sm.families.Binomial())` (for the formula-based interface) |
| **Fitting Model** | `model.fit(X, y)` | `results = model.fit()` |
| **Prediction (probabilities)** | `probs = model.predict_proba(X_new)[:,1]` | `probs = results.predict(X_new)` |
| **Class Prediction** | `classes = model.predict(X_new)` | `classes = (probs > 0.5).astype(int)` |
| **Coefficients & Intercept** | `model.coef_`, `model.intercept_` | `results.params` |
| **Model Summary** | Not provided directly | `print(results.summary())` |
| **Deviance, AIC, BIC** | Not provided directly; can be computed manually (see [here](https://stackoverflow.com/questions/50975774/calculate-residual-deviance-from-scikit-learn-logistic-regression-model)) | Available in `results.summary()` |
| **Intercept Handling** | Fits an intercept by default. | Manually add a constant using `sm.add_constant(X)` |
| **Order of X and y** | `X` first, then `y` | `y` first, then `X` |
| **Formula-based Approach** | Not supported | Supported: `glm("y ~ X1 + X2", data, family=sm.families.Binomial()).fit()` |
| **Interaction Terms** | Need manual creation | Formula-based: automatic with `glm("y ~ X1*X2", data, family=sm.families.Binomial()).fit()`. Non-formula: manual creation needed. |
| **Formula Intercept** | N/A | Automatically includes an intercept. Use `- 1` in the formula to exclude it. |
| **Polynomial Features** | Use `from sklearn.preprocessing import PolynomialFeatures` | Formula-based: supported with `glm("y ~ I(X1**2)", data, family=sm.families.Binomial()).fit()`. Non-formula: manual creation needed. |
| **Categorical Variables** | Need one-hot encoding using `from sklearn.preprocessing import OneHotEncoder` | Formula-based: automatic with `glm("y ~ C(category_col)", data, family=sm.families.Binomial()).fit()`. Non-formula: manual preprocessing needed. |
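A minimal sketch fitting the same logistic regression both ways, on made-up data; with `C=1e9` the scikit-learn coefficients should roughly match the unregularized GLM estimates:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

# Toy binary data just for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# scikit-learn: large C approximates an unregularized fit
sk_model = LogisticRegression(C=1e9).fit(X, y)
probs_sk = sk_model.predict_proba(X)[:, 1]    # column 1 = P(y = 1)

# statsmodels GLM: add a constant manually, pass y first
X_sm = sm.add_constant(X)
results = sm.GLM(y, X_sm, family=sm.families.Binomial()).fit()
probs_sm = results.predict(X_sm)              # already probabilities
classes_sm = (probs_sm > 0.5).astype(int)     # threshold manually for classes

print(sk_model.intercept_, sk_model.coef_)    # should roughly match results.params
print(results.params)
```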
## Diagnostic plots and formulas for adding non-linear terms

* https://www.statsmodels.org/stable/examples/notebooks/generated/linear_regression_diagnostics_plots.html
* https://www.statsmodels.org/stable/examples/notebooks/generated/formulas.html

## Regularization

| Feature/Aspect | Logistic Regression | Lasso (L1) / Ridge (L2) |
|----------------|---------------------|--------------------------|
| **Purpose** | Classification | Regression |
| **Library/Module** | `from sklearn.linear_model import LogisticRegression` | `from sklearn.linear_model import Lasso, Ridge` |
| **Regularization** | Supports L1, L2, and ElasticNet (combination of L1 & L2) | Lasso: L1 regularization; Ridge: L2 regularization |
| **Parameter for Regularization** | `C` (inverse of regularization strength) | `alpha` (regularization strength) |
| **Loss Function** | Cross-entropy loss | Both minimize squared error; Lasso adds an L1 penalty, Ridge an L2 penalty |
| **Feature Selection** | L1 can lead to feature selection | Lasso can lead to feature selection |
| **Hyperparameter Tuning** | Mostly `C` and `penalty` (`l1`, `l2`, `elasticnet`) | Mainly `alpha` for both Lasso and Ridge |
| **Solver Options** | Supports multiple solvers such as `liblinear`, `saga`, etc. | Lasso generally uses coordinate descent; Ridge uses a closed-form solution or iterative solvers |
| **Use Case** | When you have a binary/multi-class classification problem | When you have a regression problem and want to incorporate regularization |
| **Cross-Validation** | Use `LogisticRegressionCV`. Parameters include `Cs` (list of regularization strengths to try), `cv` (number of cross-validation folds), and `penalty`. | Use `LassoCV` for Lasso and `RidgeCV` for Ridge. Both allow you to specify `alphas` (list of regularization strengths to try) and `cv` (number of cross-validation folds). |
| **Common Pitfalls for Newbies** | 1. Not scaling features can heavily impact performance. <br>2. Over-relying on default hyperparameters. <br>3. Misinterpreting the `C` parameter (smaller `C` means stronger regularization). | 1. Not scaling features, which can greatly affect the regularization's effectiveness. <br>2. Ignoring multicollinearity, which Ridge can handle but Lasso may struggle with. <br>3. Not checking whether important features are completely removed by Lasso. |

## Cross validation

In `scikit-learn`, several functions and classes deal with cross-validation.

### 1. `cross_val_score`

**Purpose**: Computes the score for each CV split. It is a quick utility function for getting a feel for a model's performance and allows only a single metric.

**Usage**:
```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# Create a linear regression model
model = LinearRegression()

# Use cross_val_score to get R^2 scores from 5-fold CV
scores = cross_val_score(model, X, y, cv=5)
```
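Since `cross_val_score` returns one score per fold and accepts only a single `scoring` metric, a common follow-up is to switch the metric and summarize across folds; a minimal sketch, assuming `X` and `y` are already defined as above:

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

model = LinearRegression()

# One metric only; e.g., swap R^2 for (negated) MSE via `scoring`
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
print(scores.mean(), scores.std())   # summarize across the 5 folds
```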
### 2. `cross_validate`

**Purpose**: An extension of `cross_val_score`. It allows you to specify multiple metrics for evaluation and can also return train scores, fit times, and score times.

**Usage**:
```python
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LinearRegression

# Create a linear regression model
model = LinearRegression()

# Use cross_validate to get training and testing scores using 5-fold CV
results = cross_validate(model, X, y, cv=5, return_train_score=True,
                         scoring=['r2', 'neg_mean_squared_error'])
```

### 3. Cross-validation Iterators

**Purpose**: Classes that generate indices to split data into train/test sets. They can be used with any custom loop or framework, giving you more flexibility over the cross-validation process.

**Usage**:
```python
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

# Create a 5-fold cross-validation iterator
kf = KFold(n_splits=5, shuffle=True, random_state=2023)

# Create a linear regression model
model = LinearRegression()

# Assumes X and y are NumPy arrays (use .iloc for pandas objects)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    print(score)
```

**Summary**:
- `cross_val_score` is a quick and simple function to get model scores over multiple CV splits.
- `cross_validate` is a more comprehensive function that can provide multiple metrics, train scores, and timing information.
- Cross-validation iterators like `KFold` give you flexibility and control over the CV process, allowing you to integrate it with any custom framework or loop.
- You can use different [metrics](https://scikit-learn.org/stable/modules/model_evaluation.html) for scoring.
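The CV estimators mentioned in the Regularization table combine this machinery with hyperparameter search. A minimal sketch using `LassoCV` on synthetic data, with scaling done inside a `Pipeline` so each fold is scaled on its own training split (per the pitfalls noted above):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data just for illustration
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

# Scale inside a pipeline so CV folds never see test-split statistics
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("lasso", LassoCV(alphas=[0.01, 0.1, 1.0, 10.0], cv=5)),
])
pipe.fit(X, y)

print(pipe.named_steps["lasso"].alpha_)   # alpha chosen by cross-validation
print(pipe.named_steps["lasso"].coef_)    # some coefficients may be exactly 0
```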
## Lectures

* Course website: https://phonchi.github.io/nsysu-math524/materials and [Lab](https://github.com/phonchi/ISLP_labs/tree/main)
* For the programming patterns, reference book: [Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python](https://github.com/gedeck/practical-statistics-for-data-scientists)
* [Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook/tree/v2/notebooks_v2)
* [Machine Learning from Scratch](https://dafriedman97.github.io/mlbook/content/introduction.html)

## Basics

* https://learnxinyminutes.com/docs/python/
* [Computer Programming Course](https://phonchi.github.io/Computer_Programming/)
* [Python cheat sheet](https://gto76.github.io/python-cheatsheet/)
* [NumPy cheat sheet](https://github.com/juliangaal/python-cheat-sheet/tree/master/NumPy)
* [SciPy lectures](https://scipy-lectures.org/)

## Libraries

### Pandas and matplotlib

* [Pandas cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
* [Matplotlib cheat sheets](https://matplotlib.org/cheatsheets/)
* [Data Science Tools study guides](https://github.com/shervinea/mit-15-003-data-science-tools)

### Sklearn

* [Preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html)
* [Supervised learning](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning)
* [Model selection](https://scikit-learn.org/stable/model_selection.html)
* [Visualization](https://scikit-learn.org/stable/visualizations.html)
* [Common pitfalls](https://scikit-learn.org/stable/common_pitfalls.html#common-pitfalls-and-recommended-practices)

### Statsmodels

* [OLS](https://www.statsmodels.org/devel/generated/statsmodels.regression.linear_model.OLS.html#statsmodels.regression.linear_model.OLS)
* [OLS Results](https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLSResults.html)
* [OLS Influence](https://www.statsmodels.org/stable/generated/statsmodels.stats.outliers_influence.OLSInfluence.html)
* [GLM](https://www.statsmodels.org/devel/generated/statsmodels.genmod.generalized_linear_model.GLM.html#statsmodels.genmod.generalized_linear_model.GLM)
* [GLM Results](https://www.statsmodels.org/devel/generated/statsmodels.genmod.generalized_linear_model.GLMResults.html#statsmodels.genmod.generalized_linear_model.GLMResults)
* [Visualization](https://www.statsmodels.org/devel/graphics.html)

## Overview

As data scientists, we know that computers are great at aiding in repetitive tasks:

- We have a vast range of tools at our fingertips that enable us to be more productive and to solve more complex problems when working on any computer-related task.
- Yet many of us use only a tiny fraction of those tools. In this mini-course, I will do my best to familiarize you with the kinds of tools that may be useful in your research.

<div style="text-align: center;">
<img src="https://hackmd.io/_uploads/SkFVWlYz6.png" style="display: block; margin-left: auto; margin-right: auto; width: 60%;" />
</div>

Source: https://www.goodreads.com/book/show/29437996-copying-and-pasting-from-stack-overflow

<div style="text-align: center;">
<img src="https://hackmd.io/_uploads/HkMabetMa.png" style="display: block; margin-left: auto; margin-right: auto; width: 60%;" />
</div>

Source: https://twitter.com/DataChaz/status/1642892653124624390/photo/1

## [Search tips](https://shaform.com/csdream/docs/search/)

* [Google advanced search](https://www.google.com/advanced_search)
* https://www.google.com/search?q=resume+site:cs.cmu.edu+filetype:pdf
* https://stackexchange.com/
* https://www.kaggle.com/code
* https://github.com/search