# Intro module: What is Machine Learning (only slides / videos, i.e. no code)

## Learning objectives:

- familiarity with the scope of the most common ML applications in business or science
- understand the difference between memorization and generalization
- familiarity with the main machine learning concepts and vocabulary

## Content

Why and when? Example applications.

- Mention the iris example; pitch it as historical, but also as a botanical and agricultural problem. The benefit of this example is that it forces one to think about measurement, and that it has one class that is easy to separate using only one feature
    - Aurelie: real irises for the video? Artificial flowers ordered (thanks Anne)
- The "adult" dataset
    - Maybe looking at it with Excel, to be in an environment familiar to people
- Mention the importance of data visualization: intuitions about the data can be very helpful

Descriptive vs predictive analysis

- Generalization (out-of-sample properties)
- An example of where it makes a difference: if the data has redundant variables, such as expressing the education level both as the name of the degree and as the corresponding number of years of education

Learning from data vs expertly engineered decision rules

- On the iris example, show that cutting on one specific feature separates one class well
- How do we automate this? How do we achieve this on more complex data such as the census dataset?
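
The contrast between a hand-engineered rule and a rule learned from data can be sketched in a few lines of plain Python. The nine rows are the first three flowers of each species in Fisher's iris data. The 2.5 cm cut and the helper names (`is_setosa`, `accuracy`, `learn_threshold`) are illustrative assumptions, not part of the course material.

```python
# First three flowers of each species in Fisher's iris data:
# (sepal length, sepal width, petal length, petal width) in cm, then species.
IRIS_SAMPLE = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (4.9, 3.0, 1.4, 0.2, "setosa"),
    (4.7, 3.2, 1.3, 0.2, "setosa"),
    (7.0, 3.2, 4.7, 1.4, "versicolor"),
    (6.4, 3.2, 4.5, 1.5, "versicolor"),
    (6.9, 3.1, 4.9, 1.5, "versicolor"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
    (5.8, 2.7, 5.1, 1.9, "virginica"),
    (7.1, 3.0, 5.9, 2.1, "virginica"),
]
PETAL_LENGTH = 2  # column index of the single feature used by the rule


def is_setosa(row, threshold):
    """Decision rule: a short petal means the flower is a setosa."""
    return row[PETAL_LENGTH] < threshold


def accuracy(threshold, rows):
    """Fraction of rows where the rule agrees with the recorded species."""
    hits = sum(is_setosa(row, threshold) == (row[-1] == "setosa") for row in rows)
    return hits / len(rows)


def learn_threshold(rows):
    """Try every midpoint between consecutive observed petal lengths and
    keep the one with the best accuracy on the given rows."""
    values = sorted({row[PETAL_LENGTH] for row in rows})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max(candidates, key=lambda t: accuracy(t, rows))


hand_threshold = 2.5  # assumption: a cut chosen by eye by an "expert"
learned_threshold = learn_threshold(IRIS_SAMPLE)

print("hand-engineered rule accuracy:", accuracy(hand_threshold, IRIS_SAMPLE))
print("learned threshold (cm):", learned_threshold)
print("learned rule accuracy:", accuracy(learned_threshold, IRIS_SAMPLE))
```

Both rules are scored on the same nine flowers that were used to choose the cut, so these accuracies say nothing yet about new flowers.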

Generalization vs memorization: the need for a train / test split

- The nearest neighbors example to illustrate this

Supervised vs Unsupervised

- Formalize supervised learning (define "X" and "y")
- Introduce unsupervised learning, for instance dimensionality reduction (and go back to the example of redundant variables: if we have many of these, we should be able to reduce the problem without even looking at y)

Regression vs Classification

- In the adult data: it would make more sense to do a continuous prediction
- In the iris example, it is naturally a classification problem

Features and samples

- The data matrix
- Build the data matrix of Iris

A few words about the style and scope of this course: it is centered around code, though we strive to keep it simple

## Quiz:

Given a case study (e.g. pricing apartments based on a real estate website database) and a sample toy dataset:

- say whether it is an application of supervised vs unsupervised learning, classification vs regression; what are the features, what is the target variable, what is a record
- propose a hand-engineered decision rule that can be used as a baseline
- propose a quantitative evaluation of the success of this decision rule
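
Going back to the generalization-versus-memorization item of this module, here is a minimal sketch of why a train / test split is needed, using a 1-nearest-neighbor rule written in plain Python. The data is synthetic (an assumption, not one of the course datasets): `X` holds one row per sample and `y` the matching target. The helper names are hypothetical.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable


def make_samples(n):
    """Synthetic two-class data: one noisy feature whose class
    distributions overlap heavily (assumption for illustration)."""
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        X.append([rng.gauss(float(label), 1.0)])  # one row of the data matrix
        y.append(label)                           # matching target value
    return X, y


def predict_1nn(X_train, y_train, x):
    """Return the label of the training sample closest to x."""
    distances = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in X_train]
    return y_train[distances.index(min(distances))]


def accuracy(X_train, y_train, X, y):
    """Fraction of samples in (X, y) that the 1-NN rule labels correctly."""
    hits = sum(predict_1nn(X_train, y_train, x) == t for x, t in zip(X, y))
    return hits / len(y)


X, y = make_samples(400)
# Train / test split: the last 200 samples are never seen while "learning".
X_train, y_train = X[:200], y[:200]
X_test, y_test = X[200:], y[200:]

train_accuracy = accuracy(X_train, y_train, X_train, y_train)
test_accuracy = accuracy(X_train, y_train, X_test, y_test)
print(f"accuracy on training samples: {train_accuracy:.2f}")
print(f"accuracy on held-out samples: {test_accuracy:.2f}")
```

Barring exact duplicate points, the training accuracy is perfect because each training sample is its own nearest neighbor. That is memorization. Only the held-out accuracy estimates how the rule behaves on new data.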

# The Predictive Modeling Pipeline

## Notebook module #1: exploratory analysis

### Learning objectives:

- load tabular data with pandas
- visualize marginal distributions with histograms
- visualize pairwise interactions with scatter plots
- identify outliers and the dynamic range of each column

### Content

- Defining a predictive task that relates to the business or scientific case
- Pandas read_csv
- Simple exploratory data analysis with pandas and matplotlib

## Notebook module #2: basic preprocessing for minimal model fit

### Learning objectives:

- know the difference between a numerical and a categorical variable
- use a scaler
- convert category labels to dummy variables
- combine feature preprocessing and model with a pipeline
- evaluate the generalization of a model with cross-validation

### Content

- Prepare a train / test split
- Basic model on numerical features only
- Basic processing: missing values and scaling
- Use a pipeline to evaluate the model with cross-validation, with and without scaling
- Handling categorical variables with one-hot encoding
- Use the column transformer to build a pipeline with heterogeneous dtypes
- Model fitting and performance evaluation with cross-validation
    - Gael thinks that we could use a video here for cross-validation (in particular, the "plot_cv_indices" in the notebook gets a bit in the way of being accessible and didactic)

## Notebook module #3: basic parameter tuning and final test score evaluation

### Learning objectives:

- learn not to blindly trust the default parameters of scikit-learn estimators

### Content

- Parameter tuning with grid and random hyperparameter search
- Nested cross-validation
- Confirmation of performance with the final test set

# Supervised learning

## Learning objectives:

- Understand decision rules for a few important algorithms
- Know how to diagnose model generalization errors (overfitting especially)
- How to use variable selection and regularization to fight overfitting
- Feature engineering to limit underfitting

## Olivier:

Overfitting/Underfitting: validation curves, learning curves, regularization with linear models

- Video about overfitting?

## Loïc:

Trees in depth + ensembles

## Guillaume:

Evaluation of supervised learning models:

- Confusion matrix for classifiers / precision / recall / ROC curve and AUC (mention imbalanced classes)
- Predicted vs true plot for regressors

## Olivier:

Linear models in depth

- Logistic regression, linear regression, classification vs regression, multi-class, linear separability. Pros and cons
- L1 and L2 penalty for linear models
- Learning curves and validation curves (video: how to read curves)

## Baselines:

Majority class classifier (already in second module) and k-nearest neighbors

## Feature engineering to augment the expressivity of linear models:

Binning / Polynomial feature extraction / Nystroem method

Feature selection to combat overfitting and speed up models

## Univariate feature selection

Show a catastrophic example where feature selection is done on the whole dataset rather than only on the training set

## Evaluating the feature importance with permutations

Failure mode: cardinality bias of the feature importances of an overfitting random forest

## Looking at the decision function with partial dependence plots

Gael thinks that explaining the difference between conditional and marginal interpretation is important.

Stability of hyperparameters during cross-validation
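
For the evaluation item above (confusion matrix, precision, recall, imbalanced classes) and the majority-class baseline, a small plain-Python sketch may help show why accuracy alone misleads when one class is rare. All labels and predictions are hypothetical toy values, and the helper names are assumptions. In the notebooks the same quantities would more likely come from scikit-learn (`DummyClassifier`, `confusion_matrix`, `precision_score`, `recall_score`).

```python
from collections import Counter

# Hypothetical labels for 20 samples: 1 marks the rare class (2 out of 20).
y_true = [0] * 18 + [1] * 2

# Majority-class baseline: always predict the most frequent label.
majority_label = Counter(y_true).most_common(1)[0][0]
y_baseline = [majority_label] * len(y_true)

# Hypothetical predictions of some model that finds one of the two rare
# samples at the price of two false alarms.
y_model = [0] * 16 + [1] * 2 + [1, 0]


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) counts for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def summarize(y_true, y_pred):
    """Accuracy, precision and recall computed from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        # Convention chosen here: report 0.0 when the denominator is zero.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


baseline_scores = summarize(y_true, y_baseline)
model_scores = summarize(y_true, y_model)
print("majority baseline:", baseline_scores)
print("hypothetical model:", model_scores)
```

On these toy values the baseline reaches a higher accuracy (0.90 against 0.85) while never detecting the rare class (recall 0.0 against 0.5). Precision and recall expose this; accuracy does not.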
