$$
\newcommand{\t}{\text}
\newcommand{\f}{\frac}
\newcommand{\Ra}{\Rightarrow}
\newcommand{\La}{\Leftarrow}
\newcommand{\ra}{\rightarrow}
\newcommand{\la}{\leftarrow}
\newcommand{\ua}{\uparrow}
\newcommand{\da}{\downarrow}
\newcommand{\mbb}{\mathbb}
\newcommand{\mbf}{\mathbf}
\newcommand{\mc}{\mathcal}
\newcommand{\pt}{\partial}
\newcommand{\S}[2]{\sum_{#1}^{#2}}
\newcommand{\P}[2]{\prod_{#1}^{#2}}
\newcommand{\bmat}[4]{\begin{bmatrix} #1 & #2 \\ #3 & #4\end{bmatrix}}
$$
## Gradient descent
+ Optimization algorithm
+ find parameters to minimize cost function
+ Definition
+ $\theta=\theta-\alpha\frac{\partial}{\partial{\theta}}L(\theta)$
+ $\alpha$: learning rate
+ big: aggressive gradient descent (overshoot)
+ small: take tiny steps (slow)
+ gradient: the direction in which the cost function increases the fastest
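+ a minimal NumPy sketch of batch gradient descent on the squared-error loss; the synthetic data, learning rate, and iteration count are illustrative assumptions
```python
import numpy as np

# Toy data: y = 2 + 3x plus noise (illustrative assumption)
rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.uniform(-1, 1, 100)]      # bias column + one predictor
y = 2 + 3 * X[:, 1] + rng.normal(0, 0.1, 100)

theta, alpha = np.zeros(2), 0.1                       # alpha: learning rate
for _ in range(1000):
    grad = X.T @ (X @ theta - y) / len(y)             # gradient of (1/2n) * RSS
    theta = theta - alpha * grad                      # theta := theta - alpha * gradient
print(theta)                                          # approaches [2, 3]
```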
***
## Supervised learning
### Linear regression
+ Definition
+ find linear relationship between target and predictors
+ $h_{\theta}(x)=\theta_0+\theta_1{x}_1+\theta_2{x}_2+...=\theta^{T}X$
+ Cost function
+ $L=\frac{1}{2n}\sum_{i=1}^{n}(h_{\theta}(x_{i}) - y_{i})^2=\frac{1}{2n}RSS$
+ convex cost function: gradient descent converges to the global optimum
+ normal equation (least-squares)
+ $\theta^{*}=(X^{T}X)^{-1}X^{T}Y$
+ $\frac{\partial}{\partial\theta}L=0$
+ no need to choose learning rate
+ no iterations needed
+ need to compute inverse
+ slow when the number of features is large (matrix inversion is expensive)
+ Assumptions
+ linear relationship
+ error terms: $N(0,\sigma^2)$
+ independent samples
+ no severe multicollinearity among predictors
+ VIF > 5 (or 10) signals problematic collinearity
+ unstable estimates of coefficients
+ Evaluation
+ R-squared: proportion of variability in y that can be explained using x
+ $1-\frac{RSS}{TSS}=1-\frac{\sum_{i=1}^{n}[h_{\theta}(x_{i}) - y_{i}]^2}{\sum_{i=1}^{n}(y_{i}-\bar{y})^2}$
+ RMSE
+ $\sqrt{\frac{1}{n}\sum_{i=1}^{n}[h_{\theta}(x_{i})-y_{i}]^2}$
+ Regularization
+ lasso regression (L1)
+ $L_{lasso}=\Vert{Y-\theta^{T}X}\Vert^2+\lambda*\Vert{\theta}\Vert_1$
+ ridge regression (L2)
+ $L_{ridge}=\Vert{Y-\theta^{T}X}\Vert^2+\lambda*\Vert{\theta}\Vert_2^2$
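+ a short NumPy sketch of the normal equation, R-squared, and RMSE on synthetic data (for brevity the ridge closed form here also penalizes the bias term, which is usually excluded)
```python
import numpy as np

rng = np.random.default_rng(1)
X = np.c_[np.ones(50), rng.normal(size=(50, 2))]                   # bias + 2 predictors
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(0, 0.1, 50)

theta_ols = np.linalg.solve(X.T @ X, X.T @ y)                      # normal equation
lam = 1.0
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)  # ridge closed form

rss = np.sum((X @ theta_ols - y) ** 2)
tss = np.sum((y - y.mean()) ** 2)
print("R^2:", 1 - rss / tss, "RMSE:", np.sqrt(rss / len(y)))
```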
### Logistic Regression
+ Formula
+ $h_{\theta}(x)=P(y=1|x;\theta)=g(\theta^{T}x)=\frac{1}{1+e^{-\theta^{T}x}}$
+ $g(z)=\frac{1}{1+e^{-z}}$ is called sigmoid function
+ passes through $(z, g(z))=(0, 0.5)$
+ approaches 0 when $z$ is very negative, approaches 1 when $z$ is very positive
+ Non-linear decision boundary
+ $h_{\theta}(x)=g(\theta_0+\theta_1{x_1}+\theta_2{x_1}^2+...)$
+ higher order $\rightarrow$ more complex decision boundaries
+ Cost function
+ $\begin{aligned}
L&=\frac{1}{n}\sum_{i=1}^{n}I(y_{i}=1)*[-\log(h_{\theta}(x_{i}))]+I(y_{i}=0)*[-\log(1-h_{\theta}(x_{i}))] \\
&=\frac{1}{n}\sum_{i=1}^{n}-y_{i}*\log(h_{\theta}(x_{i}))-(1-y_{i})*\log(1-h_{\theta}(x_{i}))
\end{aligned}$
+ $y=1$ and $h_{\theta}(x)=1$ $\rightarrow$ cost=0
+ ($y=1$ but $h_{\theta}(x)=0$) or ($y=0$ but $h_{\theta}(x)=1$) $\rightarrow$ cost goes to $\infty$

+ convex, avoid local minimum
+ can be derived from Maximum Likelihood Estimation (MLE)
+ estimate parameters
+ Multiclass classification
+ $h_{\theta}^{(j)}(x); j=1,...,c$
+ pick the class $i$ if $h_{\theta}^{(i)}(x)>h_{\theta}^{(j)}(x)\ \forall{j}\neq{i}$
+ Pros
+ simple & interpretable
+ works well for linearly separable data
+ probabilistic output
+ Cons
+ assume linearity (features and log-odds)
+ sensitive to outliers
+ require feature scaling
+ cannot handle multicollinearity well (unstable coefficients)
+ binary by default
+ Odds, odds ratios, log-odds
+ odds = $\frac{P(y=1)}{P(y=0)}=\frac{p}{1-p}$
+ log-odds (logit) = $\log(\frac{p}{1-p})$
+ $\log(\frac{p}{1-p})=w_0+w_1x_1+...+w_nx_n$
+ odds ratios = $e^{w_i}$
+ the multiplicative change in the odds when feature $x_i$ increases by 1 unit, holding all else constant
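+ a minimal scikit-learn sketch, assuming synthetic data: the fitted coefficients are log-odds per unit change, and exponentiating them gives odds ratios
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                                   # synthetic predictors
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
print("log-odds coefficients:", clf.coef_[0])
print("odds ratios:", np.exp(clf.coef_[0]))                     # e^{w_i}
print("P(y=1|x):", clf.predict_proba(X[:2])[:, 1])              # sigmoid outputs
```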
### Decision Tree
+ Advantages
+ mimic human level thinking
+ tree can be visualized
+ no assumptions
+ no need to standardize the data
+ Disadvantages
+ unstable
+ overfitting
+ can be computationally expensive to train on large datasets
+ Common algorithms
+ CART (Classification and Regression Trees)
+ gini
+ $1-\sum_{i=1}^c{p_i^2}$
+ $p_i$: the prob. of an object being classified to a particular class
+ 0: all data belongs to single class
+ close to 1: data spread across many classes (maximum is $1-1/c$)
+ the smaller the better
+ e.g. 13 samples: 7 F (5 Yes, 2 No) and 6 M (2 Yes, 4 No); this example and the entropy example below are reproduced in the sketch at the end of this section
+ gini index of gender = $\f{\t{\# of F}}{\t{\# of Total}}\times(\t{Gini-of-F})+\f{\text{\# of M}}{\t{\# of Total}}\times(\t{Gini-of-M})$
+ $\f{\t{\# of F}}{\t{\# of Total}}=\f{7}{13}$
+ $\f{\t{\# of M}}{\t{\# of Total}}=\f{6}{13}$
+ $\t{Gini-of-F}=1-(\t{F-Yes})^2-(\t{F-No})^2=1-(5/7)^2-(2/7)^2$
+ $\t{Gini-of-M}=1-(\t{M-Yes})^2-(\t{M-No})^2=1-(2/6)^2-(4/6)^2$
+ ID3 (Iterative Dichotomiser 3)
+ entropy $H=-\sum_{c\in{C}}p(c)\log{p(c)}$
+ $H=0$: all data belongs to single class
+ $H$ can exceed 1 if there are more than two classes
+ information gain
+ difference between the parent node entropy and the weighted average entropy of the children nodes
+ How much entropy been removed
+ $IG=H(S)-\sum_{t\in{T}}p(t)H(t)$
+ $T$: subsets created from splitting $S$ by selected attribute
+ the larger the better
+ (-) tends to favor the predictor with a large number of values and split the data into lots of subsets with low entropy values
+ e.g.
Entire dataset: (Yes=9, No=5)
Breakdown by outlook: sunny (Yes=2, No=3), overcast (Yes=4, No=0), rain (Yes=3, No=2)
$\t{Entropy}(S)=-p_+\log(p_+)-p_-\log(p_-)=-\f{9}{14}\log(\f{9}{14})-\f{5}{14}\log(\f{5}{14})=0.94$
$\t{Entropy}(\t{sunny})=-\f{2}{5}\log(\f{2}{5})-\f{3}{5}\log(\f{3}{5})=0.971$
$\t{Entropy}(\t{overcast})=-\f{4}{4}\log(\f{4}{4})-0\log(0)=0$
$\t{Entropy}(\t{rain})=-\f{3}{5}\log(\f{3}{5})-\f{2}{5}\log(\f{2}{5})=0.971$
$\t{Gain}(S, \t{outlook})=0.94-(\f{5}{14}\times0.971+\f{4}{14}\times0+\f{5}{14}\times0.971)=0.246$
+ information gain ratio
+ normalize by the split entropy
+ penalize attributes with many values
+ $\f{IG}{\t{Split Entropy}}$
+ e.g.
$\t{SplitEntropy}(S, \t{outlook})=-\f{5}{14}\log(\f{5}{14})-\f{4}{14}\log(\f{4}{14})-\f{5}{14}\log(\f{5}{14})=1.577$
$\t{IGRatio}(S, \t{outlook})=\f{0.246}{1.577}\approx0.156$
+ How to choose best attribute?
+ the attribute that best classifies the training data (highest information gain / lowest impurity)
+ When to stop?
+ impurity = 0
+ all samples have identical feature values (no further split possible)
+ other stopping criteria (e.g. maximum depth, minimum samples)
+ Prevent overfitting
+ ensemble methods
+ maximum depth
+ minimum samples required to split
+ higher information gain required to split
+ minimum data points in the leaf node
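+ a small sketch that reproduces the Gini and entropy/information-gain numbers from the worked examples above (log base 2 assumed)
```python
import numpy as np

def gini(counts):
    p = np.array(counts) / sum(counts)
    return 1 - np.sum(p ** 2)

def entropy(counts):
    p = np.array(counts) / sum(counts)
    p = p[p > 0]                                  # convention: 0 * log(0) = 0
    return -np.sum(p * np.log2(p))                # log base 2 assumed

# Gender example: F = (5 Yes, 2 No), M = (2 Yes, 4 No)
print("gini of gender:", round(7/13 * gini([5, 2]) + 6/13 * gini([2, 4]), 3))

# Outlook example: S = (9 Yes, 5 No); sunny (2, 3), overcast (4, 0), rain (3, 2)
children = [[2, 3], [4, 0], [3, 2]]
ig = entropy([9, 5]) - sum(sum(c) / 14 * entropy(c) for c in children)   # ~0.246
split_entropy = entropy([5, 4, 5])                                       # ~1.577
print("IG:", round(ig, 3), "gain ratio:", round(ig / split_entropy, 3))
```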
### Random Forest
+ Bagging + decision tree
+ Algorithm
+ generate a sub-training set by bootstrapping (random sampling w/ replacement)
+ fit a tree on the sub-training set, considering a random subset of features at each split
+ repeat multiple times and aggregate the trees
+ Out-of-bag (OOB)
+ data not used to build a given tree
+ used for self-validation
+ Pros
+ high accuracy (many decision trees)
+ robust to overfitting
+ handles both classification & regression
+ works well with high-dimensional data (feature selection)
+ feature importance
+ non-linear relationship
+ handle missing values
+ Cons
+ less interpretability
+ large model size
+ slower prediction time (agg over all trees)
+ Importance
+ model selection
+ scikit-learn: how much each feature contributes to decreasing the weighted impurity, averaged over all trees
+ (-) biased, prefer continuous features or high-cardinality categorical variables
+ permutation importance algorithm
+ observe how randomly re-shuffling each predictor affects model performance (see the sketch at the end of this section)
+ algorithm
1. compute the score $s$ of model on data $D$ (e.g. accuracy for classifiers or $R^2$ for regressors)
2. for each repetition $k$ in $1,...,K$:
+ randomly shuffle column $j$ of dataset $D$ to be $\tilde{D}_{k,j}$
+ compute score $s_{k,j}$ of model on data $\tilde{D}_{k,j}$
3. compute importance $i_j=s-\frac{1}{K}\sum_{k=1}^{K}s_{k,j}$
+ pros
+ applicable to any model
+ reasonably efficient
+ reliable technique
+ no need to retrain model
+ cons
+ more computationally expensive than the impurity-based importance
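+ a minimal scikit-learn sketch of impurity-based vs permutation importance on a random forest; the synthetic dataset and hyperparameters are illustrative assumptions
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0).fit(X_tr, y_tr)
print("OOB score:", rf.oob_score_)                     # self-validation on out-of-bag data
print("impurity-based importance:", rf.feature_importances_)

# Permutation importance: mean drop in score after shuffling each column, over n_repeats
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print("permutation importance:", perm.importances_mean)
```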
### K-Nearest Neighbor (KNN)
+ Algorithm
+ calculate the distance between the test data and all the training points
+ select the $K$ points closest to the test data
+ decide the class by majority vote of the $K$ selected training points (mean of the $K$ points for regression); a minimal sketch appears at the end of this section
+ $K$ value
+ small value of $K$ leads to unstable decision boundaries
+ large $K$ leads to smooth decision boundaries
+ plot error rate~$K$, choose the value with minimum error rate
+ Distance / Similarity
+ euclidean distance: straight line distance
+ $d(x,y)=\sqrt{\sum_{i=1}^n(y_i-x_i)^2}$
+ manhattan distance
+ $d(x,y)=\sum_{i=1}^n|x_i-y_i|$
+ works well when data has many sparse or high-dimensional features
+ minkowski distance: generalization of above two distances
+ $d(x,y)=(\sum_{i=1}^n|x_i-y_i|^p)^{1/p}$
+ $p\rightarrow\infty$: chebyshev
+ $d(x,y)=\max_i|x_i-y_i|$
+ hamming distance (for categorical variables)
+ $D_H=\sum_{i=1}^nI(x_i\neq{y_i})$
+ inner product similarity
+ $Sim(x,y)=x\cdot{y}$
+ cosine similarity: focus on direction
+ $Sim(x,y)=\frac{x\cdot{y}}{\Vert{x}\Vert\Vert{y}\Vert}$
+ Performing algorithm
+ brute force
+ algorithm
+ calculate the euclidean distance from point of interest to all the points in training set
+ take class with majority points
+ k-dimensional tree (kd tree)
+ hierarchical binary tree
+ algorithm
+ rearrange the whole dataset in a binary tree structure

1. pick random dimension
2. find median
3. split data
4. repeat
+ test data traverses through the tree
+ (+) take less time than brute search
+ ball tree
+ hierarchical data structure
+ efficient, especially in the case of higher dimensions
+ algorithm


1. two clusters are created initially
+ cluster is encompassed by a circle (2D) or a sphere (3D)
2. for each point, calculate distance to centroid of each cluster and assign cluster with closer centroid
3. each cluster divides into sub clusters again and the points are classified into each cluster on the basis of distance from centroid
4. keep dividing until a certain depth is reached
+ Pros
+ easy to implement
+ simplicity
+ accuracy
+ adapt easily
+ few hyperparameters
+ k
+ distance metric
+ Cons
+ computational inefficiency
+ more memory
+ more data storage
+ sensitive to irrelevant or redundant features
+ distance-based (noisy or irrelevant features)
+ curse of dimensionality
+ does not perform well when data is high-dimensional (distances become less informative in high dimensions)
+ overfitting
+ curse of dimensionality (not enough data to deal with it)
+ $k$ too small
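+ a minimal sketch of the distance/similarity formulas plus a scikit-learn KNN classifier; the toy vectors and synthetic training data are illustrative assumptions
```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
print("euclidean:", np.sqrt(np.sum((a - b) ** 2)))
print("manhattan:", np.sum(np.abs(a - b)))
print("chebyshev:", np.max(np.abs(a - b)))
print("cosine sim:", a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
labels = (X[:, 0] + X[:, 1] > 0).astype(int)
knn = KNeighborsClassifier(n_neighbors=5, p=2).fit(X, labels)   # p=2: euclidean (minkowski)
print("prediction:", knn.predict([[0.5, 0.5]]))                 # majority vote of 5 neighbors
```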
### Support Vector Machine (SVM)
+ Find a (margin maximizing) hyperplane that best separates the features
+ Terminologies
+ support vector: the points closest to the hyperplane
+ margin: the distance between support vectors from different classes
+ hyperplane
+ used to differentiate between features
+ $y=\t{sign}(W^{T}X+b)$

+ Hard margin SVM
+ every point needs to be correctly classified
+ $y(W^T{X}+b)\geq{1}$
+ Soft margin SVM
+ almost every point could be correctly classified (skip some outliers)
+ $y(W^T{X}+b)\geq{1-\epsilon}$
+ $\min_{w, b, \epsilon}\frac{1}{2}\Vert{w}\Vert^2+C\sum_{i=1}^n\epsilon_i$ $s.t.$ $y_i(w^{T}X_i+b)\geq{1-\epsilon_i},\ \epsilon_i\geq{0};\forall{i=1,...,n}$
+ Loss function
+ hinge loss: $\max\{0, 1-y(W^T{X}+b)\}$
+ loss=0 if correctly classified
+ the data points on the hyperplane have a loss of 1
+ full objective function (with regularization)
+ $\frac{1}{2}\Vert{w}\Vert^2+C\sum_{i=1}^n\max{(0, 1-y_i(W^T{X_i}+b))}$
+ Dual form of SVM
+ $\max_\alpha\sum_{i=1}^n\alpha_i-\frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n\alpha_i\alpha_j{y_i}{y_j}(X_i^{T}X_j)$ $s.t.$ $\alpha_i\geq{0};\forall{i=1,...,n}$ and $\sum_{i=1}^n\alpha_i{y_i}=0$
+ support vector: $\alpha>0$
+ pros
+ only depend on $\alpha$ (primal: $w, b$)
+ suitable when $n$ is small
+ does not require the actual data points, only the dot product between every pair of vectors
+ kernel trick
+ transform data to a higher-dimensional space to make it linearly separable
+ replace dot product of two transformed vectors by the kernel function
+ $K(X_i, X_j)=<\phi(X_i), \phi(X_j)>$
+ no need to calculate $\phi(X)$
+ reduces the computation time
+ ex. $\sum_{i=1}^n\alpha_i-\frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n\alpha_i\alpha_j{y_i}{y_j}K(X_i, X_j)$
+ Types of kernel
+ linear kernel: data is roughly linearly separable
+ polynomial kernel: data has complicated curved border
+ $K(X_i, X_j)=(a+X_i^{T}X_j)^b$
+ radial basis function kernel (RBF) / gaussian kernel: data has no clear boundaries and contains complicated areas of overlap
+ $K(X_i, X_j)=\exp(-\gamma\Vert{X_i-X_j}\Vert^2)$
+ $\gamma\uparrow$: model overfitting
+ C and Gamma (their effect is illustrated in the sketch at the end of this section)
+ C: penalty for misclassification (inverse of the regularization strength)
+ small: the penalty for misclassification is small, selects the one with large margin
+ large: the penalty for misclassification is large, selects the one with small margin
+ Gamma: kernel function coefficient
+ can be seen as the inverse of the support vector influence radius
+ small: patterns will not be captured when too small
+ large: tends to overfit when too large
+ Pros
+ works for the dataset with very high dimensions, even when the number of dimensions is greater than the number of samples
+ works for smaller datasets (only rely on support vectors)
+ works for the dataset that is not linearly separable
+ global minimum (convex loss function)
+ fast when doing predictions
+ uses support vectors to make prediction
+ Cons
+ not easy to interpret (especially with non-linear kernels)
+ does not have predicted probability natively
+ sensitive to hyperparameters (c & gamma)
+ does not work well on large datasets (training time increases)
+ larger kernel matrix & more support vectors
+ data standardization is needed
+ sensitive to outliers, lots of noises, or overlaps between classes
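+ a short scikit-learn sketch of an RBF-kernel SVM with standardization, looping over a few (C, gamma) pairs to show their effect; the dataset and grid values are illustrative assumptions
```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)    # not linearly separable

for C, gamma in [(0.1, 0.1), (1.0, 1.0), (100.0, 10.0)]:
    # Standardize first (SVMs are scale-sensitive), then fit the RBF-kernel SVM
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=C, gamma=gamma))
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"C={C}, gamma={gamma}: CV accuracy={score:.3f}")
```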
### Interview questions
+ Describe the difference between a classification and a regression.
+ What are the assumptions of a linear regression model?
+ How to identify multicollinearity?
+ Explain the logistic regression algorithm?
+ What is the maximum likelihood estimation for logistic regression?
+ What is entropy in a decision tree model?
+ What is information gain and information gain ratio in a decision tree model?
+ What is Gini Impurity in a decision tree model?
+ What are the advantages of a decision tree model?
+ What are the disadvantages of a decision tree model?
+ How to prevent a decision tree from overfitting?
+ How does the support vector machine algorithm work?
+ What are margin, soft margin, and support vectors for support vector machine?
+ How do C and Gamma affect a support vector machine model?
+ What is the kernel trick for support vector machine model?
+ What is the hinge loss of a support vector machine model?
+ What are the advantages of a support vector machine model?
+ What are the disadvantages of a support vector machine model?
+ Explain the K-Nearest Neighbors (KNN) algorithm?
+ What is euclidean distance?
+ What is manhattan distance?
+ What is cosine similarity?
+ What is the impact of adding another predictor in your regression model?
***
## Unsupervised learning
### K-means
+ Clustering
+ group similar data points together and discover underlying patterns
+ Algorithm
+ randomly initialize $k$ cluster centroids as ${\mu_1, \mu_2, ..., \mu_k}$
+ repeat until convergence
+ for $i$ in $1, 2, ..., n$:
+ $c_i:=$ index (from $1$ to $k$) of cluster centroid closest to $x_i$
+ for $j$ in $1, 2, ..., k$:
+ $\mu_j:=$ mean of points assigned to cluster $j$
+ (special handle) centroid with no data
+ remove the centroid, end up with $k-1$ clusters
+ randomly reinitialize it (not sure when)
+ Optimization
+ distortion cost function: $L=\frac{1}{n}\sum_{i=1}^n\Vert{x_i-\mu_{c_i}}\Vert^2$
+ $\min_{\{c_1, ..., c_n\}, \{\mu_1, ..., \mu_k\}}L$
+ Random initialization
+ choose $k$ that $k<n$
+ randomly pick $k$ training examples and set $\mu_1$ to $\mu_k$ to these examples' values
+ do multiple random initializations to avoid local optimum
+ typical numbers: 50~1000
+ pick the clustering which gave the lowest distortion
+ when $k>10$, multiple random initializations are less likely to be necessary
+ How to choose $k$
+ Elbow method

+ vary $k$ and compute $L$ at a range of $k$ values
+ plot $L$~$k$
+ choose the "elbow" number of clusters
### Hierarchical clustering
+ Types
+ agglomerative (bottom-up, common)
+ start: each point is its own cluster
+ iteratively merge the closest pair of clusters
+ stop: when only one cluster remains or a stopping condition is met
+ divisive (top-down)
+ start: all points in one big cluster
+ recursively split the cluster into smaller ones
+ less common due to higher computational cost
+ Algorithm (agglomerative)
1. compute the distance matrix between all data points
2. treat each data point as a singleton cluster
3. merge the two closest clusters based on a linkage criterion
+ single linkage: shortest distance between any pair of points
+ complete linkage: farthest distance
+ average linkage: mean of all pairwise distances
+ ward's linkage: minimizes the variance within clusters

4. repeat step 3 until the desired cluster count is reached or only one cluster remains (see the sketch at the end of this section)
+ Height
+ refers to the distance between the two clusters being merged at each step
+ similar clusters merge at low heights
+ determined by the linkage function
+ e.g. A: (1, 2), B: (10, 11)
+ single linkage: 10-2=8
+ complete linkage: 11-1=10
+ where to cut to form clusters: where distance starts increasing sharply
+ Pros
+ no need to specify number of clusters in advance
+ interpretable hierarchical structure
+ different distance matrix & linkage methods
+ good for small and medium datasets
+ Cons
+ not good for scalability (worse time & memory complexity)
+ irreversible
+ sensitive to noise and outliers
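+ a tiny SciPy sketch of agglomerative clustering on the 1-D example above; the cut height of 5 is an illustrative assumption
```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

points = np.array([[1.0], [2.0], [10.0], [11.0]])     # A: (1, 2), B: (10, 11)

Z = linkage(points, method="single")                  # try "complete", "average", "ward"
print(Z)                                              # each row: merged pair, height, size

labels = fcluster(Z, t=5, criterion="distance")       # cut where the merge height jumps
print(labels)                                         # two clusters: {1, 2} and {10, 11}
```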
### DBSCAN
+ Use density to find arbitrarily shaped clusters
+ Groups together points that are close to each other (dense regions)
+ Marks points in low-density regions as outliers
+ Key concepts
+ $\epsilon$: radius of neighborhood around a point
+ $\t{minPts}$: minimum number of points required to form a dense region
+ point categories
+ core point: has at least $minPts$ points within $\epsilon$
+ border point: within $\epsilon$ of a core point, but has fewer than $\t{minPts}$ neighbors
+ noise point: neither core nor border
+ How it works
1. for each point:
+ if it's a core point, form a new cluster and expand it
+ add directly density-reachable points (i.e. within $\epsilon$)
+ recursively add reachable core points and their neighbors
2. if a point is noise, mark it but ignore it from clusters
+ Example (reproduced with scikit-learn in the sketch at the end of this section)
+ A (1, 2), B (2, 2), C (2, 3), D (8, 7), E (8, 8), F (25, 80), G (8, 9), H (7, 8)
+ $\epsilon$=2, $\t{minPts}$=3
+ step by step
1. point A has 2 neighbors: B and C -> total 3 points -> core, form cluster 1
2. expand from A: include B and C
3. B and C also have neighbors A and each other -> part of cluster 1
4. D has neighbors E, G, H -> 4 points -> core, form cluster 2
5. expand from D to E, G, H -> all become part of cluster 2
6. F has no nearby points -> noise
+ output cluster 1: (A, B, C), cluster 2: (D, E, G, H), noise: F

+ Pros
+ does not require number of clusters as input
+ can find arbitrarily shaped clusters
+ robust to noise and outliers
+ good for spatial or irregular data
+ Cons
+ difficult to choose hyperparameters ($\epsilon, \text{minPts}$)
+ sensitive to the scale of data (needs normalization)
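+ a minimal scikit-learn sketch that reproduces the worked example above (label numbering may differ; -1 marks noise)
```python
import numpy as np
from sklearn.cluster import DBSCAN

# Points A-H from the example above
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80], [8, 9], [7, 8]])

db = DBSCAN(eps=2, min_samples=3).fit(X)
print(db.labels_)    # expected: [0 0 0 1 1 -1 1 1] -> clusters (A,B,C), (D,E,G,H), noise F
```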
### Interview questions
+ Explain the K-means algorithm?
+ What is an autoencoder?
***
## Regularization
+ Overfitting
+ underfitting: high bias
+ fit more complex model
+ overfitting: high variance
+ bad for generalization
+ address overfitting
+ reduce the number of features
+ (-) lose information
+ regularization
+ keep all features, reduce magnitude of $\theta$
+ Parameter norm penalty
+ $L_{new}=L+\lambda\Omega(\theta)$ with $\lambda\geq{0}$
+ $\lambda=0$: no regularization
+ $\lambda\rightarrow\infty$: minimize $\Omega(\theta)$
+ L2 regularization
+ $\Omega(\theta)=\frac{1}{2}\Vert\theta\Vert_2^2$
+ $\theta$ have small norm
+ (-) model complexity $\downarrow$ but the number of variables stays the same ($\theta\neq0$)
+ linear regression example
+ $L_{new}=\frac{1}{2n}\sum_{i=1}^{n}(h_{\theta}(x_{i})-y_{i})^2+\lambda\sum_{j=1}^{k}\theta_{j}^2$
+ gradient descent
+ $\theta=\theta(1-\alpha\frac{\lambda}{n})-\frac{\alpha}{n}\sum_{i=1}^n(h_{\theta}(x_{i})-y_{i})x_{i}$
+ $(1-\alpha\frac{\lambda}{n})$ shrinks $\theta$
+ L1 regularization
+ $\Omega(\theta)=\Vert{\theta}\Vert_1=\sum_{i}|\theta_i|$
+ (+) sparse solution, some $\theta_i=0$ (feature selection)
+ Elastic Net
+ combine L1 and L2 regularizations
+ address limitations of lasso regression (L1)
+ lasso tends to pick one feature from a group of highly collinear features and ignore the others
+ $\Omega(\theta)=\lambda_2\Vert\theta\Vert_2^2+\lambda_1\Vert\theta\Vert_1$
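+ a short scikit-learn sketch contrasting L1 and L2 penalties; the synthetic data and alpha values are illustrative assumptions
```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 200)   # only 2 informative features

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print("lasso coefs:", np.round(lasso.coef_, 2))   # many exact zeros (feature selection)
print("ridge coefs:", np.round(ridge.coef_, 2))   # shrunk but non-zero
```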
### Interview questions
+ What's the difference between L1 (LASSO) and L2 (Ridge) regression?
***
## Ensemble Learning
### Bagging
+ bootstrap aggregating
+ focuses on getting a model with lower variance
+ (+) generalization
+ fit the different learners independently (can be parallelised)
+ bootstrapping
+ generate sample of size $B$ from an initial dataset of size $N$ by randomly drawing with replacement $B$ observations
+ representative and independent samples of true data distribution (almost i.i.d samples)
+ algorithm
1. create multiple bootstrap samples (almost independent datasets)
2. fit weak learner for each of samples
3. aggregate learners to be the final model
+ regression: average the outputs
+ classification
+ hard-voting: return the class that receives the majority of the votes
+ soft-voting: average the probability for each class then return the one with highest average probability
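+ a minimal scikit-learn sketch of bagging with an out-of-bag estimate; the dataset and number of estimators are illustrative assumptions (the default base learner is a decision tree)
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Bootstrap samples -> one tree per sample -> aggregate by voting;
# points left out of each bootstrap sample give a free (OOB) validation estimate
bag = BaggingClassifier(n_estimators=50, oob_score=True, random_state=0).fit(X, y)
print("OOB accuracy:", bag.oob_score_)
```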
### Boosting
+ multiple weak classifiers $\rightarrow$ strong classifier
+ focuses on lowering bias
+ can not be done in parallel
+ algorithm
1. build the first model from the training data
2. create the second model that attempts to correct the errors from the first model
3. repeat multiple times until the training set is predicted perfectly or a maximum number of models is reached
+ AdaBoost
+ most common: AdaBoost + decision stump (decision tree with one level)
+ algorithm
1. assign data points with same weights, $w_i=\frac{1}{n},i=1,...,n$
2. for $t=1$ to $T$:
+ fit a classifier $g_t(x)$ using $w_i$
+ compute $\epsilon_t=\frac{\sum_{i=1}^n{w_{i}I(y_i\neq{g_t(x_i)})}}{\sum_{i=1}^n{w_i}}$
+ compute $\alpha_t=\frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$
+ set $w_i\leftarrow{w_i}*\exp[-\alpha_t{g_t(x_i){y_i}}];i=1,...,n$
3. output $G(x)=sign[\sum_{t=1}^T\alpha_t{g_t(x)}]$
+ weighted avg for regression
+ Gradient boosting
+ from high bias to low bias model
+ focus on residuals from the previous model
+ negative gradient: the direction & magnitude in which the loss function can be decreased
+ algorithm (a from-scratch sketch for squared loss appears at the end of this section)
1. initial model with constant value $F_0(x)=\arg\min_\theta\sum_{i=1}^nL(y_i,\theta)$
2. for $t=1$ to $T$:
+ compute residuals $r_{it}=-[\frac{\partial{L(y_i,F(x_i))}}{\partial{F(x_i)}}]_{F(x)=F_{t-1}(x)};\forall{i=1,...,n}$
+ train a tree $h_t(x)$ to residuals, i.e. train using set $\{(x_i, r_{it})\}_{i=1}^n$
+ compute $\theta_{t}=\arg\min_{\theta}\sum_{i=1}^nL[y_i,F_{t-1}(x_i)+\theta{h_t(x_i)}]$
+ update model $F_t(x)=F_{t-1}(x)+\theta_{t}h_t(x)$
3. output $F_T(x)$
+ Pros
+ high accuracy
+ handle different data types
+ feature importance
+ reduce bias and variance
+ Cons
+ training time could be slow (especially with large datasets and many trees)
+ prone to overfitting
+ less interpretable
+ memory usage
+ complex tuning (hyperparameters)
+ XGBoost (Extreme Gradient Boosting)
+ scalable and accurate
+ focuses on speed and performance
+ level-wise
+ grows all leaves at the same depth
+ more conservative -> less risk of overfitting
+ LightGBM
+ histogram-based algorithms for faster training on large datasets
+ leaf-wise
+ grows the leaf with the largest loss reduction
+ can lead to deeper trees
+ pros
+ faster training speed
+ lower memory usage
+ better accuracy
+ compatibility with large datasets
+ structural differences with XGBoost
+ LightGBM uses a novel technique of Gradient-based One-Side Sampling (GOSS) to filter out the data instances for **finding a split value** while XGBoost uses pre-sorted algorithm & Histogram-based algorithm for computing the best split
+ Level-wise and leaf-wise
+ level-wise
+ expands all nodes level by level
+ in each iteration, all leaves at the current depth are split
+ tends to create balanced trees
+ pros
+ easier to control overfitting
+ more stable and conservative
+ cons
+ computation on splits that don't reduce much error
+ slower and less accurate in some cases
+ leaf-wise
+ always splits the leaf with the maximum loss reduction
+ can result in deep, unbalanced trees
+ more focuses on reducing training error quickly
+ pros
+ faster and more accurate
+ focuses computation where it reduces error most
+ cons
+ more prone to overfitting
+ needs constraints like "max_depth" or "num_leaves"
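+ a from-scratch sketch of gradient boosting for squared loss (negative gradient = residual); the data, learning rate, tree depth, and number of rounds are illustrative assumptions
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 300)

F = np.full(len(y), y.mean())              # F_0: best constant model for squared loss
lr, trees = 0.1, []
for t in range(100):
    residuals = y - F                      # pseudo-residuals (negative gradient)
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    trees.append(tree)
    F = F + lr * tree.predict(X)           # F_t = F_{t-1} + lr * h_t
print("training MSE:", np.mean((y - F) ** 2))
```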

### Stacking
+ heterogenous weak learners (different learning algorithms)
+ combine learners by training meta-model
+ algorithm (use 2 folds as example; see the sketch at the end of this section)
1. split training data into 2 folds
2. choose $L$ weak learners and fit them to data of the first fold
3. for each of the $L$ weak learners, make predictions for observations in the second fold
4. fit the meta-model on the second fold, using predictions made by the weak learners as inputs
+ multi-levels stacking
+ 3-levels stacking
+ first layer: fit the $L$ weak learners
+ second layer: fit $M$ meta-models
+ third layer: fit a last meta-model that takes as inputs the predictions returned by the $M$ meta-models of previous layer
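+ a minimal scikit-learn sketch of stacking; the choice of weak learners, meta-model, and cv=2 (mirroring the 2-fold description above) are illustrative assumptions
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Heterogeneous weak learners; out-of-fold predictions feed the meta-model
stack = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier()), ("tree", DecisionTreeClassifier())],
    final_estimator=LogisticRegression(),
    cv=2,
)
print("CV accuracy:", cross_val_score(stack, X, y, cv=5).mean())
```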
### Interview questions
+ What is bagging v.s. boosting?
+ How bagging and boosting model changes in variance and bias when the sample size increases?
***
## Normalization
### Pros
+ improve model performance
+ each feature is treated equally during training
+ reduce the impact of outliers
+ ensure data is on the same scale
+ faster convergence
+ gradients computed during backpropagation are more consistent across different features
### Relationship with various algorithms
+ Gradient descent-based algorithms
+ require data to be scaled
+ features with different ranges could cause different step sizes for each feature
+ Distance-based algorithms
+ use distances between data points to determine similarity
+ sensitive to the range of features
+ Tree-based algorithms
+ fairly insensitive to the scale of features
### Methods
+ Z normalization (standardization)
+ $\hat{X_i}=\frac{X_i-\mu}{\sigma}$
+ pros
+ mean $=0$ & std $=1$
+ used when data is gaussian
+ less sensitive to outliers
+ cons
+ unbounded range
+ Min-Max normalization
+ $\hat{X_i}=\frac{X_i-\min{X}}{\max{X}-\min{X}}$
+ pros
+ rescale data to a range $=[0, 1]$
+ preferred when min and max are known and meaningful
+ useful for algorithms that assume data is in a uniform scale (NN, kNN)
+ cons
+ sensitive to outliers
+ assume the min and max do not change over time
+ Unit vector normalization
+ $\hat{X_i}=\frac{X_i}{\Vert{X}\Vert}$
+ pros
+ shrink data to a unit sphere
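+ a short scikit-learn/NumPy sketch of the three scalings on a toy matrix (illustrative values), applied per feature (column)
```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0], [10.0, 1000.0]])

print(StandardScaler().fit_transform(X))    # per-feature z-scores: mean 0, std 1
print(MinMaxScaler().fit_transform(X))      # per-feature rescaling to [0, 1]
print(X / np.linalg.norm(X, axis=0))        # per-feature unit-vector scaling
```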
### Interview questions
+ What are the benefits of scaling a dataset for machine learning models?
***
## Evaluation
### Overfitting and underfitting

+ Overfitting
+ capture the noise of the data
+ complicated model
+ high variance & low bias
+ prevention
+ simpler model
+ cross validation
+ more data
+ feature selection
+ early stopping
+ regularization
+ ensemble methods
+ Underfitting
+ cannot capture the underlying trend of data
+ not enough data
+ too simple model
+ low variance & high bias
+ prevention
+ more data
+ more complicated model
### Evaluation metrics
+ Accuracy
+ $\f{\t{\# Correct Predictions}}{\t{\# Total Predictions Made}}=\f{TP+TN}{TP+TN+FP+FN}$
+ work well when balanced samples
+ Log loss
+ $-\f{1}{n}\S{i=1}{n}\S{j=1}{k}{y_{ij}\log{p_{ij}}}$
where,
$y_{ij}$ indicates whether sample $i$ belongs to class $j$
$p_{ij}$ indicates the probability of sample $i$ belonging to class $j$
$n$ is sample size
$k$ is the number of classes
+ range $[0, \infty)$
+ nearer to 0: higher accuracy
+ away from 0: lower accuracy
+ AUC - ROC (Area Under ROC Curve)

+ for binary classification
+ plots FPR (x-axis) & TPR (y-axis) combinations at different thresholds
+ False Positive Rate (FPR) $=\f{FP}{FP+TN}$
+ True Positive Rate (TPR) $=\f{TP}{TP+FN}$
+ $[0.5, 1]$, the greater the area, the better the performance of model
+ Kolmogorov-Smirnov (KS) chart
+ evaluate how well model distinguishes between two classes
+ distance between positive & negative class distributions (CDF)
+ x-axis: the probability of being classified as "positive"; y-axis: cumulative counts (or %)
+ range $[0, 1]$, the greater the better

+ KS statistic
+ maximum difference between the cumulative distribution functions (CDFs) of positive and negative classes
+ $KS=\max_x|F_1(x)-F_0(x)|$
+ strong model: 0.6+
+ Precision (positive predictive value)
+ the percentage of predicted positive events that are actually positive
+ $\f{TP}{TP+FP}$
+ should be used when the cost of false positive is high
+ Recall (sensitivity)
+ the percentage of positive events captured out of all positive events
+ $\f{TP}{TP+FN}$
+ should be used when the cost of a false negative is high (missing a true positive is costly)
+ precision-recall curve
+ evaluate the quality of the model by measuring how well it can identify positive cases while avoiding false positives
+ precision (y-axis) & recall (x-axis)
+ F1 score
+ $\f{2}{1/\t{precision}+1/\t{recall}}$
+ range $[0, 1]$, the greater the better
+ harmonic mean between precision & recall
+ different types of average
+ macro average
+ unweighted average, treats all the classes equally, regardless of the number of samples in each class
+ weighted average
+ weighted by the number of samples in each class
+ more useful when the dataset is imbalanced
+ micro average
+ global average
+ count the sums of TP, FP, TN, FN
+ Mean absolute error (MAE)
+ $\f{1}{n}\S{i=1}{n}|y_i-\hat{y_i}|$
+ no information of direction
+ Mean square error (MSE)
+ $\f{1}{n}\S{i=1}{n}(y_i-\hat{y_i})^2$
+ easier to compute gradient (compared to MAE)
+ Matthews correlation coefficient (MCC)
+ $\f{TP\times{TN}-FP\times{FN}}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$
+ range $[-1, 1]$: 1 = perfect, 0 = random, -1 = total disagreement
+ informative even when classes are imbalanced
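+ a minimal scikit-learn sketch computing several of the metrics above; the labels and predicted probabilities are made-up illustrative values
```python
from sklearn.metrics import (accuracy_score, f1_score, log_loss, matthews_corrcoef,
                             precision_score, recall_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.8, 0.4, 0.1, 0.3, 0.7]          # predicted P(y=1)
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]            # threshold at 0.5

print("accuracy:", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_prob))
print("log loss:", log_loss(y_true, y_prob))
print("MCC:", matthews_corrcoef(y_true, y_pred))
```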
### Interview questions
+ What is bias variance tradeoff?
+ bias
+ error from overly simple model assumptions; high bias means the model cannot fit the training data well
+ variance
+ error from sensitivity to the training set; high variance means the model does not generalize well to new data
+ What is overfitting?
+ How to correct overfitting?
+ What is ROC or AUC?
+ What is precision?
+ What is recall?
+ What is a precision-recall curve?
+ What is F1 score?
+ What is the difference between macro and weighted average?
+ What is MCC?
***
## Principal Component Analysis (PCA)
### What is PCA?
+ Transform high-dimensional data into lower-dimensions while retaining as much information as possible
+ Orthogonal linear transformation that transforms data to a new coordinate system
+ the greatest variance by some scalar projection of the data comes to lie on the first coordinate (PC1)
+ the second greatest variance on second coordinate (PC2), and so on
### How does PCA work?
+ Understand data
+ the greater the variance, the more the information
+ variance is an objective way to quantify the amount of information in our data
+ Summarize data
+ combine variables to create brand new variables, PC1, PC2, ..., and so on
+ e.g. PC1 = 70% height + 30% weight
+ choose the least number of principal components that would explain the most amount of our original data
+ visual aid: scree plot

+ bar chart: the proportion of variance explained by each of the principal components
+ line chart: cumulative sum of explained variance
+ Procedure
1. take the whole dataset consisting of $d+1$ dimensions and ignore the labels s.t. our new dataset becomes $d$ dimensional
2. compute the mean for every dimension of the whole dataset
3. compute the covariance matrix of the whole dataset
4. compute eigenvectors and the corresponding eigenvalues
+ eigenvectors: principal components
+ eigenvalues: the amount of variance explained by each principal component
+ $\det(\mbf{A}-\lambda\mbf{I})=0$
+ $\mbf{A}v=\lambda{v}$
5. sort the eigenvectors by decreasing eigenvalues and choose $k$ eigenvectors with the largest eigenvalues to form a $d\times{k}$ dimensional matrix $\mbf{W}$
+ eigenvectors will form the axes of the new feature subspace, but only define the directions because they all have unit length 1
+ the eigenvectors with the lowest eigenvalues bear the least information
6. use this $d\times{k}$ eigenvector matrix to transform the samples onto the new subspace
+ $x_{new}=\mbf{W}^Tx$
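+ a NumPy sketch following the procedure above on synthetic correlated data (the mixing matrix and k=2 are illustrative assumptions)
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[3, 0, 0], [1, 1, 0], [0, 0, 0.1]])  # correlated

Xc = X - X.mean(axis=0)                       # step 2: center every dimension
cov = np.cov(Xc, rowvar=False)                # step 3: covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)        # step 4: eigenvalues / eigenvectors

order = np.argsort(eigvals)[::-1]             # step 5: sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
print("explained variance ratio:", eigvals / eigvals.sum())

W = eigvecs[:, :2]                            # keep k=2 components (d x k matrix)
X_new = Xc @ W                                # step 6: project onto the new subspace
print(X_new.shape)                            # (200, 2)
```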
### Interview questions
+ What is PCA?
***
## Feature Selection
### Filter methods
+ Measure the relevance of features based on statistical properties
+ Common methods
+ Pearson's correlation coefficient
+ measure the linear relationship between two continuous variables
+ $[-1, 1]$
+ Spearman's correlation coefficient
+ measure the monotonic ranking relationship between two continuous or ordinal variables
+ $[-1, 1]$
+ 1: strong positive correlation
+ -1: strong negative correlation
+ 0: no correlation
+ Chi-Squared test
+ measure the relationship between categorical variables
+ p-value $\da$ indicates a significant relationship between variables
+ mutual information
+ quantifies the amount of information obtained about one variable through the other variable
+ 0: variables are independent
+ higher value: stronger dependence
+ variance threshold
+ removes features with low variance
### Wrapper methods
+ Using a specific machine learning model to evaluate the importance of features
+ Common methods
+ forward feature selection
+ starts with empty set of features and iteratively adds the most significant features based on a given evaluation metric
+ backward feature elimination
+ starts with all features and iteratively removes the least significant features based on a given evaluation metric
+ recursive feature elimination
+ ranks features based on their importance assigned by the model, removes the least important feature, and refits the model in each iteration
+ exhaustive feature selection
+ evaluates all possible feature combinations and selects the best subset
+ computationally expensive
### Embedded methods
+ Combines the qualities of both filter and wrapper methods by incorporating feature selection as part of the model training process
+ Selects features based on the model's internal evaluation
+ Common methods
+ LASSO regression
+ ridge regression
+ Elastic net
+ random forest / XGBoost
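+ a short scikit-learn sketch of one filter method (mutual information) and one wrapper method (RFE); the dataset and k=3 are illustrative assumptions
```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)

# Filter: keep the k features with the highest mutual information with y
filt = SelectKBest(mutual_info_classif, k=3).fit(X, y)
print("filter keeps features:", filt.get_support(indices=True))

# Wrapper: recursive feature elimination with a logistic-regression model
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)
print("RFE keeps features:", rfe.get_support(indices=True))
```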
### Interview questions
+ How would you do feature selection?
***
## Time Series
### ARIMA
+ AR (AutoRegressive)
+ uses past values to predict the current value
+ $y_t=\phi_1y_{t-1}+\phi_2y_{t-2}+...+\epsilon_t$
+ I (Integrated)
+ make data stationary by differencing
+ $y_t'=y_t-y_{t-1}$
+ MA (Moving Average)
+ uses past errors (residuals) to correct predictions
+ $y_t=\theta_1\epsilon_{t-1}+\theta_2\epsilon_{t-2}+...+\epsilon_t$
+ ARIMA(p, d, q)
+ $y_t'=(1-B)^dy_t$
+ $y_t'=\phi_1y_{t-1}'+\phi_2y_{t-2}'+...+\phi_py_{t-p}'+\epsilon_t+\theta_1\epsilon_{t-1}+...+\theta_q\epsilon_{t-q}$
+ $\Phi(B)(1-B)^dy_t=\Theta(B)\epsilon_t$
+ p: number of AR terms
+ d: number of differences
+ q: number of MA terms
+ Assumptions
+ stationarity: mean, variance, and autocorrelation are constant over time
+ ADF test (H1: the series is stationary)
+ linearity
+ no seasonality
+ errors are white noise
+ Choose p & q
+ ACF (autocorrelation function) for q
+ PACF (partial autocorrelation function) for p
+ AIC/BIC (lower is better)
+ When to use
+ data has a trend, but not strong seasonality
+ simple & interpretable model
+ Pros
+ interpretable
+ good for short-term forecasting (linear & no strong seasonality)
+ low data requirements
+ Cons
+ not good with seasonality
+ only for univariate data
+ assumes linear relationships
+ parameter tuning
+ ARIMAX (ARIMA with eXogenous variables)
+ ARIMA model + external features
+ ARIMA model only use past y & errors
+ $y_t'=\t{AR terms}+\t{MA terms}+\beta{x_t}+\epsilon_t$
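+ a minimal statsmodels sketch of the ADF test and an ARIMA(1,1,1) fit on a synthetic trending series; the series and the chosen order are illustrative assumptions
```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
y = pd.Series(np.cumsum(rng.normal(0.5, 1.0, 200)))    # random walk with drift

print("ADF p-value:", adfuller(y)[1])                  # large p-value -> difference (d >= 1)

res = ARIMA(y, order=(1, 1, 1)).fit()                  # p=1, d=1, q=1
print("AIC:", res.aic)                                 # compare candidate orders via AIC/BIC
print(res.forecast(steps=5))                           # short-term forecast
```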
### SARIMA
+ ARIMA with seasonal components
+ SARIMA(p,d,q)x(P,D,Q,s)
+ $\Phi_P(B^s)\phi(B)(1-B)^d(1-B^s)^Dy_t=\Theta_Q(B^s)\theta_q(B)\epsilon_t$
+ P,D,Q: seasonal AR, I, MA orders
+ s: seasonal period
+ e.g. 12 for monthly data with yearly seasonality
+ When to use
+ time series has seasonal patterns
+ ACF/PACF plots show spikes at seasonal lags
+ standard ARIMA isn't capturing repeated cycles
+ Pros
+ handles both trend & seasonality
+ interpretable
+ Cons
+ more parameters -> harder to tune
+ assumes fixed seasonality
### Exponential smoothing models
+ Exponential smoothing
+ used when data has no trend and no seasonality
+ $\hat{y}_{t+1}=\alpha{y_t}+(1-\alpha)\hat{y}_t$
+ Holt's linear trend method
+ used when data has trend but no seasonality
+ level: $l_t=\alpha{y_t}+(1-\alpha)(l_{t-1}+b_{t-1})$
+ trend: $b_t=\beta(l_t-l_{t-1})+(1-\beta)b_{t-1}$
+ forecast: $\hat{y}_{t+h}=l_t+hb_t$
+ Holt-Winters method
+ used when data has trend & seasonality
### Prophet model
+ open-source forecasting tool
+ trend / seasonality / holiday or special events / missing data or outliers
+ $y(t)=g(t)+s(t)+h(t)+\epsilon_t$
+ $g(t)$: trend component (linear or logistic growth)
+ $s(t)$: seasonal component (yearly, weekly, daily)
+ $h(t)$: holiday effects (custom events)
+ $\epsilon_t$: random noise
+ Pros
+ easy to use
+ automatically handles trend & seasonality
+ robust to missing data & outliers
+ support holidays & special events
+ interpretable
+ each component can be visualized and understood separately
+ quick to train
+ work well for business time series
+ Cons
+ less accurate on complex patterns (e.g. nonlinear or interactions)
+ designed for univariate data
+ assumes additive structure
+ need enough data
+ seasonality must be regular
### Python library
+ data preprocessing
+ pandas: resample, shift, rolling, ...
+ datetime
+ modeling
+ statsmodels: ARIMA, SARIMA, Exponential Smoothing, ADF test
+ pmdarima: Auto-ARIMA
+ scikit-learn: regression, classification, feature scaling
+ Prophet: prophet model
+ TensorFlow & PyTorch: LSTM
***
## Causal Inference
### Motivations
+ rely on causation to make things happen
+ observation: x% users churned last month
+ causation: because of Y (e.g. pricing, feature)
+ action: change Y to prevent churn
+ A/B testing doesn't always work
+ social media: users interact with each other, so treatment can spill over (interference)
+ marketplaces (Uber, DoorDash, Airbnb): treatment and control share the same pool of resources
### Problems with observational data
+ e.g. How does harmful content in News Feed impact users?
+ News Feed is personalized
+ exposure level, engagement level

+ selection bias
+ observation (what we get): E[engagement of exposed users] - E[engagement of unexposed users]
+ causal effect (what we want): E[engagement of exposed users] - E[engagement of exposed users had they not been exposed]
+ selection bias (observation-causal effect): E[engagement of exposed users had they not been exposed]-E[engagement of unexposed users]
+ positive bias: treatment is better off anyways (illusory causation)
+ negative bias: treatment is worse off to begin with (hide true causation)
### Regression
+ model: engagement ~ exposure + age + country + education + ...
+ be sure to include all relevant controls
+ interpret partial slope
+ how much being exposed to harmful content affects engagement when other control variables are held at any constant values
+ $\hat{\t{engagement}_0}=b_0+b_1(\t{exposure=0})+b_2\bar{\t{controls}}$
+ $\hat{\t{engagement}_1}=b_0+b_1(\t{exposure=1})+b_2\bar{\t{controls}}$
+ $b_1=\hat{\t{engagement}_1}-\hat{\t{engagement}_0}$

+ pitfalls in regression
+ omitted-variable bias: relevant variables left out
+ functional form: controls have non-linear influence on outcome
+ causal structure
+ mediator (middle of a chain): controlling for # posts shared $\ra$ # of friends no longer has influence on exposure to harmful content
+ colliders (common effect): conditioning on time spent late at night $\ra$ "spurious correlation" between popularity and insomnia

### Matching


### Counterfactuals
+ just a few treated units
+ rare: happens infrequently
+ large units: impact entire city, country, states
+ e.g. policy change, elections, pandemics
+ regression/matching wouldn't work
+ counterfactual theory
+ X causes Y: if X had not occurred, Y would not have occurred
### Difference-in-differences

+ check assumptions

### Synthetic control


+ how to decide weights

### Regression Discontinuity Design (RDD)
+ natural experiments
+ happens only once in a lifetime
+ instrumental variables

### Summary

+ https://www.yuan-meng.com/posts/causality/
### Interview questions
+ What is a Directed Acyclic Graph (DAG)?
+ What are confounders in causal inference?
+ What is counterfactual in causal inference?
+ What is Average Treatment Effect (ATE) in causal inference?
+ What is one-to-one confounder matching for causal inference?
+ What are the differences between one-to-one confounder matching and propensity score matching (PSM) for causal inference?
+ What is Inverse Probability Treatment Weighting (IPTW) in causal inference?
+ What is difference-in-difference for causal inference?
+ What are the assumptions for causal inference?
+ How to use an instrumental variable for causal inference?
***
## Practice
### Interview questions
+ How to handle missing data in a modeling dataset?
+ remove rows with missing values.
+ mean/median/mode imputation.
+ reduce variability but introduce bias.
+ regression imputation.
+ predict missing values.
+ (+) maintain relationships between variables in the dataset.
+ might lead to overfitting.
+ How to identify outliers?
+ 1.5 * IQR.
+ prediction.
+ How to deal with outliers in a modeling dataset?
+ remove outliers.
+ cap or truncate outliers.
+ transform the data.
+ use robust statistical methods.
+ investigate the cause.
+ What are the methods to handle imbalanced data?
+ oversampling.
+ undersampling.
+ cost-sensitive learning.
+ algorithms robust to imbalanced data (e.g. boosting).