# Machine Learning Share (2018/04/24)
###### tags: `shared` `technical`
[TOC]
## Curse of dimensionality
+ Think of distances between samples measured in the feature space
+ **High-dimensional features $\rightarrow$ samples far apart**
+ All samples end up (almost equally) far away from each other
+ Distances then give little clue for prediction/clustering
+ Need an efficient way to select useful/informative features
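A quick numerical sketch of this distance-concentration effect (NumPy/SciPy assumed; the sample counts and dimensions are arbitrary illustrations):

```python
# As dimensionality grows, the nearest and farthest random samples become
# almost equally far away, so distances carry little information.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(200, d))              # 200 random samples in d dimensions
    dists = pdist(X)                            # all pairwise Euclidean distances
    spread = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative spread (max-min)/min = {spread:.3f}")
```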
## Feature selection vs. feature extraction
+ Feature **selection**
+ Original features "preserved"
+ Know which ones to **keep** and which ones to **drop**
+ e.g., Recursive Feature Elimination (RFE)
+ Feature **extraction**
+ Original features not "preserved"
+ Features are **transformed** into scores or indices
+ e.g., Principal Component Analysis (PCA)
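A minimal sketch of the selection/extraction contrast above, using scikit-learn's RFE and PCA as examples (the dataset and model choices here are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)

# Selection: original features preserved; we learn which ones to keep/drop.
rfe = RFE(LinearSVC(dual=False), n_features_to_select=5).fit(X, y)
print("kept feature indices:", [i for i, keep in enumerate(rfe.support_) if keep])

# Extraction: original features not preserved; each new feature is a
# transformed score (a linear combination of all original features).
pca = PCA(n_components=5).fit(X)
print("transformed shape:", pca.transform(X).shape)
```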
## Three categories for feature selection
+ Filter methods
+ Each feature is fed into a **scoring function**
+ Often select the top-scoring features as the final feature set
+ Scoring function is **independent** of the following classifier
+ e.g., majority vote with weights being p-values, SNRs, variances, etc.
+ Wrapper methods
+ Often adopted together with a classifier
+ The scoring function is "specialized" for the following classifier
+ e.g., RFE with SVM (SVM-RFE)
+ Embedded methods
+ Feature selection is built into the classifier's training procedure
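A minimal sketch of the three categories, again with scikit-learn standing in as the toolkit (the specific scorers, models, and parameters are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter: score each feature independently of any classifier (ANOVA F-test here).
filt = SelectKBest(score_func=f_classif, k=10).fit(X, y)

# Wrapper: the scoring is specialized for the downstream classifier (SVM-RFE).
wrap = RFE(LinearSVC(dual=False), n_features_to_select=10).fit(X, y)

# Embedded: selection happens inside model training (l1-regularized coefficients).
embed = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
).fit(X, y)

for name, sel in [("filter", filt), ("wrapper", wrap), ("embedded", embed)]:
    print(name, "keeps", int(sel.get_support().sum()), "features")
```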
## Examples of feature extraction
+ Restricted Boltzmann Machine (RBM)
+ Visible layer vs. hidden layer
+ Visible $\rightarrow$ hidden & hidden $\rightarrow$ visible
+ Typically trained with contrastive divergence (CD)
+ **Hidden layer $\rightarrow$ new features**
+ Reference (Wiki): https://en.wikipedia.org/wiki/Restricted_Boltzmann_machine
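A minimal sketch of using RBM hidden activations as new features, with scikit-learn's `BernoulliRBM` standing in for a generic RBM (data and hyperparameters are illustrative assumptions):

```python
from sklearn.datasets import load_digits
from sklearn.neural_network import BernoulliRBM
from sklearn.preprocessing import minmax_scale

X, _ = load_digits(return_X_y=True)
X = minmax_scale(X)                              # BernoulliRBM expects values in [0, 1]

rbm = BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=10, random_state=0)
H = rbm.fit_transform(X)                         # hidden-layer activations = new features
print(H.shape)                                   # (n_samples, 64)
```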
+ Autoencoder (AE)
+ Latent code as new features
+ Many variants (with various objective functions)
+ Unsupervised (self-supervised) training
+ Can better utilize unlabeled samples
+ **Conditional variational autoencoder (CVAE)**
+ Detailed explanation: https://hackmd.io/s/Hyn2oZpdz
+ "Pre-defined" latent feature distribution
+ Easy to incorporate auxiliary information
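A minimal autoencoder sketch (PyTorch is an arbitrary choice here; the architecture and data are illustrative assumptions), showing the latent code reused as the new feature vector:

```python
import torch
import torch.nn as nn

class AE(nn.Module):
    def __init__(self, d_in, d_latent=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, d_latent))
        self.decoder = nn.Sequential(nn.Linear(d_latent, 64), nn.ReLU(), nn.Linear(64, d_in))

    def forward(self, x):
        z = self.encoder(x)                      # latent code = extracted features
        return self.decoder(z), z

X = torch.rand(500, 30)                          # unlabeled samples (illustrative)
model = AE(30)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(50):                              # unsupervised: reconstruct the input itself
    x_hat, _ = model(X)
    loss = nn.functional.mse_loss(x_hat, X)
    opt.zero_grad(); loss.backward(); opt.step()

features = model.encoder(X).detach()             # latent codes as new features
print(features.shape)                            # torch.Size([500, 16])
```

For the CVAE, training maximizes the conditional ELBO (standard formulation; the notation below is not taken from the linked note):

$$\log p_\theta(x \mid c) \;\ge\; \mathbb{E}_{q_\phi(z \mid x, c)}\big[\log p_\theta(x \mid z, c)\big] \;-\; D_{\mathrm{KL}}\big(q_\phi(z \mid x, c)\,\|\,p_\theta(z \mid c)\big)$$

where the prior $p_\theta(z \mid c)$ (often simply $\mathcal{N}(0, I)$) is the "pre-defined" latent distribution, and the condition $c$ is where the auxiliary information enters.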
+ **AIC/BIC/$C_p$**
+ Mallows' $C_p$ and Akaike's Information Criterion (AIC)
+ Bayesian Information Criterion (BIC)
+ Consider MSE and **model complexity** at the same time
+ Forward/backward (stepwise) feature selection vs. best subset selection
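For reference, the standard definitions (with $k$ fitted parameters, $n$ samples, maximized likelihood $\hat{L}$, RSS of the candidate model, and $\hat{\sigma}^2$ estimated from the full model):

$$\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L}, \qquad C_p = \frac{\mathrm{RSS}}{\hat{\sigma}^2} - n + 2k$$

And a minimal forward-selection sketch that greedily adds the feature lowering AIC the most (statsmodels is assumed; the synthetic data and stopping rule are illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=200)

selected, remaining, best_aic = [], list(range(X.shape[1])), np.inf
while remaining:
    # Try adding each remaining feature; keep the one that lowers AIC the most.
    aic, j = min(
        (sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit().aic, j)
        for j in remaining
    )
    if aic >= best_aic:
        break                                    # no improvement -> stop
    best_aic, selected = aic, selected + [j]
    remaining.remove(j)

print("selected feature indices:", selected, "AIC:", round(best_aic, 2))
```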
+ Other commonly-seen approaches
+ **Principal component analysis (PCA)**
+ SVM-RFE
+ Many filter methods
+ **LASSO (Least Absolute Shrinkage and Selection Operator)**
+ Objective function + penalty term
+ Regularizer: the $l_1$ norm
+ Compared to ridge regression, which uses the $l_2$ norm
+ The $l_1$ penalty is far more likely to shrink regression coefficients exactly to zero
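A minimal sketch of the sparsity effect: with the penalized least-squares objective $\min_\beta \tfrac{1}{2n}\lVert y - X\beta\rVert_2^2 + \lambda\lVert\beta\rVert_1$, many coefficients are driven exactly to zero, unlike ridge's $l_2$ penalty (data and regularization strengths below are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = X[:, :5] @ np.array([3.0, -2.0, 1.5, 1.0, -0.5]) + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)               # l1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)               # l2 penalty

print("LASSO zero coefficients:", int(np.sum(lasso.coef_ == 0)), "/ 50")
print("Ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)), "/ 50")
```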