# Machine Learning Share (2018/04/24)
###### tags: `shared` `technical`
[TOC]
## Curse of dimensionality
+ Think of distances between samples measured in the feature space
+ **High-dimensional features $\rightarrow$ samples far apart**
+ All samples end up (almost equally) far away from each other
+ Distances then give little clue for prediction/clustering
+ Need an efficient way to select useful/informative features
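A quick numerical sketch of this distance-concentration effect (NumPy/SciPy assumed; the sample counts and dimensions are arbitrary illustrations):

```python
# As dimensionality grows, the nearest and farthest random samples become
# almost equally far away, so distances carry little information.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(200, d))              # 200 random samples in d dimensions
    dists = pdist(X)                            # all pairwise Euclidean distances
    spread = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative spread (max-min)/min = {spread:.3f}")
```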
## Feature selection vs. feature extraction
+ Feature **selection**
+ Original features "preserved"
+ Know which ones to **keep** and which ones to **drop**
+ e.g., Recursive Feature Elimination (RFE)
+ Feature **extraction**
+ Original features not "preserved"
+ Features are **transformed** into scores or indices
+ e.g., Principal Component Analysis (PCA)
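A minimal sketch of the selection/extraction contrast above, using scikit-learn's RFE and PCA as examples (the dataset and model choices here are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)

# Selection: original features preserved; we learn which ones to keep/drop.
rfe = RFE(LinearSVC(dual=False), n_features_to_select=5).fit(X, y)
print("kept feature indices:", [i for i, keep in enumerate(rfe.support_) if keep])

# Extraction: original features not preserved; each new feature is a
# transformed score (a linear combination of all original features).
pca = PCA(n_components=5).fit(X)
print("transformed shape:", pca.transform(X).shape)
```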
## Three categories for feature selection
+ Filter methods
+ Each feature is fed into a **scoring function**
+ Often select the top-scoring features as the final feature set
+ Scoring function is **independent** of the following classifier
+ e.g., majority vote with weights being p-values, SNRs, variances, etc.
+ Wrapper methods
+ Often adopted together with a classifier
+ The scoring function is "specialized" for the following classifier
+ e.g., RFE with SVM (SVM-RFE)
+ Embedded methods
+ Feature selection is built into the classifier's training procedure
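A minimal sketch of the three categories, again with scikit-learn standing in as the toolkit (the specific scorers, models, and parameters are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter: score each feature independently of any classifier (ANOVA F-test here).
filt = SelectKBest(score_func=f_classif, k=10).fit(X, y)

# Wrapper: the scoring is specialized for the downstream classifier (SVM-RFE).
wrap = RFE(LinearSVC(dual=False), n_features_to_select=10).fit(X, y)

# Embedded: selection happens inside model training (l1-regularized coefficients).
embed = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
).fit(X, y)

for name, sel in [("filter", filt), ("wrapper", wrap), ("embedded", embed)]:
    print(name, "keeps", int(sel.get_support().sum()), "features")
```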
## Examples of feature extraction
+ Restricted Boltzmann Machine (RBM)
+ Visible layer vs. hidden layer
+ Visible $\rightarrow$ hidden & hidden $\rightarrow$ visible
+ Typically trained with contrastive divergence (CD)
+ **Hidden layer $\rightarrow$ new features**
+ Reference (Wiki): https://en.wikipedia.org/wiki/Restricted_Boltzmann_machine
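A minimal sketch of using RBM hidden activations as new features, with scikit-learn's `BernoulliRBM` standing in for a generic RBM (data and hyperparameters are illustrative assumptions):

```python
from sklearn.datasets import load_digits
from sklearn.neural_network import BernoulliRBM
from sklearn.preprocessing import minmax_scale

X, _ = load_digits(return_X_y=True)
X = minmax_scale(X)                              # BernoulliRBM expects values in [0, 1]

rbm = BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=10, random_state=0)
H = rbm.fit_transform(X)                         # hidden-layer activations = new features
print(H.shape)                                   # (n_samples, 64)
```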
+ Autoencoder (AE)
+ Latent code as new features
+ Many variants (with various objective functions)
+ Unsupervised (self-supervised) training
+ Can better utilize unlabeled samples
+ **Conditional variational autoencoder (CVAE)**
+ Detailed explanation: https://hackmd.io/s/Hyn2oZpdz
+ "Pre-defined" latent feature distribution
+ Easy to incorporate auxiliary information
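A minimal autoencoder sketch (PyTorch is an arbitrary choice here; the architecture and data are illustrative assumptions), showing the latent code reused as the new feature vector:

```python
import torch
import torch.nn as nn

class AE(nn.Module):
    def __init__(self, d_in, d_latent=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, d_latent))
        self.decoder = nn.Sequential(nn.Linear(d_latent, 64), nn.ReLU(), nn.Linear(64, d_in))

    def forward(self, x):
        z = self.encoder(x)                      # latent code = extracted features
        return self.decoder(z), z

X = torch.rand(500, 30)                          # unlabeled samples (illustrative)
model = AE(30)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(50):                              # unsupervised: reconstruct the input itself
    x_hat, _ = model(X)
    loss = nn.functional.mse_loss(x_hat, X)
    opt.zero_grad(); loss.backward(); opt.step()

features = model.encoder(X).detach()             # latent codes as new features
print(features.shape)                            # torch.Size([500, 16])
```

For the CVAE, training maximizes the conditional ELBO (standard formulation; the notation below is not taken from the linked note):

$$\log p_\theta(x \mid c) \;\ge\; \mathbb{E}_{q_\phi(z \mid x, c)}\big[\log p_\theta(x \mid z, c)\big] \;-\; D_{\mathrm{KL}}\big(q_\phi(z \mid x, c)\,\|\,p_\theta(z \mid c)\big)$$

where the prior $p_\theta(z \mid c)$ (often simply $\mathcal{N}(0, I)$) is the "pre-defined" latent distribution, and the condition $c$ is where the auxiliary information enters.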
+ **AIC/BIC/$C_p$**
+ Mallows' $C_p$ and Akaike's Information Criterion (AIC)
+ Bayesian Information Criterion (BIC)
+ Consider MSE and **model complexity** at the same time
+ Forward/backward (stepwise) feature selection vs. best subset selection
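For reference, the standard definitions (with $k$ fitted parameters, $n$ samples, maximized likelihood $\hat{L}$, RSS of the candidate model, and $\hat{\sigma}^2$ estimated from the full model):

$$\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L}, \qquad C_p = \frac{\mathrm{RSS}}{\hat{\sigma}^2} - n + 2k$$

And a minimal forward-selection sketch that greedily adds the feature lowering AIC the most (statsmodels is assumed; the synthetic data and stopping rule are illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=200)

selected, remaining, best_aic = [], list(range(X.shape[1])), np.inf
while remaining:
    # Try adding each remaining feature; keep the one that lowers AIC the most.
    aic, j = min(
        (sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit().aic, j)
        for j in remaining
    )
    if aic >= best_aic:
        break                                    # no improvement -> stop
    best_aic, selected = aic, selected + [j]
    remaining.remove(j)

print("selected feature indices:", selected, "AIC:", round(best_aic, 2))
```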
+ Other commonly-seen approaches
+ **Principal component analysis (PCA)**
+ SVM-RFE
+ Many filter methods
+ **LASSO (Least Absolute Shrinkage and Selection Operator)**
+ Objective function + penalty term
+ Regularizer: the $l_1$ norm
+ Compared to ridge regression, which uses the $l_2$ norm
+ The $l_1$ penalty is far more likely to shrink regression coefficients exactly to zero
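A minimal sketch of the sparsity effect: with the penalized least-squares objective $\min_\beta \tfrac{1}{2n}\lVert y - X\beta\rVert_2^2 + \lambda\lVert\beta\rVert_1$, many coefficients are driven exactly to zero, unlike ridge's $l_2$ penalty (data and regularization strengths below are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = X[:, :5] @ np.array([3.0, -2.0, 1.5, 1.0, -0.5]) + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)               # l1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)               # l2 penalty

print("LASSO zero coefficients:", int(np.sum(lasso.coef_ == 0)), "/ 50")
print("Ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)), "/ 50")
```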