# Machine Learning Week 4

## Tips for deep learning (continued)

* New activation function
    * Motivation: the vanishing gradient problem.
    * Solution: Rectified linear unit (ReLU), $\sigma(z)=[z>0]\,z$.
        * Gradients do not shrink as they propagate (the derivative is $1$ wherever $z>0$).
    * Maxout: $\sigma$ is a max function.
        * ReLU is a special case of Maxout.
        * Maxout can learn any piecewise linear convex activation function (see the activation sketch at the end of these notes).
* Adaptive learning rate
    * RMSProp
        * The error surface can be very complex when training a neural network.
        * ![](https://i.imgur.com/1e7KByo.png)
        * Older gradients $g$ decay exponentially in the running average.
    * Momentum
        * Movement = movement of the last step minus the gradient at present.
        * Movement is based not only on the current gradient but also on the previous movement.
        * A way to move through local minima.
    * Adam
        * RMSProp + Momentum (a sketch of the update rule appears at the end of these notes).
* Early stopping
    * Testing loss may increase when training for too many epochs.
    * Do not pick the stopping point by minimizing the testing error; use a validation set instead.
* Regularization
    * Penalize large parameters.
    * $L_2$ regularization leads to weight decay: each update shrinks the weights by a constant factor.
    * $L_1$ regularization: the magnitude decreases by a constant amount each update (sparsity effect). Both update rules are derived at the end of these notes.
    * ![](https://i.imgur.com/Smre0Be.png)
    * Weight decay is analogous to how unused connections in our brain fade away.
* Dropout
    * Each neuron is dropped with probability $p$.
    * For each mini-batch, resample which neurons are dropped.
    * No dropout at testing time; instead, multiply the weights by $1-p$.
    * Team analogy: if everyone expects their partner to do the work, nothing gets done in the end; dropout forces each neuron to work well even when its partners are missing.
    * Dropout is a kind of ensemble (a sketch appears at the end of these notes).

## Principal component analysis

* Unsupervised learning
    * Dimension reduction
    * Generation
    * Clustering
* ![](https://i.imgur.com/RZYnKjV.png)
* Hierarchical Agglomerative Clustering (HAC)
    * Build a tree by repeatedly connecting the two closest nodes.
    * Pick a threshold to cut the tree into clusters.
* Distributed representation
    * An object does not have to belong to exactly one cluster.
* Feature selection vs. PCA
* PCA
    * Rigidly rotate the axes to new positions with the following properties:
        * The earlier a principal axis, the higher its variance.
        * The covariance between principal axes is $0$.
    * The $k$-th principal component is the projection onto the $k$-th principal axis.
    * Procedure (sketched in code at the end of these notes):
        * Compute the covariance matrix $\Sigma$ of the data.
        * Diagonalize $\Sigma=U\Lambda U^T$, where $U$ is orthogonal and $\Lambda$ is diagonal with entries $\lambda_1, \lambda_2, \cdots, \lambda_M$.
        * The $k$-th principal component is $u_k^T x$.
    * Components are uncorrelated.
        * ![](https://i.imgur.com/aRWCkBl.png)
    * Reconstruction error
        * Projecting onto $S_{PCA}$ yields the minimum mean squared error among all possible $m$-dimensional subspaces.
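A minimal NumPy sketch of the two activations above. The function names, the `pieces=2` grouping, and the toy input are illustrative assumptions, not from the lecture.

```python
import numpy as np

def relu(z):
    # sigma(z) = [z > 0] * z: pass positive inputs through, zero out the rest.
    return np.maximum(0.0, z)

def maxout(z, pieces=2):
    # Maxout: split the pre-activations into groups of `pieces` linear units
    # and output the max of each group. Pairing each unit with a fixed 0
    # recovers max(0, z), so ReLU is a special case of Maxout.
    z = np.asarray(z).reshape(-1, pieces)
    return z.max(axis=1)

z = np.array([-2.0, 3.0, 0.5, -1.0])
print(relu(z))               # [0.  3.  0.5 0. ]
print(maxout(z, pieces=2))   # max over each pair: [3.  0.5]
```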
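A sketch of one parameter update combining the two ideas above (momentum plus RMSProp-style scaling), in the style of Adam. The hyperparameter values are common defaults assumed here, and the toy objective is made up for illustration.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (m) plus RMSProp-style scaling (v)."""
    m = beta1 * m + (1 - beta1) * grad           # momentum: moving average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # RMSProp: moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter scaled step
    return w, m, v

# Toy usage: minimize f(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 1001):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.05)
print(w)  # driven close to [0, 0]
```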
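The weight-decay and constant-shrinkage interpretations of $L_2$ and $L_1$ regularization follow from plain gradient descent on the regularized loss; a short derivation, assuming learning rate $\eta$ and regularization strength $\lambda$ (symbols introduced here, not in the lecture notes):

$$
\begin{aligned}
L_2:\quad & w \leftarrow w - \eta\frac{\partial L}{\partial w} - \eta\lambda w
          = (1-\eta\lambda)\,w - \eta\frac{\partial L}{\partial w}
          && \text{(shrink by a constant factor each update)}\\
L_1:\quad & w \leftarrow w - \eta\frac{\partial L}{\partial w} - \eta\lambda\,\mathrm{sgn}(w)
          && \text{(shrink by a constant amount, pushing weights toward exactly } 0\text{)}
\end{aligned}
$$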
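A sketch of the dropout rule in the notes: drop each neuron with probability $p$ during training (resampled every forward pass), and at test time keep all neurons but multiply the weights by $1-p$ so the expected pre-activation matches training. The layer sizes and batch size are made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5  # dropout probability

def forward_train(a, W):
    # Resample a fresh dropout mask for every mini-batch / forward pass.
    mask = rng.random(a.shape) >= p       # keep each neuron with probability 1 - p
    return (a * mask) @ W

def forward_test(a, W):
    # No dropout at test time; scale the weights by (1 - p) instead.
    return a @ (W * (1 - p))

a = rng.standard_normal((4, 8))    # mini-batch of 4 examples, 8 neurons
W = rng.standard_normal((8, 3))    # weights into the next layer of 3 neurons
print(forward_train(a, W).shape)   # (4, 3)
print(forward_test(a, W).shape)    # (4, 3)
```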
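A sketch of the PCA procedure above via eigendecomposition of the covariance matrix, including the projection $u_k^T x$, the uncorrelated components, and the reconstruction error. The toy data and the choice of keeping $k=2$ components are assumptions for illustration.

```python
import numpy as np

def pca(X, k):
    """Project n x d data X onto its top-k principal axes."""
    mean = X.mean(axis=0)
    Xc = X - mean                               # center the data
    Sigma = np.cov(Xc, rowvar=False)            # covariance matrix (d x d)
    eigvals, U = np.linalg.eigh(Sigma)          # Sigma = U Lambda U^T, U orthogonal
    order = np.argsort(eigvals)[::-1]           # order axes by decreasing variance
    U_k = U[:, order[:k]]                       # top-k principal axes u_1, ..., u_k
    Z = Xc @ U_k                                # k-th component of x is u_k^T x
    X_hat = Z @ U_k.T + mean                    # reconstruction from the k-dim subspace
    mse = np.mean((X - X_hat) ** 2)             # minimal among all k-dim subspaces
    return Z, U_k, mse

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))  # correlated toy data
Z, U_k, mse = pca(X, k=2)
print(Z.shape, mse)
# Components are uncorrelated: the off-diagonal covariances are ~0.
print(np.round(np.cov(Z, rowvar=False), 6))
```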