# Machine Learning Week 4
## Tips for deep learning (continued)
* New activation function
* Vanishing gradient problem
* Solution: Rectified Linear Unit (ReLU), $\sigma(z)=\max(0,z)$.
* Gradients do not get smaller as they pass through active ReLU units.
* Maxout: the activation $\sigma$ is the max over a group of pre-activations.
* ReLU is a special case of Maxout.
* Maxout can learn any piecewise-linear convex activation function.
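A minimal numpy sketch of the two activations above; the group size of 2 for maxout is an illustrative choice, not something fixed by the notes.

```python
import numpy as np

def relu(z):
    # ReLU passes positive inputs through unchanged and zeroes the rest,
    # so the gradient of an active unit is 1 and is not shrunk layer by layer.
    return np.maximum(0.0, z)

def maxout(z, group_size=2):
    # Maxout keeps the max of each group of pre-activations; if one element
    # of every group is fixed at 0, this reduces to ReLU.
    return z.reshape(-1, group_size).max(axis=1)

z = np.array([-1.0, 2.0, 0.5, -3.0])
print(relu(z))    # [0.  2.  0.5 0. ]
print(maxout(z))  # [2.  0.5]
```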
* Adaptive learning rate
* RMSProp
* The error surface can be very complex when training a neural network, so each parameter gets its own, time-varying learning rate.
* Update: $w^{t+1} = w^t - \frac{\eta}{\sigma^t} g^t$, where $\sigma^t = \sqrt{\alpha (\sigma^{t-1})^2 + (1-\alpha)(g^t)^2}$.
* Older gradients $g$ decay exponentially, so recent gradients dominate $\sigma^t$.
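A sketch of one RMSProp update following the formula above; the decay rate `alpha = 0.9` and the learning rate are illustrative defaults, and `eps` is the usual small constant for numerical stability.

```python
import numpy as np

def rmsprop_step(w, grad, sigma_sq, lr=1e-3, alpha=0.9, eps=1e-8):
    # Exponentially decaying average of squared gradients: old gradients
    # fade out, so each parameter gets its own effective learning rate.
    sigma_sq = alpha * sigma_sq + (1.0 - alpha) * grad ** 2
    w = w - lr * grad / (np.sqrt(sigma_sq) + eps)
    return w, sigma_sq
```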
* Momentum
* Movement = movement of the last step minus the gradient at the present step: $v^t = \lambda v^{t-1} - \eta g^t$, then $w^{t+1} = w^t + v^t$.
* Movement is based not only on the current gradient but also on the previous movement.
* The inertia gives a chance to roll past local minima and plateaus.
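A sketch of the momentum update, assuming the formulation $v^t = \lambda v^{t-1} - \eta g^t$ given above; the defaults are illustrative.

```python
def momentum_step(w, grad, v, lr=1e-3, lam=0.9):
    # New movement = previous movement (scaled by lam) minus the gradient
    # step, so the update keeps some inertia from earlier steps.
    v = lam * v - lr * grad
    return w + v, v
```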
* Adam
* RMSProp + Momentum.
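A sketch of Adam as the combination of the two ideas above: `m` plays the role of momentum, `v` of the RMSProp accumulator. The bias-correction step and the default hyperparameters are the standard ones, not something taken from the notes.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # m: running mean of gradients (momentum part).
    # v: running mean of squared gradients (RMSProp part).
    # t: 1-based step count, used to correct the bias toward zero at the start.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```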
* Early stopping
* Testing loss may increase when training for too many epochs.
* Do not decide when to stop by minimizing the testing error; use a validation set instead.
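A sketch of the early-stopping loop; `train_one_epoch`, `validation_loss`, and `get_weights` are hypothetical helpers standing in for whatever training framework is used.

```python
def train_with_early_stopping(model, train_one_epoch, validation_loss,
                              max_epochs=200, patience=10):
    # Track the loss on the validation set (not the test set); stop when it
    # has not improved for `patience` epochs and keep the best weights seen.
    best_loss, best_weights, waited = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        loss = validation_loss(model)
        if loss < best_loss:
            best_loss, best_weights, waited = loss, model.get_weights(), 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_weights
```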
* Regularization
* Penalize large parameters.
* $L_2$ regularization leads to weight decay: each update scales the weight by a factor slightly less than $1$.
* $L_1$ regularization: the magnitude decreases by a constant each update, which pushes many weights to exactly zero (sparsity effect).
* Weight decay is analogous to how our brain prunes little-used synaptic connections.
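A sketch of the two regularized update rules described above; `lam` is the regularization strength and the learning rate is an illustrative default.

```python
import numpy as np

def l2_step(w, grad, lr=1e-2, lam=1e-4):
    # L2 / weight decay: each update first shrinks w multiplicatively
    # toward zero, then applies the usual gradient step.
    return (1.0 - lr * lam) * w - lr * grad

def l1_step(w, grad, lr=1e-2, lam=1e-4):
    # L1: the magnitude of every nonzero weight shrinks by the constant
    # lr * lam per step, which drives many weights to exactly zero (sparsity).
    return w - lr * lam * np.sign(w) - lr * grad
```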
* Dropout
* Each neuron is dropped out with probability $p$.
* For each mini-batch, resample the dropout neurons.
* No dropout at testing time; instead, multiply each weight by $1-p$.
* Team analogy: if everyone expects their partners to do the work, nothing gets done in the end; dropout forces each neuron to do useful work even when its partners are missing.
* Dropout is a kind of ensemble: each mini-batch trains a different thinned network, and testing approximately averages them.
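A sketch of dropout for one layer's activations; the keep/scale convention follows the notes above (drop with probability $p$ during training, multiply by $1-p$ at test time), and the random seed is arbitrary.

```python
import numpy as np

def dropout(a, p=0.5, training=True, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    if training:
        # Resample a fresh mask for every mini-batch: each activation is
        # dropped (set to 0) independently with probability p.
        mask = rng.random(a.shape) >= p
        return a * mask
    # Testing: no dropout, but scale by (1 - p) so the expected input to the
    # next layer matches what it saw during training.
    return a * (1.0 - p)
```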
## Principal component analysis
* Unsupervised learning
* Dimension reduction
* Generation
* Clustering
* Hierarchical Agglomerative Clustering (HAC)
* Build a tree by repeatedly merging the two closest clusters.
* Pick a threshold at which to cut the tree; the resulting subtrees are the clusters.
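A short HAC sketch using SciPy's hierarchical-clustering helpers; the toy blobs and the distance threshold of 2.0 are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated toy blobs of 5 points each.
X = np.vstack([rng.normal(0.0, 0.3, size=(5, 2)),
               rng.normal(5.0, 0.3, size=(5, 2))])

# Build the tree by repeatedly merging the two closest clusters.
Z = linkage(X, method="average")

# Cut the tree at a distance threshold to obtain the final clusters.
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)  # e.g. [1 1 1 1 1 2 2 2 2 2]
```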
* Distributed representation
* An object does not have to belong to exactly one cluster; it can instead be represented by a vector describing how strongly it belongs to each.
* Feature selection vs. PCA
* PCA
* We want to rigidly rotate the axes to new positions with the following properties:
* Earlier principal axes have higher variance: the first principal axis has the greatest variance, the second the next greatest, and so on.
* The covariance among principal axes is $0$.
* The $k$-th principal component is the projection onto the $k$-th principal axis.
* Procedure (a numpy sketch is given at the end of this section):
* Compute the covariance matrix $\Sigma$ of the data.
* Diagonalize $\Sigma = U\Lambda U^T$, where $U$ is orthogonal and $\Lambda$ is diagonal with entries $\lambda_1, \lambda_2, \cdots, \lambda_M$.
* The $k$-th principal component is $u_k^T x$, where $u_k$ is the $k$-th column of $U$.
* Components are uncorrelated with each other.
* With $z = U^T x$, $\mathrm{Cov}(z) = U^T \Sigma U = \Lambda$ is diagonal, so different components have zero covariance.
* Reconstruction error
* Projecting onto $S_{PCA}$ (the span of the first $m$ principal axes) yields the minimum mean squared reconstruction error among all possible $m$-dimensional subspaces.
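A numpy sketch of the whole PCA procedure above on made-up correlated data: compute the covariance matrix, diagonalize it, project onto the principal axes, check that the components are uncorrelated, and reconstruct from the first $m$ axes. The toy mixing matrix and $m=2$ are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up correlated data: 100 samples in 3 dimensions.
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [0.5, 1.0, 0.0],
                                          [0.0, 0.2, 0.1]])

# Compute the covariance matrix Sigma of the (centered) data.
Xc = X - X.mean(axis=0)
Sigma = np.cov(Xc, rowvar=False)

# Diagonalize Sigma = U Lambda U^T. eigh returns eigenvalues in ascending
# order, so flip to put the highest-variance axis first.
lam, U = np.linalg.eigh(Sigma)
lam, U = lam[::-1], U[:, ::-1]

# The k-th principal component of a sample x is u_k^T x.
Z = Xc @ U

# Components are uncorrelated: Cov(Z) is (numerically) diagonal.
print(np.round(np.cov(Z, rowvar=False), 3))

# Reconstruction from the first m principal axes.
m = 2
X_hat = Z[:, :m] @ U[:, :m].T + X.mean(axis=0)
print("mean squared reconstruction error:", np.mean((X - X_hat) ** 2))
```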