# Machine Learning Week 3

## Logistic regression

### Generative vs. discriminative

* Generative
  * builds in some prior knowledge
  * needs less training data
  * more robust to noise
* Discriminative
  * usually believed to perform better
  * directly finds a dividing boundary

### Multi-class classification

1. For class $C_i$, find $w^i$, $b_i$ s.t. $z_i = w^i \cdot x + b_i$.
2. **Softmax**: $z_i \rightarrow e^{z_i} \rightarrow \dfrac{e^{z_i}}{\sum_j e^{z_j}} = y_i$. We can see that $y$ is a probability distribution.
3. Now we can calculate the cross entropy between $y$ and $\hat{y}$, where $\hat{y} = [0, 1, 0]$ if $x \in \text{class 2}$, for example.

### Limitation of logistic regression

* The boundary of logistic regression is linear, so some data cannot be separated by it.
* Use **feature transformation** to resolve this.
* How do we find a proper feature transformation? Use a **cascading logistic regression model**, which is a very simple neural network.

## Deep learning

### Neural network

* Neuron: a weighted sum of the previous outputs plus a bias, passed through a nonlinear transform.
* Different connections lead to different network structures.
* **Network parameters** $\theta$: all the weights and biases of the neurons.

### Fully connected feedforward network

* Every output is fed into every neuron of the next layer.
* A function with a vector input and a vector output.
* Given a network structure, we define **a function set**.

### Deep = many hidden layers

* AlexNet: 8 layers
* VGG: 19 layers
* GoogLeNet: 22 layers
* Residual Net: 152 layers (using residual connections)

### Matrix operation

* The weights and biases between layers can be described as matrix operations.
* The network is a function that repeats $x \rightarrow \sigma(W^1 x + b^1) \rightarrow \sigma(W^2\,\sigma(W^1 x + b^1) + b^2) \rightarrow \cdots$ (a sketch of this appears after the Backpropagation section).
* Matrix calculations can be accelerated by GPUs.

### Output layer as multi-class classifier

* Apply softmax to the output layer.
* Then proceed as we did for multi-class logistic regression.

### FAQ

* How many layers and how many neurons?
  * Trial and error + intuition.
* Can the structure be automatically determined?
  * Evolutionary artificial neural networks.
* Can we design the network structure?
  * Yes, e.g. convolutional neural networks (CNN).

### Loss

* **Total loss** $L$: the sum of the cross entropy over all training examples.
* We want to find a function in the function set, i.e. find the network parameters $\theta^*$ that minimize $L$.
* Use **gradient descent** on every $w$ and $b$.
* **Backpropagation**: an efficient way to compute $\partial L/\partial w$ in a neural network.

## Gradient descent

* A way to solve $\displaystyle\theta^*=\arg\min_{\theta}L(\theta)$.
* Update rule: $\theta^1=\theta^0-\eta\nabla L(\theta^0)$.

### Learning rate $\eta$

* Whether $\eta$ is suitable can be judged from a plot of the loss against the number of updates.
* **Adaptive learning rates**: reduce $\eta$ by some factor every few epochs, e.g. $\eta^t=\eta/\sqrt{t+1}$.

### Adagrad

* $w^{t+1}=w^t-\dfrac{\eta^t}{\sigma^t}g^t$, where $\sigma^t$ is the root mean square of the previous derivatives of $w$ (see the update-rule sketch after the Backpropagation section).
* Adagrad considers the **contrast** between gradients.
* The best step size is $\dfrac{|\text{first derivative}|}{\text{second derivative}}$.

### Stochastic gradient descent

* Uses the loss of only one example.
* Each update is faster.
* May jump over some local minima.

### Feature scaling

* Make different features have the same scale (a short standardization snippet also follows the Backpropagation section).
* Standardize every dimension: zero mean, unit variance.

### Math guarantee

* First-order Taylor expansion $\rightarrow$ each step minimizes an inner product (valid only when the step, i.e. the learning rate, is small enough).
* May get stuck at a local minimum or a saddle point.

## Backpropagation

* Uses the chain rule to compute partial derivatives.
* Forward pass for $\partial z/\partial w$ and backward pass for $\partial l/\partial z$.
* $\sigma'(z)$ is a constant in the backward pass, since $z$ is already fixed by the forward pass.
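To make the matrix-operation view and the softmax/cross-entropy output concrete, here is a minimal NumPy sketch (the function names, the toy layer sizes, and the use of $\tanh$ as $\sigma$ are my own assumptions, not from the lecture). It repeatedly applies $\sigma(Wx+b)$, puts a softmax on the output layer, and computes the cross entropy against a one-hot target.

```python
import numpy as np

def softmax(z):
    """Softmax with the max subtracted for numerical stability."""
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x, params, sigma=np.tanh):
    """Repeatedly apply x -> sigma(Wx + b); softmax on the last layer."""
    a = x
    for W, b in params[:-1]:
        a = sigma(W @ a + b)          # hidden layers: nonlinear transform
    W, b = params[-1]
    return softmax(W @ a + b)         # output layer: probability over classes

def cross_entropy(y, y_hat):
    """Cross entropy between prediction y and one-hot target y_hat."""
    return -np.sum(y_hat * np.log(y + 1e-12))

# Toy network: 3 features -> 4 hidden neurons -> 3 classes
rng = np.random.default_rng(0)
params = [(rng.standard_normal((4, 3)), np.zeros(4)),
          (rng.standard_normal((3, 4)), np.zeros(3))]

x = np.array([0.5, -1.0, 2.0])
y_hat = np.array([0.0, 1.0, 0.0])     # x belongs to class 2
y = forward(x, params)
print(y, cross_entropy(y, y_hat))
```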
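The gradient-descent and Adagrad update rules above can be illustrated on a one-dimensional toy loss (the objective $L(w)=(w-3)^2$, the step count, and $\eta$ are my own choices for illustration). Note that with $\eta^t=\eta/\sqrt{t+1}$ and $\sigma^t$ the root mean square of the past gradients, the ratio $\eta^t/\sigma^t$ simplifies to $\eta/\sqrt{\sum_i (g^i)^2}$, which is what the code uses.

```python
import numpy as np

# Toy objective: L(w) = (w - 3)^2, with gradient dL/dw = 2(w - 3).
grad = lambda w: 2 * (w - 3)

# Plain gradient descent: w <- w - eta * g
w, eta = 0.0, 0.1
for t in range(100):
    w -= eta * grad(w)
print("gradient descent:", w)

# Adagrad: w <- w - (eta^t / sigma^t) * g, where sigma^t is the root mean
# square of all past gradients; the ratio folds into eta / sqrt(sum of g^2).
w, eta, sum_sq = 0.0, 0.1, 0.0
for t in range(100):
    g = grad(w)
    sum_sq += g ** 2
    w -= eta / np.sqrt(sum_sq) * g
print("adagrad:", w)
```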
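Feature scaling by standardization can be written in one line (the toy data matrix is invented): subtract each dimension's mean and divide by its standard deviation so that every feature ends up with zero mean and unit variance.

```python
import numpy as np

# Standardize each dimension (column) of the data matrix X across examples.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0), X_std.std(axis=0))   # ~0 and 1 in every dimension
```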
## Tips for deep learning

* Do not always blame overfitting.
* Bad results on training data:
  * New activation functions
    * Vanishing gradient problem.
    * Solution: rectified linear unit (ReLU), $\sigma(z)=[z>0]\,z$.
      * Gradients do not become smaller when passing through it.
    * Maxout: $\sigma$ is a max function over a group of linear pieces.
      * ReLU is a special case of Maxout (see the sketch at the end of these notes).
      * It can learn any piecewise linear convex activation function.
  * Adaptive learning rate
* Bad results on testing data:
  * Early stopping
  * Regularization
  * Dropout
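The claim that ReLU is a special case of Maxout can be checked directly: if one linear piece is fixed at zero, taking the max over the pieces reproduces $\max(z,0)$. A small NumPy sketch (the toy weights are invented):

```python
import numpy as np

def relu(z):
    """ReLU: sigma(z) = z if z > 0 else 0."""
    return np.maximum(z, 0.0)

def maxout(x, Ws, bs):
    """Maxout: take the max over a group of linear pieces w_k . x + b_k."""
    return max(W @ x + b for W, b in zip(Ws, bs))

# ReLU is the special case where one piece is (w, b) and the other is (0, 0).
x = np.array([1.0, -2.0])
w, b = np.array([0.5, 1.0]), 0.3
pieces_W = [w, np.zeros_like(w)]
pieces_b = [b, 0.0]
print(relu(w @ x + b), maxout(x, pieces_W, pieces_b))   # both give the same output
```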