# Machine Learning Week 3
## Logistic regression
### Generative vs. discriminative
* generative
  * uses some prior knowledge
  * needs less training data
  * more robust to noise
* discriminative
  * usually believed to perform better
  * finds a dividing boundary directly from the data
### Multi-class classification
1. For class $C_i$, find $w_i$, $b_i$ s.t. $z_i=w_i\cdot x+b_i$.
2. **softmax**: $z_i\rightarrow e^{z_i}\rightarrow \frac{e^{z_i}}{\sum_j e^{z_j}} = y_i$. We can see that $y$ is a probability distribution.
3. Now we can calculate the cross entropy between $y$ and $\hat{y}$, where $\hat{y} = [0, 1, 0]$ if $x\in \text{class 2}$ for example.
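A minimal NumPy sketch of steps 2–3; the scores `z`, the three-class example, and the one-hot target are made up for illustration:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is a probability vector.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(y, y_hat):
    # y: predicted probabilities, y_hat: one-hot target; epsilon avoids log(0).
    return -np.sum(y_hat * np.log(y + 1e-12))

# Example: scores z_i = w_i . x + b_i for three classes, target is class 2.
z = np.array([1.0, 3.0, 0.5])
y = softmax(z)                  # roughly [0.11, 0.82, 0.07]
y_hat = np.array([0, 1, 0])     # one-hot label for class 2
print(y, cross_entropy(y, y_hat))
```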
### Limitation of logistic regression
* The boundary of logistic regression is linear, so some data cannot be separated by it.
* Use **feature transformation** to resolve this (see the sketch below).
* How to find a proper feature transformation? Use a **cascading logistic regression model**, which is a very simple neural network.
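A rough sketch of why a feature transformation helps, assuming XOR-like toy data. The transform here is hand-crafted for illustration (distances to $(0,0)$ and $(1,1)$); in the cascading model, a first layer of logistic units would learn some transformation with the same effect:

```python
import numpy as np

# XOR-like data: not linearly separable in the original space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

def transform(x):
    # Hand-crafted features: distance to (0, 0) and distance to (1, 1).
    return np.array([np.linalg.norm(x - [0, 0]), np.linalg.norm(x - [1, 1])])

X2 = np.array([transform(x) for x in X])
# In the new space, class-1 points have both features equal to 1 while class-0
# points have one feature 0 and one sqrt(2), so a single linear boundary
# (one logistic unit) can now separate the two classes.
print(X2)
```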
## Deep learning
### Neural network
* Neuron: a weighted sum of the previous layer's outputs plus a bias, passed through a nonlinear activation.
* Different connections lead to different network structures.
* **Network parameters** $\theta$: all the weights and biases in the neurons.
### Fully connected feedforward network
* Every output of a layer is fed into every neuron of the next layer.
* A function with a vector input and a vector output.
* Given a network structure, we define **a function set**.
### Deep = many hidden layers
* AlexNet: 8 layers
* VGG: 19 layers
* GoogLeNet: 22 layers
* Residual Net (ResNet): 152 layers (using residual connections)
### Matrix Operation
* Weights and biases between layers can be described as matrix operations.
* The network is a function that repeatedly applies $x\rightarrow\sigma(W^1x+b^1)\rightarrow\sigma(W^2\,\sigma(W^1x+b^1)+b^2)\rightarrow\cdots$.
* Matrix calculations can be accelerated by a GPU.
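A minimal sketch of this repeated matrix operation; the layer sizes, random weights, and the choice of sigmoid as the nonlinearity are illustrative assumptions:

```python
import numpy as np

def sigma(z):
    # Element-wise sigmoid nonlinearity.
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    # params is a list of (W, b) pairs; each layer computes sigma(W a + b).
    a = x
    for W, b in params:
        a = sigma(W @ a + b)
    return a

rng = np.random.default_rng(0)
# A toy network: 3 inputs -> 4 hidden units -> 2 outputs.
params = [(rng.standard_normal((4, 3)), rng.standard_normal(4)),
          (rng.standard_normal((2, 4)), rng.standard_normal(2))]
print(forward(rng.standard_normal(3), params))
```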
### Output layer as Multi-class classifier
* Apply softmax to the output layer.
* Then proceed as in multi-class classification above.
### FAQ
* How many layers and neurons?
  * Trial and error + intuition.
* Can the structure be automatically determined?
  * Evolutionary artificial neural networks.
* Can we design the network structure?
  * Yes, e.g. CNN (Convolutional Neural Network).
### Loss
* **Total loss** $L$: the sum of the cross entropy over all training examples (see the sketch after this list).
* We want to find a function in the function set i.e. find the network parameters $\theta$ that minimize $L$.
* Use **gradient descent** on every $w$ and $b$.
* **Backpropagation**: an efficient way to compute $\partial L/\partial w$ in neural network.
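A sketch of the total loss for a one-layer softmax classifier; the toy data, layer shape, and random parameters are assumptions, and a deeper network would only change the forward computation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def total_loss(theta, X, Y_hat):
    # theta = (W, b) for a one-layer softmax classifier; L is the sum of the
    # per-example cross entropies between prediction y and one-hot target y_hat.
    W, b = theta
    L = 0.0
    for x, y_hat in zip(X, Y_hat):
        y = softmax(W @ x + b)
        L += -np.sum(y_hat * np.log(y + 1e-12))
    return L

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))                 # 5 examples, 3 features
Y_hat = np.eye(2)[rng.integers(0, 2, size=5)]   # one-hot labels for 2 classes
theta = (rng.standard_normal((2, 3)), np.zeros(2))
print(total_loss(theta, X, Y_hat))
```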
## Gradient descent
* A way to solve $\displaystyle\theta^*=\arg\min_{\theta}L(\theta)$.
* $\theta^1=\theta^0-\eta\nabla L(\theta^0)$.
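A bare-bones sketch of this update rule on a made-up quadratic loss with an analytic gradient:

```python
import numpy as np

def gradient_descent(grad_L, theta0, eta=0.1, steps=100):
    # Repeatedly apply theta <- theta - eta * grad L(theta).
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - eta * grad_L(theta)
    return theta

# Toy loss L(theta) = (theta_0 - 3)^2 + (theta_1 + 1)^2 with analytic gradient.
grad_L = lambda th: np.array([2 * (th[0] - 3), 2 * (th[1] + 1)])
print(gradient_descent(grad_L, [0.0, 0.0]))   # approaches [3, -1]
```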
### Learning rate $\eta$
* Check whether it is suitable by plotting the loss against the number of parameter updates.
* **Adaptive learning rates**: reduce $\eta$ by some factor every few epochs, e.g. $\eta^t=\eta/\sqrt{t+1}$.
### Adagrad
* $w^{t+1}=w^t-\frac{\eta^t}{\sigma^t}g^t$, where $\sigma^t=\sqrt{\frac{1}{t+1}\sum_{i=0}^{t}(g^i)^2}$ is the root mean square of all previous derivatives of $w$.
* Adagrad considers the **contrast** between the current gradient and the previous gradients.
* The best step is $\frac{|\text{first derivative}|}{\text{second derivative}}$; the RMS of past first derivatives acts as a cheap estimate of the second-derivative scale.
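A minimal Adagrad sketch for a single parameter. With $\eta^t=\eta/\sqrt{t+1}$ and $\sigma^t$ the RMS of past gradients, the $\sqrt{t+1}$ factors cancel, so the code only keeps a running sum of squared gradients; the toy loss is an assumption:

```python
import numpy as np

def adagrad(grad_L, w0, eta=1.0, steps=100, eps=1e-8):
    # w^{t+1} = w^t - (eta^t / sigma^t) * g^t.
    # With eta^t = eta / sqrt(t+1) and sigma^t the RMS of all past gradients,
    # the step simplifies to eta * g / sqrt(sum of squared past gradients).
    w = float(w0)
    sum_g2 = 0.0
    for _ in range(steps):
        g = grad_L(w)
        sum_g2 += g * g
        w -= eta * g / (np.sqrt(sum_g2) + eps)
    return w

# Toy loss L(w) = (w - 5)^2 with analytic gradient.
print(adagrad(lambda w: 2 * (w - 5), w0=0.0))   # moves toward 5
```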
### Stochastic gradient descent
* Use the loss of only one example per update.
* Each update is much faster.
* May jump across some local minima.
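A sketch of SGD on a made-up linear regression problem, updating on one randomly picked example per step:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
w_true = np.array([2.0, -3.0])
y = X @ w_true + 0.1 * rng.standard_normal(100)

w = np.zeros(2)
eta = 0.05
for step in range(1000):
    i = rng.integers(len(X))              # loss of only one example
    err = X[i] @ w - y[i]
    w -= eta * err * X[i]                 # gradient of (x_i . w - y_i)^2 / 2
print(w)                                  # approximately [2, -3]
```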
### Feature scaling
* Make different features have the same scale.
* Standardize every dimension.
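A minimal standardization sketch; the toy data with very different feature scales is an assumption:

```python
import numpy as np

def standardize(X):
    # Standardize every dimension: zero mean and unit variance per feature.
    mean = X.mean(axis=0)
    std = X.std(axis=0) + 1e-12
    return (X - mean) / std

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2)) * [1.0, 100.0] + [0.0, 50.0]  # very different scales
print(standardize(X).mean(axis=0), standardize(X).std(axis=0))  # ~0 and ~1
```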
### Math guarantee
* First order Taylor expansion $\rightarrow$ minimize an inner product.
* May stuck at local min or saddle point.
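Concretely: within a small enough ball around $\theta^0$, the first order Taylor expansion gives
$$L(\theta)\approx L(\theta^0)+\nabla L(\theta^0)\cdot(\theta-\theta^0),$$
and for a fixed step length the inner product $\nabla L(\theta^0)\cdot(\theta-\theta^0)$ is minimized by choosing $(\theta-\theta^0)$ opposite to $\nabla L(\theta^0)$, which recovers the update $\theta^1=\theta^0-\eta\nabla L(\theta^0)$.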
## Back Propagation
* Using chain rule to calculate partial derivatives.
* Forward pass for $\partial z/\partial w$ and backward pass for $\partial l/\partial z$.
* $\sigma'(z)$ is a constant in the backward pass because $z$ is already fixed by the forward pass.
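A tiny worked example for a single neuron $z=wa+b$ with a squared error, showing the forward pass supplying $\partial z/\partial w$ and the backward pass supplying $\partial l/\partial z$; the numbers are made up:

```python
import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

# One neuron: z = w*a + b, output a_out = sigma(z), loss l = (a_out - t)^2.
a, w, b, t = 0.5, 2.0, -1.0, 1.0

# Forward pass: compute z; note dz/dw = a (the input) is now fixed.
z = w * a + b
a_out = sigma(z)
dz_dw = a

# Backward pass: propagate dl/dz back; sigma'(z) is a constant here
# because z was already determined in the forward pass.
dl_da = 2 * (a_out - t)
dl_dz = dl_da * sigma(z) * (1 - sigma(z))

# Chain rule: dl/dw = (dz/dw) * (dl/dz).
dl_dw = dz_dw * dl_dz
print(dl_dw)
```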
## Tips for deep learning
* Do not always blame overfitting.
* Bad results on training data:
  * New activation function
    * Vanishing gradient problem
    * Solution: Rectified Linear Unit (ReLU), $\sigma(z)=\max(0,z)$
      * Gradients are not made smaller when passing through ReLU.
    * Maxout: $\sigma$ is a max over a group of pre-activations (see the sketch after this list).
      * ReLU is a special case of Maxout.
      * Maxout can learn any piecewise linear convex activation function.
  * Adaptive learning rate
* Bad results on testing data:
  * Early stopping
  * Regularization
  * Dropout
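A small sketch of the two activation functions mentioned above; the grouping size for Maxout is an illustrative choice:

```python
import numpy as np

def relu(z):
    # ReLU: sigma(z) = max(0, z); gradients are not shrunk on the active side.
    return np.maximum(0.0, z)

def maxout(z, k=2):
    # Maxout: group pre-activations into sets of k and take the max of each group.
    # With one element of each group fixed at 0, this reduces to ReLU.
    z = np.asarray(z, dtype=float).reshape(-1, k)
    return z.max(axis=1)

z = np.array([-1.0, 0.0, 2.0, 3.0])
print(relu(z))          # [0. 0. 2. 3.]
print(maxout(z, k=2))   # max over pairs: [0. 3.]
```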