# Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization
[toc]
[NOTES LINK](https://community.deeplearning.ai/t/course-2-lecture-notes/11866)

## Bias
Error on the training data.
## Variance
Error on the test data, i.e. how much performance drops from the training data to unseen data.
We want both low bias and low variance.
## Underfitting
Squared error is high even on the training data.
Accuracy is low for both train and test data.
* High bias (variance may or may not be high; underfitting itself is a bias problem)
## Overfitting
wrt training data, predicted and actual points fits perfectly.
But for new test points model wont satisfy.
Training Data Accuracy high
Test data Accuracy low
* Low bias
* High Variance


To reduce bias: use a bigger network (more hidden layers/units).
To reduce variance: get more training data (or add regularization).

## Regularization
Used to reduce variance.
Basically you are penalizing large values of w (and, less commonly, b).
Increasing lambda pushes the weights towards smaller values.


### Why regularization reduces variance
* With smaller weights, individual hidden-layer units contribute less, so the network moves towards behaving like a simpler NN.
* Also, a high lambda means small weights, so z is small and stays in the roughly linear region of the activation function (e.g. tanh); each unit then acts almost linearly, and a nearly linear model cannot fit complicated non-linear decision boundaries.
The Frobenius norm used in the L2 penalty is

$$\lVert W^{[l]} \rVert_F^2 = \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}} \left( w_{ij}^{[l]} \right)^2,$$

and it enters the cost as the term $\frac{\lambda}{2m} \sum_{l=1}^{L} \lVert W^{[l]} \rVert_F^2$.
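A minimal NumPy sketch of how this penalty enters the cost; the `parameters` dictionary layout (`"W1"`, `"b1"`, ...) and the helper name are assumptions for illustration, not the course's own code.

```python
import numpy as np

def l2_regularized_cost(cross_entropy_cost, parameters, lambd, m):
    """Add the L2 (Frobenius-norm) penalty to the unregularized cost.

    parameters: dict holding "W1", "b1", ..., "WL", "bL" (assumed layout)
    lambd:      the regularization strength lambda
    m:          number of training examples
    """
    L = len(parameters) // 2                                  # number of layers
    penalty = sum(np.sum(np.square(parameters["W" + str(l)]))
                  for l in range(1, L + 1))
    return cross_entropy_cost + (lambd / (2 * m)) * penalty

# In back-prop, the only change is an extra (lambd / m) * W[l] term in dW[l].
```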

### Dropout Regularization

Let's say that for each of these layers, we're going to, for each node, toss a coin and have a 0.5 chance of keeping each node and a 0.5 chance of removing each node. So, after the coin tosses, maybe we'll decide to eliminate those nodes; then what you do is actually remove all the outgoing links from that node as well. So you end up with a much smaller, really much diminished network.

Why does dropout work?
The network can't rely on any one feature, so it has to spread out the weights, which effectively shrinks them (a similar effect to L2 regularization).
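A minimal sketch of the inverted-dropout implementation for one layer's activations; `a3`, `d3`, and `keep_prob = 0.8` follow the lecture's example, while the activation values themselves are random placeholders.

```python
import numpy as np

keep_prob = 0.8                                # probability of keeping a unit
a3 = np.random.randn(50, 10)                   # placeholder activations of layer 3

d3 = np.random.rand(*a3.shape) < keep_prob     # boolean mask: True = keep, False = drop
a3 = a3 * d3                                   # zero out the dropped units
a3 = a3 / keep_prob                            # scale up so E[a3] stays unchanged

# At test time, no dropout is applied and no scaling is needed.
```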
### Other regularization methods
Data augmentation, i.e. cheaply getting more training data (a minimal flip example follows this list):
* flip the image horizontally
* randomly rotate and zoom the image
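A minimal sketch of the horizontal-flip augmentation in NumPy; the image shape is an arbitrary placeholder, and random rotation/zoom would typically come from an image library rather than plain NumPy.

```python
import numpy as np

image = np.random.rand(64, 64, 3)     # dummy RGB image (height, width, channels)

flipped = image[:, ::-1, :]           # flip horizontally (reverse the width axis)

# Each flipped/rotated/zoomed copy is added to the training set as a "new" example.
```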

## Normalizing Input

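The recipe is to subtract the mean and divide by the standard deviation, reusing the training-set statistics on the test set so both are scaled the same way. A minimal sketch, assuming the course convention of X with shape (n_features, m):

```python
import numpy as np

def normalize_inputs(X_train, X_test):
    """Zero-center and scale features using training-set statistics only."""
    mu = np.mean(X_train, axis=1, keepdims=True)
    sigma = np.std(X_train, axis=1, keepdims=True) + 1e-8   # avoid divide-by-zero
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```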

## Vanishing/Exploding Gradients
In a very deep network the activations (and hence the gradients) scale roughly like the weights raised to the power of the depth:
* W[l] slightly > 1  // exploding: the signal grows exponentially with the number of layers
* W[l] slightly < 1  // vanishing: the signal shrinks exponentially with the number of layers
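A tiny numerical sketch of why this happens: if each layer scales the signal by a factor slightly above or below 1, the result after 50 layers explodes or vanishes (1.5 and 0.5 are exaggerated illustrative factors).

```python
L = 50                       # number of layers
x = 1.0                      # a scalar "activation" for illustration

print(1.5 ** L * x)          # factor > 1  -> explodes  (~6.4e8)
print(0.5 ** L * x)          # factor < 1  -> vanishes  (~8.9e-16)
```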




## Optimization Algo
### Mini-batch gradient descent
An epoch is a single pass through the whole training set.
Consider a training set of 5,000,000 examples: divide it into 1,000 mini-batches of 5,000 examples each, and take one gradient-descent step per mini-batch.
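A minimal sketch of building the mini-batches; the function name and `batch_size` default are assumptions, and X/Y follow the course convention of one example per column.

```python
import numpy as np

def make_mini_batches(X, Y, batch_size=5000, seed=0):
    """Shuffle the examples and split them into mini-batches of size batch_size."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]                                 # number of examples (columns)
    perm = rng.permutation(m)
    X, Y = X[:, perm], Y[:, perm]
    return [(X[:, k:k + batch_size], Y[:, k:k + batch_size])
            for k in range(0, m, batch_size)]

# One epoch = one pass over all mini-batches, with one gradient step per batch.
```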



### Exponentially weighted moving average
v_t = beta * v_{t-1} + (1 - beta) * theta_t, which averages roughly over the last 1/(1 - beta) days.
The larger beta is, the smoother the curve (more weight on the previous days' average than on the current day's temperature).

The weights on older values shrink geometrically, so it acts as an exponential decay function.

With beta = 0.9 we expect the green curve but actually get the purple one, because v is initialized to 0, which drags the average down over the first few days.
To fix this, bias correction is applied: use v_t / (1 - beta^t).
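A minimal sketch of the moving average v_t = beta * v_{t-1} + (1 - beta) * theta_t together with the bias correction v_t / (1 - beta^t); the temperature values are made up for illustration.

```python
beta = 0.9
temps = [2.0, 3.0, 4.0, 3.5, 5.0]               # made-up daily temperatures

v = 0.0
for t, theta in enumerate(temps, start=1):
    v = beta * v + (1 - beta) * theta           # raw average (biased low at first)
    v_corrected = v / (1 - beta ** t)           # bias correction fixes the early values
    print(t, round(v, 3), round(v_corrected, 3))
```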

### Momentum
Basic idea is to compute an exponentially weighted average of your gradients, and then use that gradient to update your weights instead.

In practice, people don't usually do bias correction because after just ten or so iterations the moving average has warmed up and is no longer a biased estimate.
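A minimal sketch of one momentum update for a single parameter matrix W; `dW` is the current mini-batch gradient, and the alpha/beta defaults are illustrative.

```python
def momentum_step(W, dW, v_dW, alpha=0.01, beta=0.9):
    """One momentum update: average the gradients, then step with the average."""
    v_dW = beta * v_dW + (1 - beta) * dW          # exponentially weighted avg of dW
    W = W - alpha * v_dW                          # update with the smoothed gradient
    return W, v_dW
```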

### RMSProp

In practice, a small epsilon (around 10^-8) is added to the denominator to avoid a divide-by-zero error.
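A minimal sketch of one RMSProp update for a single parameter matrix; the hyperparameter defaults are illustrative assumptions.

```python
import numpy as np

def rmsprop_step(W, dW, s_dW, alpha=0.001, beta2=0.999, epsilon=1e-8):
    """One RMSProp update: divide the gradient by the root of its running square."""
    s_dW = beta2 * s_dW + (1 - beta2) * dW ** 2      # moving avg of squared gradients
    W = W - alpha * dW / (np.sqrt(s_dW) + epsilon)   # damp directions with large gradients
    return W, s_dW
```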
### Adam optimization algo

Adam combines momentum and RMSProp. Tune alpha for the best result; beta1 = 0.9, beta2 = 0.999 and epsilon = 10^-8 are usually left at their default values.
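A minimal sketch of one Adam update, combining the momentum and RMSProp terms with bias correction; `t` is the iteration count starting at 1, and the defaults follow the values quoted above.

```python
import numpy as np

def adam_step(W, dW, v_dW, s_dW, t, alpha=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update for a single parameter matrix W."""
    v_dW = beta1 * v_dW + (1 - beta1) * dW          # momentum term
    s_dW = beta2 * s_dW + (1 - beta2) * dW ** 2     # RMSProp term
    v_hat = v_dW / (1 - beta1 ** t)                 # bias-corrected momentum
    s_hat = s_dW / (1 - beta2 ** t)                 # bias-corrected RMSProp
    W = W - alpha * v_hat / (np.sqrt(s_hat) + epsilon)
    return W, v_dW, s_dW
```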

### Learning Rate Decay
Slowly decrease alpha over the epochs, so that the steps get smaller as training approaches the minimum.
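A minimal sketch of the decay rule alpha = alpha_0 / (1 + decay_rate * epoch_num); alpha_0 = 0.2 and decay_rate = 1 are example values.

```python
alpha0 = 0.2
decay_rate = 1.0

for epoch_num in range(1, 5):
    alpha = alpha0 / (1 + decay_rate * epoch_num)   # 0.1, 0.0667, 0.05, 0.04
    print(epoch_num, round(alpha, 4))
```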


### Local Optima Problem


## Tuning Process
First preference: tune the learning rate alpha (red); then beta, the number of hidden units, and the mini-batch size (yellow); then the number of layers and the learning-rate decay (purple). (The colours refer to the lecture slide.)
