
IntroNN-Week 2-Building Blocks of Neural Networks.

The problem with a non-convex function is telling a local minimum from the global minimum, because the whole surface is not visible. Gradient descent can take many different paths downhill; the path is not fixed in advance.

In the left graph the slope starts negative and keeps increasing (in a multidimensional convex function, the slope always increases whichever direction you move).

In the right graph the slope first increases and then decreases (a non-convex function).


The lower the local minimum found, the better the algorithm.
Two important things:
- We don't need to know and explore 100% of the surface of a non-convex function.
- We need to find a local minimum as quickly as possible.
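
As a rough illustration of why the starting point matters on a non-convex surface, here is a minimal sketch. The function `f`, the learning rate and the two starting points are assumptions chosen for the example, not taken from the lecture.

```python
# Hypothetical 1-D non-convex function with two valleys (not from the lecture).
def f(x):
    return x**4 - 3 * x**2 + x


def grad_f(x):
    # Derivative of f with respect to x.
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x, eta=0.01, steps=2000):
    # Plain gradient descent: step against the slope at every iteration.
    for _ in range(steps):
        x = x - eta * grad_f(x)
    return x


left = gradient_descent(-2.0)   # starts on the left slope
right = gradient_descent(2.0)   # starts on the right slope

print("from -2.0:", round(left, 3), "f =", round(f(left), 3))
print("from  2.0:", round(right, 3), "f =", round(f(right), 3))
```

Both runs stop at a point where the slope is zero, but they end in different valleys, and only one of them is the lower one. The algorithm never explores the whole surface.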


Stochastic gradient descent with momentum

It is a modified update mechanism for gradient descent.


If the learning rate (eta) is too high, the steps overshoot the minimum and bounce around it. And if the learning rate is too low, we take tiny steps and the algorithm needs more time.
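
The effect of eta can be checked on a toy problem. This is a sketch that assumes the simple convex function f(x) = x^2 (gradient 2x); the three eta values are illustrative choices, not values from the course.

```python
def descend(eta, x=1.0, steps=50):
    # Gradient descent on f(x) = x^2, whose gradient is 2x.
    for _ in range(steps):
        x = x - eta * 2 * x
    return x


too_high = descend(eta=1.1)      # each step overshoots and lands farther away
too_low = descend(eta=0.001)     # tiny steps, barely moves in 50 iterations
reasonable = descend(eta=0.1)    # gets close to the minimum at x = 0

print(too_high, too_low, reasonable)
```

With the high value the distance to the minimum grows at every step, while with the low value the point has hardly moved from its start after 50 iterations.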


If we use a big ball that rolls down, we get the green path:
The ball starts from the top with enough momentum to carry it over into the next level, the second valley.

When the ball reaches the first valley, it has found what is called a:
Marginal local minimum: an okay local minimum that is a little lower than its neighbours, although the ball could reach lower values in the next valley.


This is also called an ad-hoc method, because it could work better than other methods.
The reasons for the name are:

  1. On its own it can work better than other methods, even SGD.
  2. But it will not always be the best method, because each problem is different; only in some cases does SGD with momentum do better (find a lower point) than SGD.

But you sacrifice a little convergence speed in exchange for finding another local minimum.

Now the update (the green ball) is always influenced by the previous gradient: in each iteration, the previous gradient influences the next one.
This is shown in the green equation.
n: current iteration
n-1: previous iteration
alpha: a hyperparameter to tune
g: the gradient

alpha and 1-alpha weight the previous step against the current gradient, so that the step can build up over the iterations.

Many packages give alpha a default value.
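
The green equation itself is not reproduced in these notes, so the following is only a sketch under an assumption: that the update is the exponentially weighted average v^n = alpha * v^(n-1) + (1 - alpha) * g^n, followed by w = w - eta * v^n. The test function, `eta` and `alpha` values are illustrative, and the full gradient is used instead of a stochastic mini-batch gradient to keep the example short.

```python
def sgd_momentum(grad_fn, w, eta=0.1, alpha=0.9, steps=500):
    v = 0.0  # running average of the gradients (the "velocity" of the ball)
    for _ in range(steps):
        g = grad_fn(w)                    # g^n: current gradient
        v = alpha * v + (1 - alpha) * g   # assumed form: mix previous v with g^n
        w = w - eta * v                   # move along the averaged direction
    return w


# Toy convex problem f(w) = w^2, gradient 2w (hypothetical example).
w_final = sgd_momentum(lambda w: 2 * w, w=5.0)
print(w_final)
```

Because `v` keeps part of the previous gradients, the ball keeps moving for a while even where the current gradient is small, which is what lets it roll through a shallow valley.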


In the next sections we will treat eta not as a constant but as an adaptive learning rate.

Test 2.1

1. A convex function cannot have more than one dimension?
Ans: False
Why?: A convex function can be 2D, 3D, or have more than three dimensions.

2. Does momentum help accelerate the gradient descent algorithm?
Ans: True
Why?: It accelerates the gradient descent algorithm by considering the exponentially weighted average of the gradients.

3. Which of the following is true about a convex function?
Ans: The local minimum and the global minimum are the same.

2.2 Other variants of Gradient Descent.

Adaptive Learning rate (Adagrad)

  1. You want the learning rate to become smaller and smaller.


  2. On non-linear surfaces, don't use the same learning rate over the whole surface, because in some places the surface could be flat.

n: iteration number
w: the weights, a vector
g: the gradient, a vector

eta/sqrt(S) will be different for the 1st, 2nd, 3rd component and so on, up to the 100th component.

Remember: each gradient component is the derivative of the loss function with respect to its parameter.

If n=20, we are calculating gradient number 20.

In ML we often add an epsilon (in red) for protection, a small number such as 1e-8 or 1e-16. It guards against initial situations where the denominator could be zero.

S: with each gradient added to the summation, the number becomes larger and larger, and the learning rate (red circle) becomes smaller.

  1. The factor in the red circle is different for each component of g^n, so the learning rate becomes dimension-specific.
  2. The learning rate (red circle) decreases in every iteration.
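
A minimal sketch of the Adagrad update described above, assuming the usual form S = S + g^2 (per component) and w = w - eta / (sqrt(S) + eps) * g. The two-parameter loss, `eta` and `eps` are illustrative assumptions.

```python
import math


def adagrad(grad_fn, w, eta=0.5, eps=1e-8, steps=100):
    s = [0.0] * len(w)   # S: running sum of squared gradients, one per component
    rates = []           # effective learning rate per component, per iteration
    for _ in range(steps):
        g = grad_fn(w)
        s = [si + gi * gi for si, gi in zip(s, g)]
        # eps protects against a zero denominator.
        rate = [eta / (math.sqrt(si) + eps) for si in s]
        w = [wi - ri * gi for wi, ri, gi in zip(w, rate, g)]
        rates.append(rate)
    return w, rates


# Hypothetical loss f(w) = w0^2 + 10 * w1^2, so the gradient is [2*w0, 20*w1].
w_final, rates = adagrad(lambda w: [2 * w[0], 20 * w[1]], [1.0, 1.0])
print("first rates:", rates[0])
print("last rates: ", rates[-1])
```

The first printed line shows a different rate for each component (dimension-specific), and the last line shows that both rates have shrunk as S accumulated.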


Conclusion:

  • Adagrad is useful and works well on extremely complex convex surfaces, problems where gradient descent or SGD struggle.
  • But it does not work well on extremely complex non-convex surfaces.

The adaptation of the adaptive gradient method is what is called RMSprop.

RMSprop(Root mean square propagation)

Because Adagrad's learning rate sometimes shrinks to zero, we use RMSprop. It is useful for extremely complex non-convex surfaces.

ρ: rho, the weight given to the previous value

S is called the exponential moving average of the squared gradient, or decaying average, because it acts like a moving window.

The formula combines the most recent gradient g^n with the previous gradients g^(n-1), g^(n-2), and so on.

g^n gets the largest weight, g^(n-1) a little less, g^(n-2) much less, and so on.

In Adagrad, S is the sum of (g^n)^2; see the Adagrad formula.
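
To compare the two definitions of S, here is a small sketch. It assumes the common RMSprop form S = rho * S + (1 - rho) * g^2; the value rho = 0.9 and the constant gradient are illustrative choices.

```python
def rmsprop_s(grads, rho=0.9):
    # Exponential moving average of the squared gradients (decaying average).
    s = 0.0
    for g in grads:
        s = rho * s + (1 - rho) * g * g
    return s


def adagrad_s(grads):
    # Adagrad keeps the plain sum, so it can only grow.
    return sum(g * g for g in grads)


grads = [2.0] * 200   # hypothetical: the same gradient for 200 iterations

rho = 0.9
# Weight given to g^n, g^(n-1), g^(n-2) inside the moving average.
weights = [(1 - rho) * rho**k for k in range(3)]

print("RMSprop S:", rmsprop_s(grads))
print("Adagrad S:", adagrad_s(grads))
print("weights:  ", weights)
```

The RMSprop value settles near g^2, so eta / sqrt(S) does not collapse toward zero, while the Adagrad sum keeps growing with every iteration.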


Conclusions
Adagrad: it is more aggressive.
RMSprop: it tones it down.

Adam


It dates from 2014.
It is very fast because it combines RMSprop (a fast algorithm) with SGD with momentum (which can reach a faraway local minimum).
But sometimes SGD with momentum can work better than Adam; Adam is not the final, perfect method, and it may fail to reach a lower local minimum.
But when Adam works well, it is very fast and reaches a lower local minimum.
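
The combination can be sketched as follows. This follows the widely published form of Adam (a momentum average `m`, an RMSprop-style average `v`, and bias correction), with the commonly used defaults beta1 = 0.9 and beta2 = 0.999; the toy function and `eta` are assumptions for the example.

```python
import math


def adam(grad_fn, w, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    m = 0.0  # momentum part: moving average of the gradients
    v = 0.0  # RMSprop part: moving average of the squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Bias correction: both averages start at zero and are too small early on.
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        w = w - eta * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Toy convex problem f(w) = w^2, gradient 2w (hypothetical example).
w_final = adam(lambda w: 2 * w, w=5.0)
print(w_final)
```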

Test 2.2

1. Which of the following are variants of optimizers?

  • Adagrad
  • RMSprop
  • Adam
  • All the above (Ans)

Why?: SGD with momentum, Adagrad, RMSprop, and Adam are the different variants of optimizers.

2. Does the learning rate in Adagrad decrease dynamically as the algorithm proceeds?

Ans: True

Why?: The learning rate in Adagrad decreases in every iteration during training, and it is dimension-specific.

2.3 Weight initialization and its Techniques.


You always need to initialize the weights before training can bring y^ close to y.

With random initialization the algorithm could reach a good local minimum or a bad one, so everything depends on the location of the starting point.


There is zero gradient toward (-) infinity and zero gradient toward (+) infinity. This happens when the weights are very large or very small, because the activation then gives its highest values or null values.

So you need to put your effort into the middle (red circle), because that is where the learning happens and where the weight values are well balanced.
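
A quick numerical check of this idea, assuming the curve in the figure is a sigmoid (the notes do not name the activation at this point, so that is an assumption):

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    # Derivative of the sigmoid: s * (1 - s).
    s = sigmoid(z)
    return s * (1.0 - s)


for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(z, sigmoid_grad(z))
```

The gradient is largest in the middle (0.25 at z = 0) and almost zero at both ends, so inputs pushed far out by very large weights give almost no learning signal.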


In every layer you get a linear function, and you need the activation function to operate around its middle point, which depends on the values of the weights (W's).


So it's important to take this into account!
Don't initialize with weights = 0; backpropagation doesn't work with all-zero weights.

Very large W's are a bad idea because, whether the X's are (+) or (-), the activation could give zero or values that are too large.

It's better to draw the weights from a normal distribution with mean zero and a small, non-zero variance.

W = Normal_distribution(0, 1) x 0.01 is a good choice.

With this spread, the weights land in the region (red circle) where the learning happens and succeeds.

ReLU: there is learning to the right and no learning to the left.

Xavier's paper
Xavier's paper originally proposed this formula, where n is the number of nodes in the previous layer.


Whether there are many nodes or only a few, the linear function will always start with small weights, and the values will not explode (+, right) or implode (-, left).

Researchers have since improved or re-formulated this proposal.
The current formula is a refined and relaxed version.
Remember that n is the number of nodes in the previous layer; sometimes l is used to refer to the layer.

Xavier initialization

Read more on how to use it:
https://www.tensorflow.org/api_docs/python/tf/keras/initializers
https://stackoverflow.com/questions/33640581/how-to-do-xavier-initialization-on-tensorflow

He initialization: it is for ReLU activation functions.

These are in the sample code.

Conclusion: Xavier initialization works well, and better than He initialization, but the latter sometimes works well too.
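
The exact formulas were shown as images and are not reproduced in these notes. The sketch below therefore assumes the commonly taught simplified variances, Var(W) = 1/n for Xavier and Var(W) = 2/n for He, where n is the number of nodes in the previous layer. The helper names and layer sizes are hypothetical; in practice the initializers linked above would be used instead.

```python
import math
import random


def init_weights(n_in, n_out, scheme, rng):
    # n_in: number of nodes in the previous layer (the n of the notes).
    if scheme == "xavier":
        std = math.sqrt(1.0 / n_in)   # assumed simplified Xavier variance 1/n
    elif scheme == "he":
        std = math.sqrt(2.0 / n_in)   # assumed He variance 2/n (for ReLU)
    else:
        raise ValueError("unknown scheme: " + scheme)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]


def variance(matrix):
    # Sample variance of all entries of the weight matrix.
    values = [v for row in matrix for v in row]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


rng = random.Random(0)
n_in, n_out = 256, 128
xavier_var = variance(init_weights(n_in, n_out, "xavier", rng))
he_var = variance(init_weights(n_in, n_out, "he", rng))

print("Xavier variance:", xavier_var, "target:", 1.0 / n_in)
print("He variance:    ", he_var, "target:", 2.0 / n_in)
```

In both cases the weights shrink as the previous layer gets wider, which keeps the linear function in the region where the activation still learns.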

Read more on how to initialize weights in PyTorch:
https://wandb.ai/wandb_fc/tips/reports/How-to-Initialize-Weights-in-PyTorchVmlldzoxNjcwOTg1
PyTorch was used in the MNIST image example of this week 2.

Test 2.3

1. Which of the following activation functions are used with Xavier initialization?

  • Sigmoid
  • Tanh
  • Both A and B (Ans)
  • ReLU

2. Which of the following weight initialization techniques is used for ReLU activation?

  • Xavier initialization
  • HE initialization (Ans)
  • Both A and B
  • None of these

Other topics that we'll cover:
Overfitting and Regularization

2.4 Regularization

To avoid overfitting or underfitting we use regularization, also known as a penalty.

W is a vector
The larger the W's, the larger the penalty
The smaller the W's, the smaller the penalty

Lasso: also called the least absolute shrinkage and selection operator.
Its algorithm allows the NN to be explainable. Called: L1 Regularization

Ridge: the weights can become very smooth.
Theoretically Ridge is more explainable. It is more popular to use, and more debatable. Called: L2 Regularization
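
As a small illustration of the two penalties, here is a sketch using their standard definitions: L1 adds lambda times the sum of |w|, L2 adds lambda times the sum of w^2. The weight vector and `lam` value are made up for the example.

```python
def l1_penalty(w, lam):
    # Lasso (L1): lambda * sum of absolute values of the weights.
    return lam * sum(abs(wi) for wi in w)


def l2_penalty(w, lam):
    # Ridge (L2): lambda * sum of squared weights.
    return lam * sum(wi * wi for wi in w)


w = [0.5, -2.0, 0.0]          # hypothetical weight vector
w_large = [2 * wi for wi in w]
lam = 0.1

print("L1:", l1_penalty(w, lam), "->", l1_penalty(w_large, lam))
print("L2:", l2_penalty(w, lam), "->", l2_penalty(w_large, lam))
```

The penalty is added to the loss, so larger weights are charged more. Doubling the weights doubles the L1 penalty and quadruples the L2 penalty.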

Data Augmentation

This creates a synthetic version of a dataset. It is somewhat similar to oversampling.

Say we want to train a NN to identify images of digits, for example a 7. In an extreme case the NN cannot recognize the 7 because of a dot in the left corner, which means it has overfit to our dataset.

We add different kinds of noise to the images so that the NN learns to recognize them during training: more points, rotation, scaling, cropping, shifting, mirroring, color changes, GANs.
Take care with rotation: a dog rotated by 90° is still a dog, but the digit 6 rotated by 180° becomes a 9.
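
A few of these manipulations can be sketched on a tiny image stored as a list of rows. The helper names are hypothetical; real projects would normally use the augmentation utilities of their framework.

```python
def mirror(img):
    # Horizontal flip: reverse every row.
    return [row[::-1] for row in img]


def rotate90(img):
    # Rotate the image 90 degrees clockwise.
    return [list(col) for col in zip(*img[::-1])]


def shift_right(img, k=1, fill=0):
    # Shift by k pixels (1 <= k < width), padding the left edge with `fill`.
    return [[fill] * k + row[:-k] for row in img]


img = [[1, 2, 3],
       [4, 5, 6]]

augmented = [mirror(img), rotate90(img), shift_right(img)]
for version in augmented:
    print(version)
```

Each call produces a new training example with the same label, unless the transformation changes the meaning, as with the 6 and the 9.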

GANs are used for driverless cars, together with a game engine or a virtual environment.
GAN: generative adversarial network
Read more GAN: https://www.tensorflow.org/tutorials/generative/dcgan

Data augmentation: the term means adding noise. It is not the same as oversampling, which draws more samples from a small sample to balance the classes out.

We train with different manipulations and positions of the images so that the NN learns to recognize them efficiently.
(We don't want to manipulate the images so much that even humans cannot recognize them; the idea is just to get some image manipulation.)

Conclusion: the NN needs to see noise during training; without it, the NN could overfit and fail to recognize the digit with the dot in the left corner.

2.4 Test

1. Which of the following is another name for L2 Regularization?

Lasso

Ridge (Ans)

Both A and B
Why?: L2-Regularization is also known as Ridge and L1 regularization as Lasso.

2. Is the data augmentation technique used on images?
Ans: True
Why?: Data augmentation is a regularization technique used on images that increases the amount of data by adding slightly modified copies of already existing data.

2.5 Dropout

We use dropout to train different models, with different nodes, that can give different results.
Example in class:
Let's say that in "layer 1" some nodes give higher performance than others. So we run many trials, dropping some nodes each time and forming groups of nodes. As we can see: Group 1: nodes 1-2, Group 2: nodes 1-4, Group 3: nodes 3-5. This was obtained with p=0.5.

Another use of dropout

It is useful when some input nodes tend to pass an error on to the next layers, so we drop out some nodes in the input layer as well as in the hidden layers. Gradient descent works the same way, because we still need to correct the difference between y and y^.

The following values are suitable for each layer:
p=0.5 hidden layers
p=0.8 input layer
In the output layer it is not necessary, even if we have several outputs.

Vanilla dropout or inverted dropout

How does it work?
During training we split the NN into groups of nodes with their assigned weights: (w1,w2), (w1,w4)
Then at test time we restore the whole NN and rescale the previous weights w1, w2, w3, ... by the p value of the respective layer. With vanilla dropout the weights are multiplied by p (the probability of keeping a node) at test time; with inverted dropout the kept activations are instead divided by p during training, and nothing is rescaled at test time.
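
A minimal sketch of inverted dropout on one layer's activations, assuming p is the probability of keeping a node. The function name, the fixed seed and the all-ones activations are choices made for the example.

```python
import random


def inverted_dropout(activations, p_keep, rng):
    # Training-time step: drop each node with probability 1 - p_keep and
    # divide the survivors by p_keep so the expected value stays the same.
    out = []
    for a in activations:
        if rng.random() < p_keep:
            out.append(a / p_keep)
        else:
            out.append(0.0)
    return out


rng = random.Random(0)
activations = [1.0] * 10000          # hypothetical hidden-layer outputs
dropped = inverted_dropout(activations, p_keep=0.5, rng=rng)

mean_before = sum(activations) / len(activations)
mean_after = sum(dropped) / len(dropped)
print(mean_before, mean_after)
```

Because the scaling already happens during training, the full network can be used unchanged at test time.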

Test 2.5

1. Which of the following is true about the Dropout technique?

I. Dropout is a regularization technique that reduces overfitting.

II. Dropout randomly drops the neurons according to the dropout ratio in the network during the training period.

III. Dropout is the same as the fully connected layer.

Ans: Both I and II statements are correct.

Why?: Dropout is a regularization technique used to prevent the model from overfitting. Dropout drops neurons in the hidden layer according to the dropout ratio; the default ratio is 0.5.

2. Can the Dropout technique be implemented in the Output layer?

Ans: False
Why?: No, dropout cannot be applied to the output layer, since the output layer gives the probabilities and numerical values for classification and regression problems respectively, and dropout would randomly drop neurons in the layer.

2.6 Batch Normalization.

Covariate shift:
It happens when the distribution of the inputs changes, so the weights in the layers have to change too; it is caused by the deviation of the inputs.

Batch norm rescales the outputs of the hidden layers across the entire neural network, just as the input dataset is scaled. The scaling has the following characteristics:
Scaling:

  • Mean = 0
  • Std. dev. = 1

This makes the values passed between the layers more stable.
It also acts like a regularization: it helps prevent overfitting by making the model simpler.

We randomly sample rows to create batches.

Mini-batch:
The same row gets different outcomes in different mini-batches.

You use the following parameters because in the real world you don't use mini-batches, especially at test time.

Mu_population:
Sigma_population:

These concepts are important in ML.
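
The sketch below illustrates both phases for a single feature. It assumes that the population mean and variance are estimated with a running average of the mini-batch statistics, which is one common choice; the `momentum` value and the data are illustrative, and the learnable scale and shift parameters that batch norm layers usually have are left out.

```python
import math


def batch_norm_train(batch, eps=1e-5):
    # Normalize one mini-batch to mean 0 and standard deviation 1.
    mu = sum(batch) / len(batch)
    var = sum((x - mu) ** 2 for x in batch) / len(batch)
    normalized = [(x - mu) / math.sqrt(var + eps) for x in batch]
    return normalized, mu, var


def update_running(running, batch_stat, momentum=0.9):
    # Running estimate of a population statistic (assumed scheme).
    return momentum * running + (1 - momentum) * batch_stat


def batch_norm_test(x, mu_pop, var_pop, eps=1e-5):
    # At test time there is no mini-batch, so the population values are used.
    return (x - mu_pop) / math.sqrt(var_pop + eps)


mini_batches = [[1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, 10.0]]
mu_pop, var_pop = 0.0, 1.0
for batch in mini_batches:
    normalized, mu, var = batch_norm_train(batch)
    mu_pop = update_running(mu_pop, mu)
    var_pop = update_running(var_pop, var)

print("last normalized batch:", normalized)
print("population estimates: ", mu_pop, var_pop)
print("test sample 3.0 ->", batch_norm_test(3.0, mu_pop, var_pop))
```

The value 2.0 appears in both mini-batches but is normalized to a different number in each, which is why a fixed population mean and standard deviation are needed once training is over.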

See more in the summary of Batch Normalization

Test 2.6

2.7 Types of Neural Networks

Feed Forward network

Left to right, with a feedback from output to input.

Convolutional neural network (CNN)

Used for images and computer vision.

The first node gets the pixels of one piece of an image; the next node gets an overlapping piece.
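
The overlapping pieces can be sketched as a small sliding-window operation. This is a plain-Python illustration with stride 1 and no padding (both assumptions), not how a framework implements it; the image and kernel values are made up.

```python
def conv2d_valid(image, kernel):
    # Slide the kernel over the image one pixel at a time; neighbouring
    # output nodes therefore look at overlapping pieces of the image.
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            patch_sum = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            row.append(patch_sum)
        out.append(row)
    return out


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]

feature_map = conv2d_valid(image, kernel)
print(feature_map)
```

Here the first output node sees the top-left 2x2 piece and the next one sees the piece shifted one pixel to the right, so the two pieces share a column of pixels.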

Recurrent Neural Network (RNN)

It knows about the node's value in the previous step. It captures what happened in the past, say yesterday. The term w3.z^old is called short-term memory.
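
A one-node sketch of this recurrence. Only the term w3 * z_old comes from the notes; the input weight `w1`, the bias `b`, the tanh activation and all numeric values are assumptions for the example.

```python
import math


def rnn_step(x, z_old, w1=0.5, w3=0.8, b=0.0):
    # New state = activation(current input + w3 * previous state).
    return math.tanh(w1 * x + w3 * z_old + b)


def run_rnn(xs):
    z = 0.0  # the state starts empty
    for x in xs:
        z = rnn_step(x, z)
    return z


with_event = run_rnn([1.0, 0.0, 0.0])     # something happened two steps ago
without_event = run_rnn([0.0, 0.0, 0.0])  # nothing happened
print(with_event, without_event)
```

The last two inputs are identical in both runs, yet the final states differ, because the first run still carries a fading trace of the earlier input.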


LSTM (Long Short-Term Memory neural network)

It remembers multiple past terms; it has more memory.
It has the ability to learn how much memory to keep, and which parts of the memory need to be weighted in what fashion: W3.Z^old-1, W3.Z^old-2, W3.Z^old-3
If you attach this to a node, it is called a Long Short-Term Memory neural network.

Generative Adversarial Networks

They are two different structures, both deep neural networks: two independent CNNs that work by testing each other. Each one has a job.
Example: forging an image. Think of the first structure creating an image of a dollar bill and the second network testing whether it is real or not.

The two structures are trained simultaneously.

  1. It is trained with a dataset of images.
  2. It is tested with data or images from the real world.
    If the second detects a different font or something in the corner of the bill, it tells the first that it is not convinced, and the first structure starts to correct the way it creates images.

Test 2.7

1. Which of the following are the different types of Neural Networks?

RNN
CNN
Feed Forward Neural Network
All the above (Ans)

2. For an image recognition problem, which of the following architectures of the neural network is best suited to solve the problem?

Perceptron
Convolutional Neural Network (Ans)
Multi-Layer Perceptron
LSTM

Why?: The convolutional neural network architecture is the best solution for solving any image recognition problem.