How to set learning rates:
SGD
Adam
Adadelta
Adagrad
RMSProp
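A minimal sketch of how a learning rate is passed to each of these optimizers, assuming PyTorch; the model and the specific lr values are illustrative only:

```python
import torch
import torch.nn as nn

# Hypothetical model, used only for illustration.
model = nn.Linear(10, 1)

# Each optimizer takes a learning rate (lr); the adaptive methods
# (Adam, Adadelta, Adagrad, RMSProp) rescale it per parameter.
optimizers = {
    "SGD":      torch.optim.SGD(model.parameters(), lr=1e-2),
    "Adam":     torch.optim.Adam(model.parameters(), lr=1e-3),
    "Adadelta": torch.optim.Adadelta(model.parameters(), lr=1.0),
    "Adagrad":  torch.optim.Adagrad(model.parameters(), lr=1e-2),
    "RMSProp":  torch.optim.RMSprop(model.parameters(), lr=1e-3),
}
```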
Stochastic Gradient Descent | Mini-Batch Gradient Descent |
---|---|
Takes a single point at random (stochastically) and computes the gradient for that point | Computes the gradient over mini-batches of, say, hundreds of points |
Much faster to compute per update | Not as fast per update as SGD, but still very fast since computation can be parallelized |
Computationally very cheap | Much better estimate of the true gradient |
Extremely stochastic (a bad thing) | Smoother convergence, as stochasticity is significantly reduced |
Larger learning rates can cause large, unwanted oscillations | Allows for larger learning rates without the oscillations |
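A rough sketch of the difference between the two update rules, assuming PyTorch; `sgd_step` and `minibatch_step` are hypothetical helper names, and the manual parameter update is written out only to make the gradient estimate explicit:

```python
import torch

def sgd_step(model, loss_fn, x_i, y_i, lr=1e-2):
    """One stochastic update from a single example (x_i, y_i)."""
    loss = loss_fn(model(x_i), y_i)
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad          # noisy estimate of the true gradient
            p.grad.zero_()

def minibatch_step(model, loss_fn, x_batch, y_batch, lr=1e-2):
    """One update from a mini-batch: the loss (and hence the gradient)
    is averaged over the batch, giving a smoother estimate."""
    loss = loss_fn(model(x_batch), y_batch)
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad
            p.grad.zero_()
```

In practice, a DataLoader with a batch_size handles the mini-batching and optimizer.step() replaces the manual update.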
Constrains optimization to discourage overly complex models, to avoid overfitting.
Improves generalization to unseen data.
Dropout and Early Stopping are two ways we can implement regularization.
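A brief sketch of both techniques, assuming PyTorch; the layer sizes and the `train_one_epoch` / `evaluate` callables are hypothetical placeholders:

```python
import torch.nn as nn

# Dropout: randomly zero activations during training to discourage
# co-adaptation of neurons (a form of regularization).
model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # each unit is dropped with probability 0.5
    nn.Linear(64, 10),
)

def train_with_early_stopping(model, train_one_epoch, evaluate,
                              max_epochs=100, patience=5):
    """Stop training once validation loss stops improving for `patience` epochs."""
    best_val, bad_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_loss = evaluate(model)
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break   # early stopping
    return model
```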
(image)
No relation between time steps t₀,t₁,t₂,…tₙ
ŷ=f(xₜ)
Does not take sequences into account
Relates the network's computation at a particular time instant to its prior history, via a recurrence relation.
We do this via internal memory, that is, state history.
ŷ = f(xₜ, hₜ₋₁)
RNNs have a state hₜ that is updated at each time step as a sequence is processed.
Apply the recurrence relation at every time step to process a sequence.
hₜ = f_W(xₜ, hₜ₋₁)
Where,
hₜ represents the cell state
f_W is a function parameterized by weights W
xₜ is the input
hₜ₋₁ is the previous state
(Add image showing all weights)
W_hy takes the hidden state and maps it to the output layer
W_hh takes the previous hidden state and maps it to the current hidden state
W_xh takes the input and maps it to the hidden layer
For all time steps t₀, t₁, t₂ … tₙ we compute an individual loss, and summing these gives the combined (total) loss.
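A minimal sketch of this unrolled computation, assuming PyTorch; the sizes, the weight initialization, and the `rnn_forward` helper are illustrative, and the targets are assumed to be class indices:

```python
import torch

# Hypothetical dimensions for illustration.
input_size, hidden_size, output_size = 8, 16, 4

W_xh = torch.randn(hidden_size, input_size) * 0.1   # input  -> hidden
W_hh = torch.randn(hidden_size, hidden_size) * 0.1  # hidden -> hidden
W_hy = torch.randn(output_size, hidden_size) * 0.1  # hidden -> output

def rnn_forward(xs, targets):
    """Unroll the recurrence h_t = tanh(W_xh x_t + W_hh h_{t-1}),
    emit y_hat_t = W_hy h_t, and sum the per-step losses."""
    h = torch.zeros(hidden_size)
    total_loss = 0.0
    for x_t, target_t in zip(xs, targets):
        h = torch.tanh(W_xh @ x_t + W_hh @ h)    # recurrence relation
        y_hat = W_hy @ h                          # prediction at time t
        total_loss = total_loss + torch.nn.functional.cross_entropy(
            y_hat.unsqueeze(0), target_t.unsqueeze(0))
    return total_loss                             # combined loss over all steps

xs = [torch.randn(input_size) for _ in range(5)]              # 5 time steps
targets = [torch.randint(0, output_size, ()) for _ in range(5)]
loss = rnn_forward(xs, targets)
```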
As neural networks cannot understand words directly, we need to encode/represent words as numbers; for this we use embeddings, which transform indices into vectors.
(image showing similar words are given similar vectors)
Words that are semantically similar in meaning will have similar encodings.
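A small sketch of an embedding lookup, assuming PyTorch; the toy vocabulary and embedding size are illustrative:

```python
import torch
import torch.nn as nn

# Toy vocabulary: word -> index.
vocab = {"the": 0, "food": 1, "was": 2, "good": 3, "bad": 4, "not": 5, "at": 6, "all": 7}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=16)

# Indices in, dense vectors out; after training, semantically similar
# words end up with similar vectors.
indices = torch.tensor([vocab[w] for w in ["the", "food", "was", "good"]])
vectors = embedding(indices)      # shape: (4, 16)
```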
We also need to handle variable sequence lengths in language processing; a simple feedforward neural net will fail at the job, since the number of input neurons is fixed in advance.
The food was good, not bad at all.
The food was bad, not good at all.
(image showing rnn gradient flow)
Exploding gradients
Overflow => multiplying a series of large numbers makes the result blow up toward infinity.
Vanishing gradients
Underflow => multiplying a series of small numbers (each less than 1) makes the result tend toward zero.
Vanishing gradients make it difficult to know which direction the parameters should move to improve the cost function while exploding gradients make learning unstable.
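A tiny numeric sketch of why repeated multiplication across ~100 time steps overflows or underflows:

```python
# Repeatedly multiplying gradient factors across many time steps:
# factors > 1 explode, factors < 1 vanish.
steps = 100
exploding = 1.5 ** steps   # ~4e17   -> gradients blow up
vanishing = 0.9 ** steps   # ~2.7e-5 -> gradients shrink toward zero
print(exploding, vanishing)
```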
How to deal with these?
For exploding gradients, make use of gradient clipping: define a threshold and clip (scale down) gradient values that exceed it.
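For example, in PyTorch this can be done with `clip_grad_norm_`; the model, data, and threshold below are illustrative placeholders:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Rescale gradients so their total norm never exceeds the threshold.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```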
For vanishing gradients, we have three solutions
Gates: to selectively add or remove information within each recurrent unit; in other words, to optionally let information through the cell. Gates interact with each other to control the flow of information.
For LSTMs we have 4 gates:
Backpropagation Through Time (BPTT) becomes much more stable when using gates, as there are fewer repeated matrix multiplications; this mitigates the vanishing-gradient issue in LSTMs.
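A minimal sketch of using a gated recurrent layer, assuming PyTorch's `nn.LSTM`; the sizes are illustrative:

```python
import torch
import torch.nn as nn

# A gated recurrent layer (LSTM).
lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)

x = torch.randn(4, 10, 16)       # (batch, time steps, features)
output, (h_n, c_n) = lstm(x)     # output: (4, 10, 32)
# h_n is the final hidden state; c_n is the cell state maintained by the gates.
```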
To improve upon these shortcomings, we need to eliminate recurrence.
(image showing input stream, feature vector and output stream; that is without any notion of sequence)
One way to do this is by feeding everything into a dense network (spoiler alert: this is a bad idea).
The idea of attention as used in Transformers (self-attention) was introduced back in 2017, in the paper "Attention Is All You Need".
The main motive is to identify and attend to the parts of the input that are important; for this we make use of position-aware embeddings.
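A minimal sketch of the core attention computation (scaled dot-product attention); the function name and shapes are illustrative, and position-aware embeddings are assumed to have already been added to the inputs:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    """Weigh the values by how well each query matches each key."""
    d_k = query.size(-1)
    scores = query @ key.transpose(-2, -1) / d_k ** 0.5   # similarity scores
    weights = F.softmax(scores, dim=-1)                   # attention weights
    return weights @ value                                 # weighted sum of values

# Illustrative shapes: a sequence of 10 tokens with 16-dim embeddings.
x = torch.randn(10, 16)
out = scaled_dot_product_attention(x, x, x)   # self-attention: Q = K = V = x
```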
Every time we downscale our features, the filters get to attend to a larger region of the input space. As we go progressively deeper into the network, downscaling at each step, the features become capable of attending to the original image, which was originally very large (10⁶ pixels).
Our images and real-world data are highly non-linear; to capture the variance in the data accurately, we make use of ReLU, which replaces all negative values with zero.
Applied after each convolution operation.
To reduce dimensionality and make further computation easier.
To make the model scalable while still preserving spatial invariance and spatial structure.
As the filters slide over the image, the central pixels get scanned more times than the pixels at the edges; this can cause the details at the centre of the image to get more weight than the ones at the sides.
To avoid this, we pad our image matrix with, as the name (zero padding) suggests, zeroes.
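Putting the last few ideas together, a sketch of one convolution / ReLU / pooling block with zero padding, assuming PyTorch; the channel counts and image size are illustrative:

```python
import torch
import torch.nn as nn

# One convolutional block: convolution (with zero padding so edge pixels
# are covered as often as central ones), ReLU non-linearity, then max
# pooling to downsample while preserving spatial structure.
block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),  # zero padding
    nn.ReLU(),                      # replace negative activations with zero
    nn.MaxPool2d(kernel_size=2),    # halve the spatial dimensions
)

x = torch.randn(1, 3, 32, 32)       # a single 32x32 RGB image
y = block(x)                        # shape: (1, 16, 16, 16)
```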
Advanced things to do with CNNs include detecting objects and even predicting a bounding box for each object.
Region proposals are used to localize objects within an image.
Using Fast R-CNN (Regions with Convolutional Neural Network features).
Predict a class for every single pixel; this results in a very high-dimensional output space (one classification for each pixel).
The first part of the model is feature extraction; the second part is an upscaling operation, which is roughly the inverse of the encoding of the image.
Downsampling and upsampling operations.
Upsampling is done with the help of transpose convolutions, which are able to increase spatial dimensions.
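A minimal sketch of upsampling with a transpose convolution, assuming PyTorch; the channel counts and feature-map size are illustrative:

```python
import torch
import torch.nn as nn

# A transpose convolution increases the spatial dimensions, roughly
# inverting a strided convolution.
upsample = nn.ConvTranspose2d(in_channels=16, out_channels=8,
                              kernel_size=2, stride=2)

feat = torch.randn(1, 16, 8, 8)     # low-resolution feature map
out = upsample(feat)                # shape: (1, 8, 16, 16) -- doubled spatially
```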