# Optimizers
###### tags: `Data Science`, `Notes`
## Commonly Used Optimizers
### Batch Gradient Descent
* Equation
\begin{align}&w^{(t)}=w^{(t-1)}-\eta\nabla_wL(w)|_{w^{(t-1)}}\end{align}where
    * $\eta$ is the learning rate (default in TensorFlow/Keras $0.01$).
* Properties
    * Slower to converge (only one parameter update per epoch, as in the sketch below) compared to stochastic gradient descent and mini-batch gradient descent.
    * More accurate (for a convex loss function) compared to stochastic gradient descent and mini-batch gradient descent, since the full-dataset gradient is noise-free.
    * Vulnerable to __local minima__.
    * Sensitive to the learning rate $\eta$.
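A minimal NumPy sketch of the batch update above, on a hypothetical least-squares problem (the data, learning rate, and epoch count are illustrative choices, not from these notes):

```python
import numpy as np

# Hypothetical toy data for a least-squares loss L(w) = mean((Xw - y)^2).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

eta = 0.01        # learning rate
w = np.zeros(3)   # initial parameters

for epoch in range(500):
    # Gradient of the loss evaluated on the FULL dataset.
    grad = 2.0 * X.T @ (X @ w - y) / len(y)
    w = w - eta * grad   # exactly one update per epoch
```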
### Stochastic Gradient Descent
* Equation
\begin{align}&w^{(t)}=w^{(t-1)}-\eta\nabla_wL(w;x_i, y_i)|_{w^{(t-1)}}\end{align}where
    * $(x_i, y_i)$ is a single training sample; the index advances by one per update, i.e. $i^{(t+1)}=i^{(t)}+1$.
    * $\eta$ is the learning rate (default in TensorFlow/Keras $0.01$).
* Properties
    * Updates the parameters using a single sample, so the parameters are updated $n$ times per epoch (see the sketch below).
    * Faster compared to batch gradient descent and mini-batch gradient descent.
    * Less accurate (for a convex loss function) compared to batch gradient descent and mini-batch gradient descent, because single-sample gradients are noisy.
    * Vulnerable to __local minima__.
    * Sensitive to the learning rate $\eta$.
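The same toy least-squares setup as above, but updating on one sample at a time (again an illustrative sketch, not a reference implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

eta = 0.01
w = np.zeros(3)

for epoch in range(50):
    for i in rng.permutation(len(y)):           # visit samples in random order
        grad = 2.0 * X[i] * (X[i] @ w - y[i])   # gradient of the loss at (x_i, y_i) only
        w = w - eta * grad                      # n updates per epoch
```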
### Mini-batch Gradient Descent
* Equation
\begin{align}&w^{(t)}=w^{(t-1)}-\eta\nabla_wL(w;x_{i:i+b}, y_{i:i+b})|_{w^{(t-1)}}\end{align}where
    * $b$ is the __batch size__; the index advances by $b$ per update, i.e. $i^{(t+1)}=i^{(t)}+b$.
    * $\eta$ is the learning rate (default in TensorFlow/Keras $0.01$).
* Properties
    * Updates the parameters using a subset of samples, so the parameters are updated $\lceil\frac{n}{b}\rceil$ times per epoch (see the sketch below). The mini-batch idea is also used by the other optimizers below.
    * Slower compared to stochastic gradient descent, but faster than batch gradient descent.
    * More accurate (for a convex loss function) compared to stochastic gradient descent, but less accurate than batch gradient descent.
    * Vulnerable to __local minima__.
    * Sensitive to the learning rate $\eta$.
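The same toy setup again, now updating on mini-batches of size $b$ (the batch size 16 is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

eta, b = 0.01, 16
w = np.zeros(3)

for epoch in range(100):
    idx = rng.permutation(len(y))
    for start in range(0, len(y), b):                # ceil(n / b) updates per epoch
        rows = idx[start:start + b]
        Xb, yb = X[rows], y[rows]
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)  # gradient over the mini-batch only
        w = w - eta * grad
```

In TensorFlow/Keras this corresponds to the `batch_size` argument of `model.fit`.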
### Adaptive Gradient Descent (AdaGrad)
* Equation
\begin{align}w^{(t)}&=w^{(t-1)}-\frac{\eta}{\sqrt{\sum_{\tau=1}^{t}g^{(\tau)2}}+\epsilon} g^{(t)}\\
g^{(t)}&=\nabla_wL(w)|_{w^{(t-1)}}
\end{align}where
    * $\epsilon$ is a small constant that keeps the denominator from being zero (default in TensorFlow/Keras $10^{-7}$).
    * $\eta$ is the learning rate (default in TensorFlow/Keras $0.001$).
* Properties
    * The magnitude of the update shrinks in later iterations, because the accumulated sum of squared gradients in the denominator only grows (see the step sketch below).
    * __Least__ sensitive to the learning rate $\eta$, since the accumulated denominator adapts the effective step size per parameter.
    * Updates become __very inefficient__ (effectively stall) in later iterations.
    * Very vulnerable to __local minima__.
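A sketch of a single AdaGrad step written as a plain function; `adagrad_step` and its explicit state handling are just one way to express the update above, with the quoted TensorFlow/Keras defaults as default arguments:

```python
import numpy as np

def adagrad_step(w, grad, accum, eta=0.001, eps=1e-7):
    """One AdaGrad update; `accum` is the running sum of squared gradients."""
    accum = accum + grad ** 2                     # the denominator only ever grows...
    w = w - eta / (np.sqrt(accum) + eps) * grad   # ...so later updates keep shrinking
    return w, accum

# Usage: carry `accum` (initialised to zeros) across iterations.
w, accum = np.zeros(3), np.zeros(3)
w, accum = adagrad_step(w, np.array([0.5, -1.0, 0.2]), accum)  # hypothetical gradient
```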
### RMSprop
* Equation
\begin{align}w^{(t)}&=w^{(t-1)}-\frac{\eta}{\sqrt{E[g^{(t)2}]} + \epsilon}g^{(t)}\\
g^{(t)}&=\nabla_wL(w)|_{w^{(t-1)}}\\
E[g^{(t)2}]&=\beta E[g^{(t-1)2}]+(1-\beta)g^{(t)2}
\end{align}where
    * $\epsilon$ is a small constant that keeps the denominator from being zero (default in TensorFlow/Keras $10^{-7}$).
    * $\beta$ controls how much weight is given to the previously accumulated squared gradients in the moving average $E[g^2]$ (default in TensorFlow/Keras $0.9$).
    * $\eta$ is the learning rate (default in TensorFlow/Keras $0.001$).
* Properties
    * The magnitude of the update usually shrinks in later iterations. Unlike AdaGrad, RMSprop normalizes by a decaying average of past squared gradients rather than their full sum, so larger updates remain possible when the current gradient is large in later iterations (which makes it suitable for the loss functions of __recurrent neural networks__); a step sketch follows this list.
    * __Slightly less__ vulnerable to local minima.
    * __Slightly less__ sensitive to the learning rate $\eta$.
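A corresponding sketch of one RMSprop step; the only change from the AdaGrad sketch is that the sum of squared gradients becomes a decaying average (in `tf.keras.optimizers.RMSprop` the $\beta$ here is exposed as the `rho` argument):

```python
import numpy as np

def rmsprop_step(w, grad, avg_sq, eta=0.001, beta=0.9, eps=1e-7):
    """One RMSprop update; `avg_sq` is the moving average E[g^2]."""
    avg_sq = beta * avg_sq + (1 - beta) * grad ** 2   # decaying average, not a full sum
    w = w - eta / (np.sqrt(avg_sq) + eps) * grad
    return w, avg_sq
```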
### Momentum
* Equation
\begin{align}w^{(t)}&=w^{(t-1)}-v^{(t)}\\
v^{(t)}&=\gamma v^{(t-1)}+\eta\nabla_wL(w)|_{w^{(t-1)}}
\end{align}where
    * $\gamma$ is the momentum coefficient that controls how much of the previous velocity $v^{(t-1)}$ (i.e. of the previously computed gradients) is carried over.
* Properties
    * Implemented via the `momentum` argument of `SGD` in TensorFlow/Keras (snippet below).
    * Takes previous gradients into account when updating the parameters.
    * __Less__ vulnerable to local minima.
    * Sensitive to the learning rate $\eta$.
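A sketch of one momentum step, plus how the same thing is requested in TensorFlow/Keras (the value $0.9$ is a common choice, not a requirement):

```python
import numpy as np
import tensorflow as tf

def momentum_step(w, grad, v, eta=0.01, gamma=0.9):
    """One momentum update; `v` is the velocity carried between iterations."""
    v = gamma * v + eta * grad
    return w - v, v

# In TensorFlow/Keras, momentum is the `momentum` argument of SGD (default 0.0).
opt = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
```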
### Nesterov Accelerated Gradient
* Equation
\begin{align}w^{(t)}&=v^{(t)}-\eta\nabla_vL(v)|_{v^{(t)}}\\
v^{(t)}&=w^{(t-1)}+\gamma(w^{(t-1)}-w^{(t-2)})
\end{align}where
    * $v^{(t)}$ is the look-ahead point and $\gamma$ is the momentum coefficient, as in momentum.
* Properties
    * Typically reaches convergence faster than momentum (Keras usage shown below).
    * Takes previous gradients into account, evaluating the gradient at the look-ahead point where the momentum step would land.
    * __Less__ vulnerable to local minima.
    * Sensitive to the learning rate $\eta$.
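In TensorFlow/Keras, Nesterov momentum is a flag on `SGD`; the hyperparameter values below are illustrative:

```python
import tensorflow as tf

opt = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
```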
### Adam
* Equation
\begin{align}
w^{(t)}&=w^{(t-1)}-\frac{\eta}{\sqrt{\hat{v}^{(t)}} + \epsilon}\hat{m}^{(t)}\\
m^{(t)}&=\beta_1m^{(t-1)}+(1-\beta_1)g^{(t)}\\
v^{(t)}&=\beta_2v^{(t-1)}+(1-\beta_2)g^{(t)2}\\
g^{(t)}&=\nabla_wL(w)|_{w^{(t-1)}}\\
\hat{m}^{(t)}&=\frac{m^{(t)}}{1-\beta_1^t}\\
\hat{v}^{(t)}&=\frac{v^{(t)}}{1-\beta_2^t}\\
\end{align}
where
    * $\epsilon$ is a small constant that keeps the denominator from being zero (default in TensorFlow/Keras $10^{-7}$).
    * $\beta_1$ controls how much weight is given to the previously computed gradients (both direction and magnitude) via the first-moment estimate $m$ (default in TensorFlow/Keras $0.9$).
    * $\beta_2$ controls how much weight is given to the previously accumulated squared gradients via the second-moment estimate $v$ (default in TensorFlow/Keras $0.999$).
    * $\eta$ is the learning rate (default in TensorFlow/Keras $0.001$).
* Properties
    * Unlike AdaGrad and RMSprop, Adam also tracks the direction of past gradients through $m$, which allows larger updates when the gradient direction starts to change (making it suitable for the loss functions of __recurrent neural networks__).
    * $m$ plays the role of momentum, while $v$ plays the role of RMSprop's moving average (see the step function below).
    * __Less__ vulnerable to local minima.
    * __Less__ sensitive to the learning rate $\eta$.
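A sketch of one Adam step with the bias correction made explicit; the function signature is an illustrative choice, with the quoted TensorFlow/Keras defaults as default arguments:

```python
import numpy as np

def adam_step(w, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update at iteration t (t starts at 1 for the bias correction)."""
    m = beta1 * m + (1 - beta1) * grad         # first moment: momentum-like
    v = beta2 * v + (1 - beta2) * grad ** 2    # second moment: RMSprop-like
    m_hat = m / (1 - beta1 ** t)               # correct the bias from zero initialisation
    v_hat = v / (1 - beta2 ** t)
    return w - eta / (np.sqrt(v_hat) + eps) * m_hat, m, v
```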
### Nadam
* Equation
\begin{align}
w^{(t)}&=w^{(t-1)}-\frac{\eta}{\sqrt{\hat{v}^{(t)}} + \epsilon}(\beta_1\hat{m}^{(t)}+(1-\beta_1)\hat{g}^{(t)})\\
m^{(t)}&=\beta_1m^{(t-1)}+(1-\beta_1)g^{(t)}\\
v^{(t)}&=\beta_2v^{(t-1)}+(1-\beta_2)g^{(t)2}\\
g^{(t)}&=\nabla_wL(w)|_{w^{(t-1)}}\\
\hat{g}^{(t)}&=\frac{g^{(t)}}{1-\beta_1^t}\\
\hat{m}^{(t)}&=\frac{m^{(t)}}{1-\beta_1^t}\\
\hat{v}^{(t)}&=\frac{v^{(t)}}{1-\beta_2^t}\\
\end{align}
where
    * $\epsilon$ is a small constant that keeps the denominator from being zero (default in TensorFlow/Keras $10^{-7}$).
    * $\beta_1$ controls how much weight is given to the previously computed gradients (both direction and magnitude) via the first-moment estimate $m$ (default in TensorFlow/Keras $0.9$).
    * $\beta_2$ controls how much weight is given to the previously accumulated squared gradients via the second-moment estimate $v$ (default in TensorFlow/Keras $0.999$).
    * $\eta$ is the learning rate (default in TensorFlow/Keras $0.001$).
* Properties
    * Performs similarly to Adam, and is sometimes reported to converge faster (Keras usage shown below).
    * $m$ plays the role of momentum, while $v$ plays the role of RMSprop's moving average.
    * __Less__ vulnerable to local minima.
    * __Less__ sensitive to the learning rate $\eta$.
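In TensorFlow/Keras, Nadam is available directly; the arguments below are simply the defaults quoted above:

```python
import tensorflow as tf

opt = tf.keras.optimizers.Nadam(
    learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7
)
```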
## Batch and Epoch
* Batch: the subset of samples used for one parameter update.
* Epoch: one full pass in which every sample in the dataset has been used to update the parameters. For each epoch (a small example follows this list):
    * Batch gradient descent updates the parameters once.
    * Stochastic gradient descent updates the parameters $n$ times.
    * Mini-batch gradient descent updates the parameters $\lceil\frac{n}{b}\rceil$ times.
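A quick check of the updates-per-epoch count for mini-batch gradient descent (dataset size and batch size are hypothetical):

```python
import math

n, b = 1000, 32                        # hypothetical dataset size and batch size
updates_per_epoch = math.ceil(n / b)   # 31 full batches + 1 partial batch
print(updates_per_epoch)               # -> 32
```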