Notes on convexity and gradient descent

# Notes on convexity and gradient descent #### Author: [Sharath](https://sharathraparthy.github.io/) ## What is a convex function? Let's try to define a convex function formally and geometrically. Formally, a function $f$ is said to be a convex function if the domain of $f$ is a convex set and if it satisfies the following $\forall x \ \text{and} \ y \in \text{dom} f$; \begin{equation} f(\theta x + (1 - \theta)y) \leq \theta f(x) + (1 - \theta)f(y) \end{equation} Geometrically it means that the value of a function at the convex combination of two points of the function always lies below the convex combination of the values at the corresponding points. It means that if we draw a line at any two points $(x, y) \in \text{dom} f$, then this line/chord always lies above the function $f$. ### First order and second order conditions for convexity We can extend this definition to higher order derivatives as well if the function $f$ is "higher-order" differentiable. We have two such definitions; #### First order condition: For a first order differentiable function $f$, where the derivative exists for all point in the domain of $f$, we say the $f$ is convex iff \begin{equation} f(y) \geq f(x) + \nabla f(x)^{T}(y x) \end{equation} There is a very interesting interpretation of this condition which is nicely explained in boyd's convex optimization textbook. If we have a closer look at the right hand side of the inequality, we can quickly recognize that it is a first order taylor expansion of the function $f$ near $x$. So, this condition says that, if the first order taylor expansion is always a *global under-estimator* of the function, then the function is convex. Another interesting view of this condition is that, just by using the local information provided by the derivative of the function at $x$, we can derive a global under-estimator of it. This can be used to further prove that the local minima is a global minima for convex function. #### Second order condition: If the function $f$ is twice differentiable then we say the the function is convex if the hessian is positive semi-definite. $$ \nabla^2 f(x) \geq 0 $$ ## Smoothness A smooth function is a differentiable function with Lipschitz gradients. This means that for two points $(x, y) \in \text{dom} \ f$, the norm of the gradients at these two points is upper bounded by the norm between the two points. $$ \forall x, y \ \ \ ||\nabla f(x) - \nabla f(y)||_2 \leq L ||x - y||_2 $$ where $L$ is a Lipschitz constant. Roughly it means that around each two points, the gradients should not vary too much. This is not true if the function has kinks. For a twice differentiable function, we have the following condition for smoothness; $$ \nabla^2f(x) \leq \mu I_d $$ ## Strong convexity and strong smoothness ### Strong convexity This intuitively means, if we draw a tangent plane, the function $f$ always lies above the tangent plane for all the points in the domain of the function $f$. Formally, we can say a function is strongly convex if \begin{equation} f(y) \geq f(x) + \nabla f(x)^{T}(y - x) + \underbrace{\frac{\mu}{2} || y - x||^2_2}_{\text{new term}} \end{equation} where $\alpha$ is a convexity coefficient. In some sense, in a simply convex function case, the gradient can be flat (in the sense, there may be some points where the gradient doesn't change at all). But in case of string convex function, the gradient strictly changes at every point. This picture can explain the intuition. ### Strong smoothness In case of a strongly smooth function, the tangent plane lies always below the function.Here we can flip the inequality of the condition for strong convexity; \begin{equation} f(y) \leq f(x) + \nabla f(x)^{T}(y - x) + \underbrace{\frac{\mu}{2} || y - x||^2_2}_{\text{new term}} \end{equation} We can prove this by simply writing the taylor expansion of $f$ around $x$. This gives, $$ f(y) = f(x) + \nabla f(x)^{T}(y - x) + \frac{1}{2} (y - x)^{T}\nabla^2f(x)(y - x) + .... $$ But from the definition of smoothness we know that $\nabla^2f(x) \leq \mu I_d$. And hence we the we can upper bound the taylor series by \begin{equation} \begin{split} f(y) &\leq f(x) + \nabla f(x)^{T}(y - x) + \frac{1}{2}(y - x)^{T} \mu I_d (y - x)\\ &\leq f(x) + \nabla f(x)^{T}(y - x) + \frac{\mu}{2}||y - x||_{2}^{2} \end{split} \end{equation} ## Gradient descent method In optimization, we are interested searching for the in minima of the function by taking some baby steps towards that direction proportional to the gradient or the hessian. In case of convex functions the local minima is the global minima. One of the intuitive ways to achieve this is to move along the negative gradient of the function $-\nabla f(x)$. The update rule is given by $$ \theta_{t+1} = \theta_{t} - \eta \nabla f(\theta_t) $$ where $\eta$ is the step-size. The resulting algorithm is called the "*gradient descent*" and is widely used in machine learning because of the strong convergence properties. Let's have a closer look at the convergence results for two cases: 1. Strong smoothness case 1. Strong convexity case ### Gradient descent lemma for strong smooth functions Here let's try to show that for a strong smooth function $f$, the gradient descent converges. Since the smoothness condition is true for all $x, y$, we can apply the same condition for particular $x$ and $y$. Specifically let $y = \theta_{t+1}$ and $x = \theta_t$ . Note that, we can modify the GD update to get, $\theta_{t+1} - \theta_{t} = - \eta \nabla f(\theta_t)$. By plugging these, we get \begin{equation} \begin{split} f(\theta_{t+1}) &\leq f(\theta_t) + \nabla f(\theta_t)^{T}(\theta_{t+1} - \theta_t) + \frac{\mu}{2}||\theta_{t+1} - \theta_t||_{2}^{2} \\ &\leq f(\theta_t) + \nabla f(\theta_t)^{T} (- \eta \nabla f(\theta_t)) + \frac{\mu}{2} \eta^2 ||\nabla f(\theta_t)||_2^2\\ &\leq f(\theta_t) \underbrace{- \eta ||\nabla f(\theta_t)^{T}||_2^{2}}_{\text{term 1}} + \underbrace{\frac{\mu}{2} \eta^2 || \nabla f(\theta_t)||_2^2}_{\text{term 2}} \end{split} \end{equation} We can see that the "term 1" is linear in step size and the "term 2" is quadratic in step size. For small values of step size, the linear term dominates the quadratic term making the net still negative. But this is not true for the large step size values. So there should be a notion of optimal step size to comment about the convergence in this case. So let's try to find that. This is fairly simple as the right hand side is a quadratic function of $\eta$. So, the optimal values can be found just by equating the gradient wrt $\eta$ to zero. And the optimal value is $\eta^\star = \frac{1}{\mu}$. So for this optimal step size, we get; \begin{equation} \begin{split} f(\theta_{t+1}) &\leq f(\theta_t) - \frac{1}{2 \mu} ||\nabla f(\theta_t)^{T}||_2^{2} \end{split} \end{equation} We can also say that this holds for all $\eta \leq \frac{1}{L}$; \begin{equation} \begin{split} f(\theta_{t+1}) &\leq f(\theta_t) - \frac{\eta}{2} ||\nabla f(\theta_t)^{T}||_2^{2} \end{split} \end{equation} #### What is so beautiful about this lemma? This lemma talks only about the smooth functions and not the convex functions and guarantees the convergence. So, this lemmma is at the heart of the proof when we talk about the non-convex functions which is often the case with DL models. ### Gradient descent lemma for strong convex functions Now let's try to prove the convergence of GD for the strongly convex case. Recall that, for a differentiable function $f$, the string convexity condition is given by $$ f(y) \geq f(x) + \nabla f(x)^{T}(y - x) + \frac{\mu}{2} || y - x||^2_2 $$ This inequality is used to bound $f(x) - f^\star$ where $p*$ is the optimal point. Let's try to find $y$ which minimizes the right hand side. This can be found by equating the gradient wrt y to zero. We get, $\tilde{y} = \frac{1}{\mu}\nabla f(x)$. By substituting this, we get; $$ f(y) \geq f(x) - \frac{1}{2\mu} ||\nabla f(x)||_2^{2} $$ This holds for any $y$, so we have; $$ f^\star \geq f(x) - \frac{1}{2\mu} ||\nabla f(x)||_2^{2} $$ The intuition of this inequality is that if the gradient is small then we are close to the optimum. Now from the above lemma we know that; \begin{equation} \begin{split} f(\theta_{t+1}) &\leq f(\theta_t) - \frac{\eta}{2} ||\nabla f(\theta_t)||_2^{2} \end{split} \end{equation} By substituting $||\nabla f(\theta_t)||_2^{2} \leq \frac{\mu}{2} (f(x) - f^\star)$ and $x = \theta_t$, \begin{equation} \begin{split} f(\theta_{t+1}) &\leq f(\theta_t) - \frac{\eta}{2} ||\nabla f(\theta_t)||_2^{2} \\ &\leq f(\theta_t) - \frac{\eta \mu}{2} (f(\theta_t) - f^\star)\\ f(\theta_{t+1}) - f^\star &\leq (f(\theta_t) - f^\star) (1 - \frac{\eta \mu}{2})\\ &\leq (f(\theta_{t-1}) - f^\star) (1 - \frac{\eta \mu}{2})^2\\ &\leq (f(\theta_0) - f^\star) (1 - \frac{\eta \mu}{2})^t &\leq (f(\theta_0) - f^\star) (1 - \frac{\mu}{L})^t \end{split} \end{equation} So in case of strong convexity, the convergence rate is linear.