# Notes on convexity and gradient descent
#### Author: [Sharath](https://sharathraparthy.github.io/)
## What is a convex function?
Let's try to define a convex function formally and geometrically. Formally, a function $f$ is said to be a convex function if the domain of $f$ is a convex set and if it satisfies the following $\forall x \ \text{and} \ y \in \text{dom} f$;
\begin{equation}
f(\theta x + (1 - \theta)y) \leq \theta f(x) + (1 - \theta)f(y)
\end{equation}
Geometrically it means that the value of a function at the convex combination of two points of the function always lies below the convex combination of the values at the corresponding points. It means that if we draw a line at any two points $(x, y) \in \text{dom} f$, then this line/chord always lies above the function $f$.
### First order and second order conditions for convexity
We can extend this definition to higher order derivatives as well if the function $f$ is "higher-order" differentiable. We have two such definitions;
#### First order condition:
For a first order differentiable function $f$, where the derivative exists for all point in the domain of $f$, we say the $f$ is convex iff
\begin{equation}
f(y) \geq f(x) + \nabla f(x)^{T}(y x)
\end{equation}
There is a very interesting interpretation of this condition which is nicely explained in boyd's convex optimization textbook. If we have a closer look at the right hand side of the inequality, we can quickly recognize that it is a first order taylor expansion of the function $f$ near $x$. So, this condition says that, if the first order taylor expansion is always a *global under-estimator* of the function, then the function is convex.
Another interesting view of this condition is that, just by using the local information provided by the derivative of the function at $x$, we can derive a global under-estimator of it. This can be used to further prove that the local minima is a global minima for convex function.
#### Second order condition:
If the function $f$ is twice differentiable then we say the the function is convex if the hessian is positive semi-definite.
$$
\nabla^2 f(x) \geq 0
$$
## Smoothness
A smooth function is a differentiable function with Lipschitz gradients. This means that for two points $(x, y) \in \text{dom} \ f$, the norm of the gradients at these two points is upper bounded by the norm between the two points.
$$
\forall x, y \ \ \ ||\nabla f(x) - \nabla f(y)||_2 \leq L ||x - y||_2
$$
where $L$ is a Lipschitz constant. Roughly it means that around each two points, the gradients should not vary too much. This is not true if the function has kinks.
For a twice differentiable function, we have the following condition for smoothness;
$$
\nabla^2f(x) \leq \mu I_d
$$
## Strong convexity and strong smoothness
### Strong convexity
This intuitively means, if we draw a tangent plane, the function $f$ always lies above the tangent plane for all the points in the domain of the function $f$. Formally, we can say a function is strongly convex if
\begin{equation}
f(y) \geq f(x) + \nabla f(x)^{T}(y - x) + \underbrace{\frac{\mu}{2} || y - x||^2_2}_{\text{new term}}
\end{equation}
where $\alpha$ is a convexity coefficient.
In some sense, in a simply convex function case, the gradient can be flat (in the sense, there may be some points where the gradient doesn't change at all). But in case of string convex function, the gradient strictly changes at every point. This picture can explain the intuition.
### Strong smoothness
In case of a strongly smooth function, the tangent plane lies always below the function.Here we can flip the inequality of the condition for strong convexity;
\begin{equation}
f(y) \leq f(x) + \nabla f(x)^{T}(y - x) + \underbrace{\frac{\mu}{2} || y - x||^2_2}_{\text{new term}}
\end{equation}
We can prove this by simply writing the taylor expansion of $f$ around $x$. This gives,
$$
f(y) = f(x) + \nabla f(x)^{T}(y - x) + \frac{1}{2} (y - x)^{T}\nabla^2f(x)(y - x) + ....
$$
But from the definition of smoothness we know that $\nabla^2f(x) \leq \mu I_d$. And hence we the we can upper bound the taylor series by
\begin{equation}
\begin{split}
f(y) &\leq f(x) + \nabla f(x)^{T}(y - x) + \frac{1}{2}(y - x)^{T} \mu I_d (y - x)\\
&\leq f(x) + \nabla f(x)^{T}(y - x) + \frac{\mu}{2}||y - x||_{2}^{2}
\end{split}
\end{equation}
## Gradient descent method
In optimization, we are interested searching for the in minima of the function by taking some baby steps towards that direction proportional to the gradient or the hessian. In case of convex functions the local minima is the global minima. One of the intuitive ways to achieve this is to move along the negative gradient of the function $-\nabla f(x)$. The update rule is given by
$$
\theta_{t+1} = \theta_{t} - \eta \nabla f(\theta_t)
$$
where $\eta$ is the step-size.
The resulting algorithm is called the "*gradient descent*" and is widely used in machine learning because of the strong convergence properties. Let's have a closer look at the convergence results for two cases:
1. Strong smoothness case
1. Strong convexity case
### Gradient descent lemma for strong smooth functions
Here let's try to show that for a strong smooth function $f$, the gradient descent converges. Since the smoothness condition is true for all $x, y$, we can apply the same condition for particular $x$ and $y$. Specifically let $y = \theta_{t+1}$ and $x = \theta_t$ . Note that, we can modify the GD update to get, $\theta_{t+1} - \theta_{t} = - \eta \nabla f(\theta_t)$. By plugging these, we get
\begin{equation}
\begin{split}
f(\theta_{t+1}) &\leq f(\theta_t) + \nabla f(\theta_t)^{T}(\theta_{t+1} - \theta_t) + \frac{\mu}{2}||\theta_{t+1} - \theta_t||_{2}^{2} \\
&\leq f(\theta_t) + \nabla f(\theta_t)^{T} (- \eta \nabla f(\theta_t)) + \frac{\mu}{2} \eta^2 ||\nabla f(\theta_t)||_2^2\\
&\leq f(\theta_t) \underbrace{- \eta ||\nabla f(\theta_t)^{T}||_2^{2}}_{\text{term 1}} + \underbrace{\frac{\mu}{2} \eta^2 || \nabla f(\theta_t)||_2^2}_{\text{term 2}}
\end{split}
\end{equation}
We can see that the "term 1" is linear in step size and the "term 2" is quadratic in step size. For small values of step size, the linear term dominates the quadratic term making the net still negative. But this is not true for the large step size values. So there should be a notion of optimal step size to comment about the convergence in this case. So let's try to find that. This is fairly simple as the right hand side is a quadratic function of $\eta$. So, the optimal values can be found just by equating the gradient wrt $\eta$ to zero. And the optimal value is $\eta^\star = \frac{1}{\mu}$.
So for this optimal step size, we get;
\begin{equation}
\begin{split}
f(\theta_{t+1})
&\leq f(\theta_t) - \frac{1}{2 \mu} ||\nabla f(\theta_t)^{T}||_2^{2}
\end{split}
\end{equation}
We can also say that this holds for all $\eta \leq \frac{1}{L}$;
\begin{equation}
\begin{split}
f(\theta_{t+1})
&\leq f(\theta_t) - \frac{\eta}{2} ||\nabla f(\theta_t)^{T}||_2^{2}
\end{split}
\end{equation}
#### What is so beautiful about this lemma?
This lemma talks only about the smooth functions and not the convex functions and guarantees the convergence. So, this lemmma is at the heart of the proof when we talk about the non-convex functions which is often the case with DL models.
### Gradient descent lemma for strong convex functions
Now let's try to prove the convergence of GD for the strongly convex case. Recall that, for a differentiable function $f$, the string convexity condition is given by
$$
f(y) \geq f(x) + \nabla f(x)^{T}(y - x) + \frac{\mu}{2} || y - x||^2_2
$$
This inequality is used to bound $f(x) - f^\star$ where $p*$ is the optimal point. Let's try to find $y$ which minimizes the right hand side. This can be found by equating the gradient wrt y to zero. We get, $\tilde{y} = \frac{1}{\mu}\nabla f(x)$. By substituting this, we get;
$$
f(y) \geq f(x) - \frac{1}{2\mu} ||\nabla f(x)||_2^{2}
$$
This holds for any $y$, so we have;
$$
f^\star \geq f(x) - \frac{1}{2\mu} ||\nabla f(x)||_2^{2}
$$
The intuition of this inequality is that if the gradient is small then we are close to the optimum.
Now from the above lemma we know that;
\begin{equation}
\begin{split}
f(\theta_{t+1})
&\leq f(\theta_t) - \frac{\eta}{2} ||\nabla f(\theta_t)||_2^{2}
\end{split}
\end{equation}
By substituting $||\nabla f(\theta_t)||_2^{2} \leq \frac{\mu}{2} (f(x) - f^\star)$ and $x = \theta_t$,
\begin{equation}
\begin{split}
f(\theta_{t+1}) &\leq f(\theta_t) - \frac{\eta}{2} ||\nabla f(\theta_t)||_2^{2} \\
&\leq f(\theta_t) - \frac{\eta \mu}{2} (f(\theta_t) - f^\star)\\
f(\theta_{t+1}) - f^\star &\leq (f(\theta_t) - f^\star) (1 - \frac{\eta \mu}{2})\\
&\leq (f(\theta_{t-1}) - f^\star) (1 - \frac{\eta \mu}{2})^2\\
&\leq (f(\theta_0) - f^\star) (1 - \frac{\eta \mu}{2})^t
&\leq (f(\theta_0) - f^\star) (1 - \frac{\mu}{L})^t
\end{split}
\end{equation}
So in case of strong convexity, the convergence rate is linear.
<!-- ## Is local optimality == global optimality for a convex function
I'll try to intuitively make an argument about this. Let's bring back the first order condition of convexity. \begin{equation}
f(y) \geq f(x) + \nabla f(x)^{T}(y x)
\end{equation}
Lets say that, we have a global minima at some point $x$. Now, at the optimal points, the gradient is zero ($\nabla f(x) = 0$). By doing that , we can see that $f(y) \geq f(x)$ which means that for all $y \in \text{dom} f$, $x$ is a global minimizer. -->
<!-- ## Implications of strong convexity
-->

SharathRaparthy
I am a Masters student at Mila working on continual reinforcement learning.

Author: Sharath Paper Link Overview Curriculum learning is inspired by the way human learns, where the examples are shown in the increasing order of the difficulty. More sepcifically te network is exposed to the easier examples in the early stages of training and then gradually to the tougher ones. This paper studies the benifits of showing this sequential ordered eamples to the network and comments about when it works and when it doesn't. Contributions: This paper introduces a phenomenon called implicit curricula. One of the claims they make is that the the ordered learning (curriculum, anti-curriculum and random) almost performs same in the standard settings. Curricula is benificial when there is a limited time budget and in noisy regime

7/4/2021Author: Sharath Overview This paper discusses how the BNNs behave as we scale to the large models. This paper discusses the "soap-bubble" issue in case of high dimensional probability spaces and how MFVI suffers from this. As a way to tackle this issue, the authors propose a new variational posterior approximation in hyperspherical coordinate system and show that this overcomes the soap-bubble issue when we sample from this posterior. The geometry of high dimensional spaces One of the properties of high dimensional spaces is that there is much more volume outside any given neighbourhood than inside of it. Betacount et al explained this behaviour visually with two intuitive examples. For first example let us consider partitioning our parameter space in equal rectangular intervals as shown below. We can see that as we increase the dimensions the distribution of volume around the center decreases. This becomes almost negligible as compared to the its neighbourhood in high dimensional cases where $D$ is very large. We can observe a similar behaviour if we consider spherical view of parameter space, where the exterior volume grows even larger than the interior in high dimensional spaces as shown in the figure. . How this intuition of volumes in high dimensions explain soap bubble phenomenon?

7/4/2021Author: Sharath What is a dynamical system? It is any system that evolves and changes through time governed by a set of rules. Using dynamical systems we can study the long term behavior of an evolving system. Formally, it is a triplet $(X, T, \phi)$ where $X$ denotes the state space, $T$ denotes the time space and $\phi: X \times T \rightarrow X$ is the flow (this is the rule that governs the evolution). There are few properties of flow: $\phi(X, 0) = X$ Principle of compositionality: $\phi(\phi(x, t), s) = \phi(x, t+s)$

7/4/2021Author: Sharath Chandra Paper Link tags: simulation, interaction-networks, robotics In this paper the authors proposed a hybrid dynamics model, Simulation-Augemented Interaction Networks, where they incorporated Interaction Networks into a physics engine for solving real world complex robotics control tasks. Brief Outline: Most of the physics based simulators serves as a good platform for carrying out robot planning and control tasks. But no simulator is a perfect because it has it's own modelling errors. So, most of the physics engines (mujoco, bullet, gazebo etc.,) demonstrate some descrepencies between their predictions and actual real world predictions. To decrease these errors, many methods have been propsed in the literature. Some of the methods include randomizing the simulation environments, famously known as Domain Randomization. In this paper, model errors are tackled by learning a residual model between the real world and simulator. In other words, instead of adding pertubations to the environment parameters, here we utilize some real world data to correct the simulator. Even though this method uses some real world data, this method is shown to be sample efficient and have better generalization capabilities. Interaction Networks

7/4/2021
Published on ** HackMD**