# Notes on sharpening Jensen's Inequality

Consider a convex function $\phi(x)$ and a random variable $X$. Jensen's inequality states that

$$
\mathbb{E}[\phi(X)] \geq \phi(\mathbb{E}[X])
$$

([Liao and Berg, 2017](https://arxiv.org/abs/1707.08644)) sharpened this inequality, made the relationship more expressive, and also provided a bound that potentially works in both directions. The derivation is simple. Here I'm going to give the derivation assuming $X$ is a scalar random variable, but a similar derivation holds for multivariate variables, too.

### What we expect intuitively

Before deriving a bound it's worth thinking about what we expect the gap in the bound to depend on, to get an intuition about what additional terms we can perhaps throw in there to sharpen it. Intuitively, Jensen's inequality holds strongest (i.e. the gap is largest) when:

* the function's curvature is high in the support of the random variable $X$. If the function is linear, Jensen's inequality is in fact an equality. The more non-linear a function is (which in the case of a convex function means high curvature), the more we deviate from this.
* the entropy/variance of $X$ is large. If $X$ is a deterministic variable, equality holds. Locally all functions look linear, but the more $X$ is spread out, the more curvature of the function $\phi$ we encounter.

So we expect to be able to close the Jensen gap by including some extra terms that likely depend on the variance/entropy of $X$ and the curvature of $\phi$.

### Derivation

Let's model $\phi$ as its first-order Taylor approximation at $\mu = \mathbb{E}[X]$ plus whatever the difference is:

$$
\phi(x) = \phi(\mu) + \phi'(\mu)(x - \mu) + r(x)
$$

Using this, the difference between the two sides of Jensen's inequality becomes:

\begin{align}
\mathbb{E}[\phi(X)] - \phi(\mathbb{E}[X]) &= \mathbb{E}\left[\phi(\mu) + \phi'(\mu)(X - \mu) + r(X) - \phi(\mu)\right]\\
&= \phi'(\mu)\mathbb{E}[X - \mu] + \mathbb{E}[r(X)]\\
&= \mathbb{E}[r(X)]
\end{align}

So, somewhat unsurprisingly, what we are looking at is the expected value of the difference between $\phi(X)$ and its first-order Taylor approximation at $\mu$. Motivated by the above intuition we can try to bound this in the following way:

$$
\mathbb{E}[r(X)] = \mathbb{E}\left[\frac{r(X)}{(X - \mu)^2} (X - \mu)^2\right]
$$

Now let's say we can bound $\frac{r(X)}{(X - \mu)^2}$ such that

$$
C_{\min} \leq \frac{r(X)}{(X - \mu)^2} \leq C_{\max},
$$

which is usually doable if $X$ takes values in a finite interval $[a,b]$. Then we can bound the Jensen gap as:

$$
C_{\min}\mathbb{V}[X] \leq \mathbb{E}[\phi(X)] - \phi(\mathbb{E}[X]) \leq C_{\max}\mathbb{V}[X].
$$

The constants here depend on the curvature of $\phi$. By the Lagrange form of the Taylor remainder, $r(x) = \frac{\phi''(\xi)}{2}(x - \mu)^2$ for some $\xi$ between $\mu$ and $x$, so if $X$ takes values in $[a, b]$, then

$$
C_{\max} \leq \sup_{x\in [a,b]} \frac{\phi''(x)}{2}
$$

And similarly for the lower bound

$$
C_{\min} \geq \inf_{x\in [a,b]} \frac{\phi''(x)}{2}
$$

So, putting this together, a weaker form of the bound is:

$$
\inf_{x\in [a,b]} \frac{\phi''(x)}{2}\mathbb{V}[X] \leq \mathbb{E}[\phi(X)] - \phi(\mathbb{E}[X]) \leq \sup_{x\in [a,b]} \frac{\phi''(x)}{2}\mathbb{V}[X]
$$

assuming that $X$ takes values within the interval $[a,b]$. This confirms our intuition that the gap in Jensen's inequality is monotonically increasing in the curvature of $\phi$ and the variance of $X$.
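To make this concrete, here is a minimal numerical sketch of the variance-based bound. The particular choices $\phi(x) = e^x$ and $X \sim \mathrm{Uniform}(0, 2)$ are mine, purely for illustration; the gap and the variance are estimated by Monte Carlo.

```python
import numpy as np

# Sanity check of:
#   inf phi''/2 * Var[X]  <=  E[phi(X)] - phi(E[X])  <=  sup phi''/2 * Var[X]
# Illustrative choices (not from the paper): phi(x) = exp(x), X ~ Uniform(a, b).
rng = np.random.default_rng(0)

a, b = 0.0, 2.0
phi = np.exp        # convex function
phi_2nd = np.exp    # its second derivative

x = rng.uniform(a, b, size=1_000_000)    # samples of X
gap = phi(x).mean() - phi(x.mean())      # E[phi(X)] - phi(E[X])
var = x.var()                            # V[X]

# On [a, b], phi'' = exp is increasing, so its inf/sup sit at the endpoints.
lower = phi_2nd(a) / 2 * var
upper = phi_2nd(b) / 2 * var

print(f"{lower:.4f} <= {gap:.4f} <= {upper:.4f}")
assert lower <= gap <= upper
```

With these choices the gap is roughly 0.48, sitting comfortably between the two variance-scaled curvature bounds.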
### Alternative versions of the bound

Depending on what $X$ is, and how the bound is used, the variance $\mathbb{V}[X]$ may not be the most convenient/meaningful choice of entropic measure. I considered the following generalised versions of the bound:

#### Mutual information

In many ML applications, such as in ([Masegosa, 2020](https://arxiv.org/abs/1912.08335)), we apply Jensen's inequality to variables of the form $X = p(z\vert \theta)$, where $\theta$ is a random variable. In these same applications, $\phi$ is often the negative logarithm. In such cases it may be beneficial to change the last step of the derivation like so:

$$
\mathbb{E}[r(X)] = \mathbb{E}\left[\frac{r(X)}{\log X - \log \mu} \log \frac{X}{\mu} \right]
$$

Now we can try to bound

$$
\tilde{C}_{\min} \leq \frac{r(X)}{\log X - \log \mu} \leq \tilde{C}_{\max}
$$

And now we have a bound of the form:

$$
\tilde{C}_{\min}\mathbb{E}\left[\log\frac{X}{\mu}\right] \leq \mathbb{E}[\phi(X)] - \phi(\mathbb{E}[X]) \leq \tilde{C}_{\max}\mathbb{E}\left[\log\frac{X}{\mu}\right].
$$

Why would this be useful? Remember we assumed that $X = p(z\vert \theta)$. Substituting this:

$$
\mathbb{E}_{X}\left[\log\frac{X}{\mu}\right] = \mathbb{E}_\theta\left[\log \frac{p(z\vert \theta)}{\mathbb{E}_\theta p(z\vert \theta)}\right] = \mathbb{I}[Z, \theta],
$$

where $\mathbb{I}$ is Shannon's mutual information between $Z$ and $\theta$. In some applications, including ([Masegosa, 2020](https://arxiv.org/abs/1912.08335)), this is a more meaningful quantity than

$$
\mathbb{V}[X] = \mathbb{E}_\theta\left[\left(p(z\vert \theta) - \mathbb{E}_\theta p(z\vert \theta)\right)^2\right].
$$

If $\phi(x) = -\log x$, as is often the case in ML applications of Jensen's inequality, we have to find a bound for the following quantity:

\begin{align}
\frac{\phi(x) - \phi(\mu) - (x - \mu)\phi'(\mu)}{\log x - \log \mu} &= -\frac{\log x - \log \mu - \frac{(x - \mu)}{\mu}} {\log x - \log \mu} \\
&= \frac{x - \mu}{\mu\log(\frac{x}{\mu})}-1
\end{align}

##### Why this does not work

Now I realised this version of the bound won't work. It's because $\log\frac{x}{\mu}$ is not necessarily positive. This means that the ratio we have to upper and lower bound can be negative, too, and in fact it will be negative whenever $x < \mu$. Hence, our lower bound will also be negative, so instead of a stronger Jensen bound we end up with a weaker one.

#### Entropy

Another variation on the bound, if the pdf of $X$ is $p$, is to use:

$$
\mathbb{E}[r(X)] = \mathbb{E}\left[\frac{-r(X)}{\log p(X)} \left(-\log p(X)\right) \right]
$$

Again, if we can find an upper and lower bound for

$$
\frac{-r(x)}{\log p(x)}
$$

then we can have a more precise bound in terms of the entropy $\mathbb{H}[X]$, instead of the variance. This may sometimes be more useful than the variance.
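As a small sketch of the entropy variant, here is a toy example of my own (not from the paper): I use a *discrete* distribution, so that $-\log p(x) \geq 0$ and the weighted-bound argument goes through, with $\phi(x) = x^2$ and arbitrary support points `xs` and probabilities `ps`.

```python
import numpy as np

# Toy check of:  C_min * H[X]  <=  E[phi(X)] - phi(E[X])  <=  C_max * H[X]
# where C_min/C_max bound r(x) / (-log p(x)) over the support.
# All choices below (xs, ps, phi) are illustrative assumptions.
xs = np.array([0.5, 1.0, 2.0, 3.5])   # support of X
ps = np.array([0.1, 0.4, 0.3, 0.2])   # probabilities, sum to 1
phi = lambda x: x**2                  # convex function, phi'(x) = 2x

mu = np.sum(ps * xs)                          # E[X]
r = phi(xs) - phi(mu) - 2 * mu * (xs - mu)    # first-order Taylor remainder at mu
weights = -np.log(ps)                         # -log p(x) >= 0 for a pmf
ratio = r / weights                           # the quantity we bound

entropy = np.sum(ps * weights)                # H[X]
gap = np.sum(ps * phi(xs)) - phi(mu)          # E[phi(X)] - phi(E[X])

lower = ratio.min() * entropy
upper = ratio.max() * entropy
print(f"{lower:.4f} <= {gap:.4f} <= {upper:.4f}")
assert lower <= gap <= upper
```

Here the gap equals $\mathbb{E}[r(X)] = \sum_i p_i\, \mathrm{ratio}_i\, w_i$ with nonnegative weights $w_i = -\log p_i$, which is exactly why pinching the ratio between its min and max over the support yields the entropy-scaled bound.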