###### tags: `one-offs` `concentration` `martingales` `convexity`
# Hoeffding's Inequality by Convex Ordering
**Overview**: In this note, I prove Hoeffding's inequality from the perspectives of martingales and convex ordering.
## The Basic Construction
Let $-\infty<a<x<b<\infty$, and define a random variable $M$ with law $M\left(x;a,b\right)$ by
\begin{align}
M=\begin{cases}
a & \text{w.p. }\frac{b-x}{b-a}\\
b & \text{w.p. }\frac{x-a}{b-a}.
\end{cases}
\end{align}
One can then check that $M$ has expectation $x$.
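As a quick numerical sanity check (a sketch: the sampler name `sample_M` and the particular values of $a$, $b$, $x$ are illustrative choices of mine, not part of the note), one can simulate $M\left(x; a, b\right)$ and confirm its empirical mean:

```python
import random

def sample_M(x, a, b):
    """Sample from the two-point law M(x; a, b): value b with
    probability (x - a) / (b - a), else value a."""
    return b if random.random() < (x - a) / (b - a) else a

random.seed(0)
a, b, x = -1.0, 2.0, 0.5   # illustrative choice with a < x < b
n = 200_000
empirical_mean = sum(sample_M(x, a, b) for _ in range(n)) / n
print(empirical_mean)      # should land close to x = 0.5
```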
## The Hoeffding Martingale
Now, consider a random variable $X$ which takes values in a subset of $\left[ a, b\right]$, and conditional on $X$, sample $Y \sim M \left( X; a, b \right)$. Following our previous observation, we see that $\mathbf{E} \left[Y\mid X\right] = X$, i.e. the ordered pair $\left( X, Y\right)$ is a martingale. Note also that marginally, $Y \overset{d}{=} M \left( \mathbf{E} \left[ X \right]; a, b\right)$.
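A quick simulation of this two-stage construction (a sketch: the helper `sample_M` and the particular law of $X$ below are arbitrary illustrative choices) checks the marginal claim $Y \overset{d}{=} M \left( \mathbf{E} \left[ X \right]; a, b\right)$:

```python
import random

def sample_M(x, a, b):
    """Two-point law on {a, b} with mean x."""
    return b if random.random() < (x - a) / (b - a) else a

random.seed(1)
a, b, n = -1.0, 2.0, 200_000
# X: any law supported on [a, b]; this particular choice is just for illustration.
xs = [a + (b - a) * random.random() ** 2 for _ in range(n)]
ys = [sample_M(x, a, b) for x in xs]      # Y | X ~ M(X; a, b)

mean_X = sum(xs) / n
p_b_empirical = sum(y == b for y in ys) / n
p_b_predicted = (mean_X - a) / (b - a)    # what Y =d M(E[X]; a, b) predicts
print(p_b_empirical, p_b_predicted)
```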
Because $\left(X, Y\right)$ form a martingale, by Jensen's inequality, it follows that they are also convex-ordered, i.e. for any convex $F$, it holds that $\mathbf{E} \left[ F\left(X\right) \right] \leqslant \mathbf{E} \left[ F\left(Y\right) \right]$. (Actually, a result of Strassen establishes that this characterisation is tight, i.e. being convex-ordered also implies the martingale property.)
In particular, letting $F \left( x \right) = \exp \left( t \cdot x \right)$ for $t \in \mathbf{R}$, it holds that the moment-generating function of $X$ is dominated by that of $Y \overset{d}{=} M \left( \mathbf{E} \left[ X \right]; a, b \right)$. Because $Y$ has a two-point distribution, its MGF is relatively easy to control; indeed, it is essentially the MGF of a standard biased coin flip, up to shifting and scaling. Using basic arguments from the theory of exponential families, one sees that
\begin{align}
\left(\log M_{Y}\right)^{''}\left(t\right)\leqslant\frac{\left(b-a\right)^{2}}{4}
\end{align}
uniformly in $t$, from which one can bound
\begin{align}
M_{Y}\left(t\right) \leqslant \exp \left(t \cdot \mathbf{E} \left[X\right] + \frac{t^{2}}{8} \cdot \left(b-a\right)^{2} \right).
\end{align}
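To spell out these two steps, for completeness (both are standard exponential-family computations): first, $\left(\log M_{Y}\right)''\left(t\right)$ is the variance of $Y$ under the exponential tilt proportional to $\exp\left(t \cdot y\right)$. A tilted two-point law on $\left\{ a,b\right\}$ is again supported on $\left\{ a,b\right\}$, so writing $m_{t}$ for its mean,
\begin{align}
\left(\log M_{Y}\right)''\left(t\right)=\left(b-m_{t}\right)\cdot\left(m_{t}-a\right)\leqslant\left(\frac{\left(b-m_{t}\right)+\left(m_{t}-a\right)}{2}\right)^{2}=\frac{\left(b-a\right)^{2}}{4},
\end{align}
by the AM-GM inequality. Second, writing $\psi=\log M_{Y}$, one has $\psi\left(0\right)=0$ and $\psi'\left(0\right)=\mathbf{E}\left[Y\right]=\mathbf{E}\left[X\right]$, so Taylor's theorem with integral remainder gives
\begin{align}
\psi\left(t\right)=t\cdot\mathbf{E}\left[X\right]+\int_{0}^{t}\left(t-s\right)\cdot\psi''\left(s\right)\,\mathrm{d}s\leqslant t\cdot\mathbf{E}\left[X\right]+\frac{t^{2}}{2}\cdot\frac{\left(b-a\right)^{2}}{4}.
\end{align}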
One can then see that
\begin{align}
M_{X} \left(t\right) \leqslant M_{Y} \left(t\right) \leqslant \exp \left(t \cdot \mathbf{E} \left[X\right] + \frac{t^{2}}{8} \cdot \left(b-a\right)^{2}\right),
\end{align}
and this is essentially the content of Hoeffding's lemma, from which one often deduces Hoeffding's (concentration) inequality.
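For completeness, the standard deduction runs as follows: if $X_{1},\dots,X_{n}$ are independent with $X_{i}$ taking values in $\left[a_{i},b_{i}\right]$, then applying the lemma to each centred summand and using Markov's inequality on $\exp\left(t\cdot S\right)$, where $S=\sum_{i}\left(X_{i}-\mathbf{E}\left[X_{i}\right]\right)$, gives for $\varepsilon>0$
\begin{align}
\mathbf{P}\left(S\geqslant\varepsilon\right)\leqslant\inf_{t>0}\exp\left(-t\cdot\varepsilon+\frac{t^{2}}{8}\cdot\sum_{i}\left(b_{i}-a_{i}\right)^{2}\right)=\exp\left(-\frac{2\varepsilon^{2}}{\sum_{i}\left(b_{i}-a_{i}\right)^{2}}\right),
\end{align}
with the infimum attained at $t=4\varepsilon\,/\sum_{i}\left(b_{i}-a_{i}\right)^{2}$.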
For me, this is nice because it highlights that the inequality is essentially a comparison result: "bounded random variables fluctuate less than an appropriately chosen two-point random variable". The specific form of the MGF bound and the resulting tail inequality are the practically relevant consequences of this comparison.
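The whole chain of dominations can also be checked numerically (a sketch: the sampler `sample_M`, the particular law of $X$, and the tolerances are illustrative choices of mine):

```python
import math
import random

def sample_M(x, a, b):
    """Two-point law on {a, b} with mean x."""
    return b if random.random() < (x - a) / (b - a) else a

def emp_mgf(samples, t):
    """Empirical moment-generating function E[exp(t * Z)]."""
    return sum(math.exp(t * z) for z in samples) / len(samples)

random.seed(2)
a, b, n = -1.0, 2.0, 100_000
xs = [a + (b - a) * random.random() ** 2 for _ in range(n)]  # some law on [a, b]
mean_X = sum(xs) / n
ys = [sample_M(mean_X, a, b) for _ in range(n)]              # Y ~ M(E[X]; a, b)

# Expect: MGF of X <= MGF of Y <= the Hoeffding bound, up to sampling noise.
for t in (-1.0, -0.5, 0.5, 1.0):
    bound = math.exp(t * mean_X + t ** 2 * (b - a) ** 2 / 8)
    print(t, emp_mgf(xs, t), emp_mgf(ys, t), bound)
```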
A potentially interesting variant of this idea would be to consider a construction where $Y$ takes values in, e.g., the set $\left[ a, s \right] \cup \left[ t, b \right]$ for some $a<s<t<b$. A similar ordering structure ought to be present, and perhaps there is some nice interpolation result along these lines.
For what it's worth, there is also a construction in [this paper](https://arxiv.org/abs/1909.12067) which is not dissimilar to this idea (but is certainly more ambitious in its scope than Hoeffding's lemma!).
Published on **HackMD**