###### tags: `one-offs` `inequalities`
# Why not Orlicz Norms?
**Overview**: The following note gives a brief overview of Orlicz norms, why they are sometimes useful to work with, and why they might deserve 'a seat at the table' with $L^p$ norms.
## Introduction
I recently came across a question on MathOverflow in which the asker was curious about the significance of $L^p$ spaces outside of 'the usual suspects' of $p \in \{ 1, 2, \infty \}$. This is a question I always like asking, because people tend to come out with cool examples for which other $p$ are relevant and even insightful.
Anyways, in the comments section, a commenter made a point to the effect of 'it is not just these other values of $p$ which we neglect, but other norms and Banach spaces at large'. They specifically mentioned Orlicz spaces as an example of spaces which are in a similar vein to $L^p$ spaces, but maybe not as popular to work with.
From my point of view, the neglect is relatively transparent: theorists like to work with objects which have convenient properties, and for which it is straightforward to formulate positive assumptions. Still, I find these other spaces a bit fun to play with, and for many results of interest, the generalisation from $L^p$ to Orlicz spaces is not much extra work. With this in mind, I wanted to write a bit about Orlicz spaces.
## Definitions
Let $\Psi$ denote the space of positive, increasing, convex functions $\psi: [0, \infty) \to [0, \infty)$ which satisfy $\psi(0)=0$. The simplest examples are to think of $\psi(x) = |x|^p$ for $p \geqslant 1$. For a bit more spice, take $\psi_a(x) = \exp(x^a) - 1$ for $a \geqslant 1$. There are many other possibilities as well, though I have not seen them so much in practice.
Fixing a probability measure $\mu$ and a function $\psi \in \Psi$, the $\psi$-Orlicz space over $\mu$ can be viewed as the linear span of functions $f$ such that $\int \mu(\mathrm{d} x) \cdot \psi( | f(x) | ) < \infty$. There are a few ways to define a norm on this space; the one which I typically find easiest to work with is
\begin{align}
\| f \|_{L^\psi ( \mu)} := \inf \left\{ t > 0 : \int \mu(\mathrm{d} x) \cdot \psi \left( \frac{| f(x) |}{t} \right) \leqslant 1 \right\}.
\end{align}
It is a nice exercise to convince oneself that this is indeed a norm, as this allows one to see where some of the assumptions on $\psi$ come from. Note that for $\psi(x) = |x|^p$, this recovers the usual definition of $L^p ( \mu)$, so we seem to have meaningfully generalised the notion.
## Applications
A basic application of Orlicz norms is to estimate tail probabilities of random variables. Roughly speaking, if $f$ has a finite $L^\psi$ norm, then $\psi \circ |f|$ is integrable, and so the tails of $f$ should decay like $1/\psi$. Indeed, write
\begin{align}
\mathbf{P} \left( |f(X)| \geqslant t \right) &= \int \mu(\mathrm{d} x) \cdot \mathbf{I} \left[ \frac{| f(x) |}{\| f \|_\psi} \geqslant \frac{t}{\| f \|_\psi} \right] \\
&\leqslant \int \mu(\mathrm{d} x) \cdot \frac{\psi \left( \frac{| f(x) |}{\| f \|_\psi} \right)}{\psi \left( \frac{t}{\| f \|_\psi} \right)} \\
&\leqslant \psi \left( \frac{t}{\| f \|_\psi} \right)^{-1}.
\end{align}
One can basically recognise this as Markov's inequality.
There are also particularly clean ways in which to bound the Orlicz norm of a maximum of independent and identically-distributed random variables, where having a finer understanding of the tail properties can pay off nicely.
## Truncation Arguments
(this example is more or less developed in [this paper](https://perso.math.univ-toulouse.fr/cattiaux/files/2013/11/alea.pdf), which is **not** mine)
I recently had cause to use Orlicz norms in the following context: I had access to a functional inequality which only held for bounded functions, but wanted to derive consequences for unbounded functions.
Roughly speaking, I was working with a linear operator $P$ on a Hilbert space of functions $L^2 ( \mu )$, which satisfies $\| P h \|_2 \leqslant \| h \|_2$ for all $h \in L^2 ( \mu )$, and $\| P h - \mu(h) \|_2 \leqslant c \cdot \| h \|_{\mathrm{osc}}$ for all bounded $h$ and some small constant $c$.
The hope was then that for $h \in \mathcal{H}$ which are unbounded but have thin tails (in an appropriate sense), a qualitatively similar bound ought to hold.
For now, let us work with positive $h$, and attempt a truncation argument, i.e. fix some $H > 0$, and write
\begin{align}
h &= h \cdot \mathbf{I} \left[ h \leqslant H \right] + h \cdot \mathbf{I} \left[ h > H \right] \\
&=: h_{\leqslant H} + h_{>H}.
\end{align}
Now, decompose
\begin{align}
\| P h - \mu(h) \|_2 &\leqslant \| Ph - Ph_{\leqslant H} \|_2 \\
&+ \| Ph_{\leqslant H} - \mu(h_{\leqslant H}) \|_2 \\
&+ \| \mu(h_{\leqslant H}) - \mu(h) \|_2.
\end{align}
We can always bound the second term as
\begin{align}
\| Ph_{\leqslant H} - \mu(h_{\leqslant H}) \|_2 \leqslant c \cdot \| h_{\leqslant H} \|_{\mathrm{osc}} \leqslant c \cdot H.
\end{align}
Rewrite the first term as $\| Ph - Ph_{\leqslant H} \|_2 = \| P h_{> H} \|_2 \leqslant \| h_{> H} \|_2$, and the third term as $\| \mu(h_{\leqslant H}) - \mu(h) \|_2 = | \mu ( h_{>H} ) | \leqslant \| h_{> H} \|_2$. To control these terms, we need to understand the tails of $h$.
To this end, assume that $h$ has a finite $\psi$-Orlicz norm for some $\psi \in \Psi$ which grows faster than quadratically, i.e. such that $x \mapsto x^{-2} \psi (x)$ is increasing. Then it holds that
\begin{align}
\| h_{> H} \|_2^2 &= \int \mu ( \mathrm{d} x) \cdot h(x)^2 \cdot \mathbf{I} \left[ h > H \right] \\
&\leqslant H^2 \cdot \psi \left( \frac{H}{\| h \|_\psi} \right)^{-1} \cdot \int \mu ( \mathrm{d} x) \cdot \psi \left( \frac{ |h(x)| }{\| h \|_\psi} \right) \\
&\leqslant H^2 \cdot \psi \left( \frac{H}{\| h \|_\psi} \right)^{-1}.
\end{align}
Writing $H = \| h \|_\psi \cdot t$, it thus holds that
\begin{align}
\| P h - \mu(h) \|_2 \leqslant \inf_{t > 0} \left\{ c \cdot t + 2 \cdot t \cdot \psi(t)^{-1/2}\right\} \cdot \| h \|_\psi,
\end{align}
and so by setting the cutoff $H$ (equivalently, $t$) appropriately, one can estimate the operator norm of $P - \mu$ from $L^\psi$ to $L^2$.
For a generic estimate, take $t = \psi^{-1} ( 1 / \delta^2 \cdot c^2)$ to upper bound the infimum $(1 + 2 \cdot \delta) \cdot c \cdot \psi^{-1} ( 1 / \delta^2 \cdot c^2)$. In the $L^p$ case, taking $\delta = 1$ thus yields an estimate of $3 \cdot c^{1-2/p}$, which is not too bad. With $\psi (x) = \exp (x) - 1$, this improves to $3 \cdot c \cdot \log \left(1 + \frac{1}{c^2} \right)$.
Actually, the most annoying part for me is that when $\psi(x) = x^2$, I pick up this weird factor of $3$ out the front, when I know that I should be able to put a $1$ there. Even with smarter optimisation over $t$, I can only get it down to a $2$, so the issue comes earlier; I think that perhaps this loss is inevitable if one uses the triangle inequality right at the beginning.
## Some Other Stuff
There are a number of nice interpolation inequalities involving $L^p$ norms, i.e. since when $1 \leqslant p < q < r$, it holds on $\mathbf{R}$ that $|x|^q \lesssim |x|^p + |x|^r$, it is not so difficult to find bounds of the form $| x |_q \leqslant F_{q \mid p, r} \left( \| x \|_p, \| x \|_r \right)$. I expect that a similar story can be told for Orlicz norms, but I'm still finding my feet a bit.
There are also a bunch of nice functional inequalities (e.g. generalised Sobolev inequalities) which can be Orlicz-ified; these can be used to gain a finer understanding of how different Markov processes converge to equilibrium. For readers who are familiar with the notion of hypercontractivity, there is also an Orlicz version of that.
## Conclusion
For me, I am still not often interested in particular non-standard $L^p$ norms, or in other more exotic Orlicz norms. However, when developing mathematical techniques or inequalities which work in *one* of these settings, I do feel that it is useful exercise to see whether they might actually work in *all* of those spaces. I get the sense that it provides a fuller picture of what is going on in a problem, and how finely the conclusions depend on e.g. tail behaviours. I'm going to try to retain it as a habit, at least for a bit.

Overview: In this note, I log some basic observations about diffusion-based generative models.

8/14/2023Overview: In this note, I describe some aspects of hierarchical structure in MCMC algorithms, and how they can be of theoretical and practical relevance.

8/9/2023Overview: In this note, I discuss a recurrent question which can be used to generate research questions about methods of all sorts. I then discuss a specific instance of how this question has proved fruitful in the theory of optimisation algorithms. Methods and Approximations A nice story is that when Brad Efron derived the bootstrap, it was done in service of the question “What is the jackknife an approximation to?”. I can't help but agree that there's something quite exciting about research questions which have this same character of ''What is (this existing thing) an approximation to?''. One bonus tilt on this which I appreciate is that there can be multiple levels of approximation, and hence many answers to the same question. One well-known example is gradient descent, which can be viewed as an approximation to the proximal point method, which can then itself be viewed as an approximation to a gradient flow. There are probably even more stops along the way here. In this case, there is even the perspective that from the perspective of mathematical theory, there may be at least as much to be gained by stopping off at the proximal point interpretation, as there is from the gradient flow perspective. My experience is that generalist applied mathematicians get to grips with the gradient flow quickly, but optimisation theorists can squeeze more out of the PPM formulation. There is thus some hint that using this 'intermediate' approximation can be particularly insightful in its own right. It would be interesting to collect more examples with this character.

5/22/2023Overview: In this note, I prove Hoeffding's inequality from the perspectives of martingales and convex ordering. The Basic Construction Let $-\infty<a<x<b<\infty$, and define a random variable $M$ with law $M\left(x;a,b\right)$ by \begin{align} M=\begin{cases} a & \text{w.p. }\frac{b-x}{b-a}\ b & \text{w.p. }\frac{x-a}{b-a}. \end{cases}

5/22/2023
Published on ** HackMD**