###### tags: `comments and reviews` `one-offs`
# Algebraic Softening of Maximum Functions
**Overview**: A recent [Twitter thread](https://twitter.com/jon_barron/status/1387167648669048833) advocated for possibly replacing the standard softplus function $x \mapsto \log \left( 1 + \exp x \right)$ by the 'squareplus' function $x \mapsto \frac{1}{2} \left( x + \sqrt{x^2+4} \right)$, as it avoids computation of transcendental functions and is 'all-algebraic', which may have certain benefits. Here, I discuss some context and extensions surrounding this option.
## Introduction
In modern scientific computing, there is often something to be gained by working with smooth, differentiable functions, due in part to the advent of automatic differentiation as an all-purpose tool. In contrast, it remains useful to work with functions which have a distinctly non-smooth character, such as the positive-part function $x \mapsto \max(x, 0)$, and the maximum function $(x_1, \ldots, x_N) \mapsto \max(x_1, \ldots, x_N)$. A natural strategy is thus to seek smooth functions which are good approximations to these non-smooth functions, while remaining simple to implement and fast at runtime.
## Probabilistic View on Softmax and Softplus
One approach to approximating the maximum function is to consider probability distributions $p$ which are defined over the set $[N] = \{ 1, \ldots, N \}$, and which take the value $n$ with probability $\propto f(x_n)$ for some appropriate monotone function $f$. Working with probability distributions naturally has a sort of smoothing effect, which partially justifies why we might expect this to be a good approach.
We can consider taking $f(x) = \exp(\beta x)$ with $\beta \gg 1$. This will give rise to a $p$ which concentrates most of its mass on $\arg\max_n x_n$:
\begin{align}
\mathbf{P}_\beta (n) &= \exp( \beta x_n - A (\beta )) \\
A(\beta) &= \log \left( \sum_{m = 1}^N \exp (\beta x_m) \right).
\end{align}
This is also known as a *Boltzmann distribution* (also *exponential family*), and can be derived as the minimiser of the functional
\begin{align}
\mathcal{F} (q) = \mathbb{E}_{q(n)} \left[- x_n \right] + \beta^{-1} \text{KL} (q, u)
\end{align}
where $u$ is the uniform distribution over $[N]$, and $\text{KL}$ is the relative entropy functional.
Studying this log-normalising constant $A$, one can see that as $\beta$ grows, it holds that
\begin{align}
\frac{1}{\beta} A (\beta) \to \max(x_1, \ldots, x_N),
\end{align}
and it can be verified that $A$ is also smooth[^1]. As such, we have a feasible candidate for a smooth approximation to the maximum function.
Note also that we can take $x_0 = 0, x_1 = x$ to approximate $\max(0, x) \approx \beta^{-1} \log (1 + \exp (\beta x))$.
## A Probabilistic Derivation via Pre-Processing
In the aforementioned Twitter thread, the following is proposed as an alternative to softplus:
\begin{align}
\text{sq}(x) = \frac{1}{2} \left( x + \sqrt{x^2+4} \right).
\end{align}
In the interests of scale-invariance, we also consider the following generalisation:
\begin{align}
\text{sq}_\varepsilon(x) = \frac{1}{2} \left( x + \sqrt{x^2+4 \varepsilon^2} \right),
\end{align}
defined such that $\text{sq}_\varepsilon(x) = \varepsilon \cdot \text{sq} (x / \varepsilon)$.
Initially, this expression appears to be quite different to softplus, in the absence of an accompanying derivation. Moreover, the generalisation to an analogue of softmax is not immediate.
In any case, one can compute that
\begin{align}
\text{sq}_\varepsilon (2 \varepsilon \sinh \psi) &= \varepsilon \cdot \exp \psi \\
\implies \text{sq}_\varepsilon (x) &= \varepsilon \cdot \exp \left( \text{ar}\sinh \left(\frac{x}{2\varepsilon}\right) \right).
\end{align}
As such, we might view $\text{sq}$ as the minimiser of the functional
\begin{align}
\mathcal{F} (q) = \mathbb{E}_{q(n)} \left[- \text{ar}\sinh \left(\frac{x_n}{2\varepsilon}\right) \right ] + \beta^{-1} \text{KL} (q, u)
\end{align}
with $\varepsilon = \beta = 1$. That is, we construct a probability measure which favours small values of $x_n$, but far more gently than in the standard Boltzmann case. This is analogous to certain techniques in robust statistics, which seek to temper the influence of the likelihood for outlying observations.
## Roadblocks to Squaremax
With this in mind, we might hope to assert that as $\varepsilon \to 0$, we have that
\begin{align}
\sum_{n = 1}^N \text{sq}_\varepsilon (x_n) &\approx \varepsilon \exp \left( \text{ar}\sinh \left(\frac{\max(x_1, \ldots, x_N) }{2\varepsilon}\right) \right).
\end{align}
Unfolding this relation, we would make the approximation
\begin{align}
\max(x_1, \ldots, x_N) &\approx 2\varepsilon \sinh \left( \log \left( \frac{1}{\varepsilon} \sum_{n = 1}^N \text{sq}_\varepsilon (x_n) \right) \right) \\
&= \sum_{n = 1}^N \text{sq}_\varepsilon (x_n) - \varepsilon^2 \left( \sum_{n = 1}^N \text{sq}_\varepsilon (x_n) \right)^{-1}.
\end{align}
However, this approximation is not true, and not particularly good, unless the largest of the $x_n$ is much bigger than all of the others. For example, if all of the $x_n$ are of the same order (i.e. $\max_n x_n / \min_n x_n = \mathcal{O} (1))$, then even for small $\varepsilon$, this 'pseudo-maximum' will behave more like $N$ times their average positive part. As distinct from the Boltzmann case, the largest of the $x_n$ is not highlighted to the same extent. Essentially, the slow logarithmic growth of $\text{ar}\sinh$ flattens the differences between the terms.
One way around this is to introduce an additional parameter $\beta$ and write
\begin{align}
\sum_{n = 1}^N \text{sq}_\varepsilon (x_n)^\beta &= \varepsilon \exp \left( \beta \text{ar}\sinh \left(\frac{\max(x_1, \ldots, x_N) }{2\varepsilon}\right) \right) \cdot (1 + o(1)).
\end{align}
Similar derivations are rewarded with
\begin{align}
\max(x_1, \ldots, x_N) &\approx 2\varepsilon \sinh \left( \frac{1}{\beta} \log \left( \frac{1}{\varepsilon} \sum_{n = 1}^N \text{sq}_\varepsilon (x_n)^\beta \right) \right) \\
&= \left(\sum_{n = 1}^N \text{sq}_\varepsilon (x_n)^\beta\right)^{1/\beta} - \varepsilon^2 \left( \sum_{n = 1}^N \text{sq}_\varepsilon (x_n)^\beta \right)^{-1/\beta}.
\end{align}
In the setting where the $x_n$ are of comparable order, the resulting pseudo-maximum will thus be of order $N^{1 /\beta} \cdot x_+$. For $\beta \gtrsim \log N$, the effect of $N$ can be diminished, but the ability to discriminate between the $x_n$ remains limited. However, the quantities involved remain 'all-algebraic' in the spirit of the original proposal.
## Conclusion
The Squareplus function proposed in [the original thread](https://twitter.com/jon_barron/status/1387167648669048833) seems to be an appropriate replacement for the Softplus function. However, a similar 'all-algebraic' replacement for the Softmax function is less straightforward, as algebraic functions struggle to discriminate between inputs of similar order, as distinct from the exponential function. This may still be of interest in applications, as the role of Softmax is not always to simply approximate the maximum function. It may be useful to consider variants of this construction which are location-invariant, rather than scale-invariant.
[^1]: $A$ is also known as a *cumulant-generating function*, and has many other useful properties as well

Overview: In this note, I log some basic observations about diffusion-based generative models.

8/14/2023Overview: In this note, I describe some aspects of hierarchical structure in MCMC algorithms, and how they can be of theoretical and practical relevance.

8/9/2023Overview: In this note, I discuss a recurrent question which can be used to generate research questions about methods of all sorts. I then discuss a specific instance of how this question has proved fruitful in the theory of optimisation algorithms. Methods and Approximations A nice story is that when Brad Efron derived the bootstrap, it was done in service of the question “What is the jackknife an approximation to?”. I can't help but agree that there's something quite exciting about research questions which have this same character of ''What is (this existing thing) an approximation to?''. One bonus tilt on this which I appreciate is that there can be multiple levels of approximation, and hence many answers to the same question. One well-known example is gradient descent, which can be viewed as an approximation to the proximal point method, which can then itself be viewed as an approximation to a gradient flow. There are probably even more stops along the way here. In this case, there is even the perspective that from the perspective of mathematical theory, there may be at least as much to be gained by stopping off at the proximal point interpretation, as there is from the gradient flow perspective. My experience is that generalist applied mathematicians get to grips with the gradient flow quickly, but optimisation theorists can squeeze more out of the PPM formulation. There is thus some hint that using this 'intermediate' approximation can be particularly insightful in its own right. It would be interesting to collect more examples with this character.

5/22/2023Overview: In this note, I prove Hoeffding's inequality from the perspectives of martingales and convex ordering. The Basic Construction Let $-\infty<a<x<b<\infty$, and define a random variable $M$ with law $M\left(x;a,b\right)$ by \begin{align} M=\begin{cases} a & \text{w.p. }\frac{b-x}{b-a}\ b & \text{w.p. }\frac{x-a}{b-a}. \end{cases}

5/22/2023
Published on ** HackMD**

or

By clicking below, you agree to our terms of service.

Sign in via Facebook
Sign in via Twitter
Sign in via GitHub
Sign in via Dropbox
Sign in with Wallet

Wallet
(
)

Connect another wallet
New to HackMD? Sign up