Algebraic Softening of Maximum Functions

###### tags: `comments and reviews` `one-offs` # Algebraic Softening of Maximum Functions **Overview**: A recent [Twitter thread](https://twitter.com/jon_barron/status/1387167648669048833) advocated for possibly replacing the standard softplus function $x \mapsto \log \left( 1 + \exp x \right)$ by the 'squareplus' function $x \mapsto \frac{1}{2} \left( x + \sqrt{x^2+4} \right)$, as it avoids computation of transcendental functions and is 'all-algebraic', which may have certain benefits. Here, I discuss some context and extensions surrounding this option. ## Introduction In modern scientific computing, there is often something to be gained by working with smooth, differentiable functions, due in part to the advent of automatic differentiation as an all-purpose tool. In contrast, it remains useful to work with functions which have a distinctly non-smooth character, such as the positive-part function $x \mapsto \max(x, 0)$, and the maximum function $(x_1, \ldots, x_N) \mapsto \max(x_1, \ldots, x_N)$. A natural strategy is thus to seek smooth functions which are good approximations to these non-smooth functions, while remaining simple to implement and fast at runtime. ## Probabilistic View on Softmax and Softplus One approach to approximating the maximum function is to consider probability distributions $p$ which are defined over the set $[N] = \{ 1, \ldots, N \}$, and which take the value $n$ with probability $\propto f(x_n)$ for some appropriate monotone function $f$. Working with probability distributions naturally has a sort of smoothing effect, which partially justifies why we might expect this to be a good approach. We can consider taking $f(x) = \exp(\beta x)$ with $\beta \gg 1$. This will give rise to a $p$ which concentrates most of its mass on $\arg\max_n x_n$: \begin{align} \mathbf{P}_\beta (n) &= \exp( \beta x_n - A (\beta )) \\ A(\beta) &= \log \left( \sum_{m = 1}^N \exp (\beta x_m) \right). \end{align} This is also known as a *Boltzmann distribution* (also *exponential family*), and can be derived as the minimiser of the functional \begin{align} \mathcal{F} (q) = \mathbb{E}_{q(n)} \left[- x_n \right] + \beta^{-1} \text{KL} (q, u) \end{align} where $u$ is the uniform distribution over $[N]$, and $\text{KL}$ is the relative entropy functional. Studying this log-normalising constant $A$, one can see that as $\beta$ grows, it holds that \begin{align} \frac{1}{\beta} A (\beta) \to \max(x_1, \ldots, x_N), \end{align} and it can be verified that $A$ is also smooth[^1]. As such, we have a feasible candidate for a smooth approximation to the maximum function. Note also that we can take $x_0 = 0, x_1 = x$ to approximate $\max(0, x) \approx \beta^{-1} \log (1 + \exp (\beta x))$. ## A Probabilistic Derivation via Pre-Processing In the aforementioned Twitter thread, the following is proposed as an alternative to softplus: \begin{align} \text{sq}(x) = \frac{1}{2} \left( x + \sqrt{x^2+4} \right). \end{align} In the interests of scale-invariance, we also consider the following generalisation: \begin{align} \text{sq}_\varepsilon(x) = \frac{1}{2} \left( x + \sqrt{x^2+4 \varepsilon^2} \right), \end{align} defined such that $\text{sq}_\varepsilon(x) = \varepsilon \cdot \text{sq} (x / \varepsilon)$. Initially, this expression appears to be quite different to softplus, in the absence of an accompanying derivation. Moreover, the generalisation to an analogue of softmax is not immediate. In any case, one can compute that \begin{align} \text{sq}_\varepsilon (2 \varepsilon \sinh \psi) &= \varepsilon \cdot \exp \psi \\ \implies \text{sq}_\varepsilon (x) &= \varepsilon \cdot \exp \left( \text{ar}\sinh \left(\frac{x}{2\varepsilon}\right) \right). \end{align} As such, we might view $\text{sq}$ as the minimiser of the functional \begin{align} \mathcal{F} (q) = \mathbb{E}_{q(n)} \left[- \text{ar}\sinh \left(\frac{x_n}{2\varepsilon}\right) \right ] + \beta^{-1} \text{KL} (q, u) \end{align} with $\varepsilon = \beta = 1$. That is, we construct a probability measure which favours small values of $x_n$, but far more gently than in the standard Boltzmann case. This is analogous to certain techniques in robust statistics, which seek to temper the influence of the likelihood for outlying observations. ## Roadblocks to Squaremax With this in mind, we might hope to assert that as $\varepsilon \to 0$, we have that \begin{align} \sum_{n = 1}^N \text{sq}_\varepsilon (x_n) &\approx \varepsilon \exp \left( \text{ar}\sinh \left(\frac{\max(x_1, \ldots, x_N) }{2\varepsilon}\right) \right). \end{align} Unfolding this relation, we would make the approximation \begin{align} \max(x_1, \ldots, x_N) &\approx 2\varepsilon \sinh \left( \log \left( \frac{1}{\varepsilon} \sum_{n = 1}^N \text{sq}_\varepsilon (x_n) \right) \right) \\ &= \sum_{n = 1}^N \text{sq}_\varepsilon (x_n) - \varepsilon^2 \left( \sum_{n = 1}^N \text{sq}_\varepsilon (x_n) \right)^{-1}. \end{align} However, this approximation is not true, and not particularly good, unless the largest of the $x_n$ is much bigger than all of the others. For example, if all of the $x_n$ are of the same order (i.e. $\max_n x_n / \min_n x_n = \mathcal{O} (1))$, then even for small $\varepsilon$, this 'pseudo-maximum' will behave more like $N$ times their average positive part. As distinct from the Boltzmann case, the largest of the $x_n$ is not highlighted to the same extent. Essentially, the slow logarithmic growth of $\text{ar}\sinh$ flattens the differences between the terms. One way around this is to introduce an additional parameter $\beta$ and write \begin{align} \sum_{n = 1}^N \text{sq}_\varepsilon (x_n)^\beta &= \varepsilon \exp \left( \beta \text{ar}\sinh \left(\frac{\max(x_1, \ldots, x_N) }{2\varepsilon}\right) \right) \cdot (1 + o(1)). \end{align} Similar derivations are rewarded with \begin{align} \max(x_1, \ldots, x_N) &\approx 2\varepsilon \sinh \left( \frac{1}{\beta} \log \left( \frac{1}{\varepsilon} \sum_{n = 1}^N \text{sq}_\varepsilon (x_n)^\beta \right) \right) \\ &= \left(\sum_{n = 1}^N \text{sq}_\varepsilon (x_n)^\beta\right)^{1/\beta} - \varepsilon^2 \left( \sum_{n = 1}^N \text{sq}_\varepsilon (x_n)^\beta \right)^{-1/\beta}. \end{align} In the setting where the $x_n$ are of comparable order, the resulting pseudo-maximum will thus be of order $N^{1 /\beta} \cdot x_+$. For $\beta \gtrsim \log N$, the effect of $N$ can be diminished, but the ability to discriminate between the $x_n$ remains limited. However, the quantities involved remain 'all-algebraic' in the spirit of the original proposal. ## Conclusion The Squareplus function proposed in [the original thread](https://twitter.com/jon_barron/status/1387167648669048833) seems to be an appropriate replacement for the Softplus function. However, a similar 'all-algebraic' replacement for the Softmax function is less straightforward, as algebraic functions struggle to discriminate between inputs of similar order, as distinct from the exponential function. This may still be of interest in applications, as the role of Softmax is not always to simply approximate the maximum function. It may be useful to consider variants of this construction which are location-invariant, rather than scale-invariant. [^1]: $A$ is also known as a *cumulant-generating function*, and has many other useful properties as well