###### tags: `deep models` `one-offs`
# Compositional Construction of Functions with Higher-Order Monotonicity Properties
**Overview**: In this note, I describe some approaches to parametrising functions which have the property of 'higher-order monotonicity', i.e. that a finite number of derivatives of the function are nonnegative. This is related to recent developments in the design of neural networks which are convex functions.
## Compositional Parametrisation of Functions
In light of the success of deep learning-based methods in recent years, there is substantial interest in parametrising the solutions to various tasks with neural networks. An essential aspect of neural networks is their *compositional nature*; that is, they are constructed as the composition of many elementary functions, i.e.
\begin{align}
f = f_{L} \circ f_{L-1} \circ \cdots \circ f_{1}
\end{align}
where for $1 \leqslant \ell \leqslant L$, we take $f_{\ell} (x) = \sigma(A_\ell x + b_\ell)$ for some matrix $A_\ell$, some vector $b_\ell$, and some fixed 'activation function' $\sigma$, e.g. $\sigma(x) = \max(x, 0)$.
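For concreteness, here is a minimal NumPy sketch of such a composition with the ReLU activation above; the layer widths are arbitrary and purely illustrative.
```python
import numpy as np

def relu(x):
    # activation sigma(x) = max(x, 0), applied coordinate-wise
    return np.maximum(x, 0.0)

def mlp(x, layers):
    # layers is a list of (A_l, b_l) pairs; the composite map is
    # f = f_L o ... o f_1 with f_l(x) = sigma(A_l x + b_l)
    for A, b in layers:
        x = relu(A @ x + b)
    return x

# hypothetical widths: R^2 -> R^5 -> R^3
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(5, 2)), rng.normal(size=5)),
          (rng.normal(size=(3, 5)), rng.normal(size=3))]
y = mlp(np.array([1.0, -2.0]), layers)
```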
A challenge of the compositional paradigm is that it is qualitatively different to the classical approach of parametrising functions as being in the span of some linear basis. As such, enforcing structural constraints on the composite function $f$ requires new approaches, but also poses new opportunities.
## Higher-Order Monotonicity
Let $K \in \mathbf{N}$. Define the class of $K$-monotone functions on $\mathbf{R}$ to be
\begin{align}
\mathcal{F}_K = \{ f \in C^K (\mathbf{R}) : D^k f(x) \geqslant 0 \text{ on } \mathbf{R} \text{ for } 1 \leqslant k \leqslant K\}.
\end{align}
Note that $k = 0$ is *not* included in the definition, i.e. the functions in this class need not be nonnegative. For simplicity, I have also assumed that $D^K f$ is continuous; in principle, this assumption could be removed.
These function classes are **not** vector spaces, but they are convex (more precisely, they form a convex cone). Note also that these classes are nested, with each additional derivative constraint shrinking the class, i.e.
\begin{align}
\mathcal{F}_1 \supsetneq \mathcal{F}_2 \supsetneq \cdots
\end{align}
and their intersection is known as the class of absolutely-monotone functions, of which $x \mapsto \exp(x)$ is a standard example.
For some concrete examples,
* $\mathcal{F}_1$ is the class of nondecreasing functions on $\mathbf{R}$, and
* $\mathcal{F}_2$ is the class of nondecreasing, convex functions on $\mathbf{R}$.
Note that all sufficiently-smooth one-dimensional convex functions can be written as $f(x) = g(x) + h (-x)$ where $g, h \in \mathcal{F}_2$.
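For instance (ignoring, as later in this note, the lack of smoothness at the origin), one such decomposition of $f(x) = x^2$ is
\begin{align}
x^2 = \max(x, 0)^2 + \max(-x, 0)^2,
\end{align}
i.e. $g = h = \max(\cdot, 0)^2$, which is nondecreasing and convex.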
There is extensive work in the statistical literature on nonparametric regression under both monotonicity and convexity constraints. It can be demonstrated that by making these structural assumptions, one can construct estimators with improved rates of convergence. Moreover, there are many applications for which these assumptions are natural. For $K \geqslant 3$, the statistical relevance is not as clear to me, but one can expect that similar theoretical results should apply.
From the compositional viewpoint, these classes are very nice, due to the following lemma.
**Lemma**: Let $K \in \mathbf{N}$, and let $f, g \in \mathcal{F}_K$. Then $f \circ g \in \mathcal{F}_K$.
**Proof**: Faà di Bruno's formula tells us that the $k^\text{th}$ derivative of $f \circ g$ can be written as a polynomial in $\{ (D^m f) \circ g \}_{1 \leqslant m \leqslant k} \cup \{ D^m g\}_{1 \leqslant m \leqslant k}$ with nonnegative coefficients, i.e.
\begin{align}
\left( D^k ( f \circ g) \right)(x) = \sum_{\mathbf{i}, \mathbf{j}} \alpha_{\mathbf{i}, \mathbf{j}} \cdot \prod_{1 \leqslant m \leqslant k} (D^m f) (g(x))^{i_m} \cdot \prod_{1 \leqslant n \leqslant k} (D^n g) (x)^{j_n}
\end{align}
with $\alpha_{\mathbf{i}, \mathbf{j}} \geqslant 0$. By assumption, for $1 \leqslant m, n \leqslant K$, it holds for all $x \in \mathbf{R}$ that $(D^m f) (g(x)) \geqslant 0$ and $(D^n g) (x) \geqslant 0$, from which it follows that $\left( D^k ( f \circ g) \right)(x) \geqslant 0$ for $1 \leqslant k \leqslant K$. The result follows.
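For example, when $K = 2$, this is just the familiar chain-rule computation
\begin{align}
D (f \circ g)(x) &= (Df)(g(x)) \cdot (Dg)(x) \\
D^2 (f \circ g)(x) &= (D^2 f)(g(x)) \cdot \left( (Dg)(x) \right)^2 + (Df)(g(x)) \cdot (D^2 g)(x),
\end{align}
in which every term on the right-hand side is visibly nonnegative whenever $f, g \in \mathcal{F}_2$.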
## An Explicit Construction
We now turn to explicitly parametrising functions in $\mathcal{F}_K$. Suppose that we fix an activation function $\sigma \in \mathcal{F}_K$, e.g. $\sigma(x) = \max(x, 0)^K$ (ignoring for now the discontinuity in the $K^\text{th}$ derivative). By elementary considerations, we know that for any $a \geqslant 0, b \in \mathbf{R}$, it holds that $x \mapsto \sigma(ax+b)$ is also in $\mathcal{F}_K$. As such, we can take
\begin{align}
\mathbf{a}_1 = ( a_{1,1}, a_{1, 2}, \ldots, a_{1, N_1} ) &\in \mathbf{R}_{\geqslant 0}^{N_1} \\
\mathbf{b}_1 = ( b_{1,1}, b_{1, 2}, \ldots, b_{1, N_1} ) &\in \mathbf{R}^{N_1}
\end{align}
and define
\begin{align}
f_1 &: \mathbf{R} \to \mathbf{R}^{N_1} \\
&: x \mapsto \sigma ( \mathbf{a}_1 x + \mathbf{b}_1 )
\end{align}
where $\sigma$ is applied coordinate-wise.
Now, since each component of $f_1$ is in $\mathcal{F}_K$, we can apply this construction recursively. Let $\mathbf{A}_2 \in \mathbf{R}_{\geqslant 0}^{N_2 \times N_1}, \mathbf{b}_2 \in \mathbf{R}^{N_2}$, and define
\begin{align}
f_2 &: \mathbf{R}^{N_1} \to \mathbf{R}^{N_2} \\
&: x \mapsto \sigma ( \mathbf{A}_2 x + \mathbf{b}_2 ).
\end{align}
Our earlier results guarantee that the components of $f_2 \circ f_1$ will again be elements of $\mathcal{F}_K$.
Constructing $f_3, \ldots, f_L$ in a similar fashion, we end up with a map $f_L \circ \cdots \circ f_1 : \mathbf{R} \to \mathbf{R}^{N_L}$. To recover a one-dimensional map $F \in \mathcal{F}_K$, it suffices to take some $\mathbf{a}_\text{Last} \in \mathbf{R}_{\geqslant 0}^{N_L}$ and define
\begin{align}
F(x) = \langle \mathbf{a}_\text{Last}, \left( f_L \circ \cdots \circ f_1 \right) (x) \rangle.
\end{align}
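Here is a minimal NumPy sketch of this construction, using the activation $\sigma(x) = \max(x, 0)^K$ from above; the layer widths, and the trick of exponentiating unconstrained parameters to enforce nonnegativity of the weights, are my own illustrative choices rather than part of the construction itself.
```python
import numpy as np

def sigma(x, K=2):
    # activation in F_K (up to the discontinuity in the K-th derivative)
    return np.maximum(x, 0.0) ** K

def k_monotone_net(x, weights, biases, a_last, K=2):
    """Evaluate F(x) = <a_last, (f_L o ... o f_1)(x)>.

    weights: list of matrices with nonnegative entries (the first has one column),
    biases:  list of unconstrained vectors,
    a_last:  nonnegative vector of length N_L.
    """
    h = np.atleast_1d(x)               # input in R
    for A, b in zip(weights, biases):
        h = sigma(A @ h + b, K)        # each component stays in F_K
    return float(a_last @ h)           # nonnegative combinations stay in F_K

# hypothetical widths N_1 = 4, N_2 = 3; nonnegativity via exponentiation
rng = np.random.default_rng(0)
weights = [np.exp(rng.normal(size=(4, 1))), np.exp(rng.normal(size=(3, 4)))]
biases = [rng.normal(size=4), rng.normal(size=3)]
a_last = np.exp(rng.normal(size=3))
print(k_monotone_net(0.5, weights, biases, a_last, K=2))
```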
## Residual Connections and ODE Flows
In conventional neural networks, a popular module is the so-called 'residual connection', given by
\begin{align}
\text{Res}_{\varepsilon, F} (x) = x + \varepsilon F (x).
\end{align}
This allows the state to evolve more gradually as it passes through the network, and can improve certain aspects of the optimisation problem.
A similar technique can be applied here. Suppose that $x \in \mathbf{R}$, and $F \in \mathcal{F}_K$. Because $x \mapsto x \in \mathcal{F}_K$ for all $K \geqslant 1$ and $\mathcal{F}_K$ is a convex cone, it is straightforward to deduce that $\text{Res}_{\varepsilon, F} \in \mathcal{F}_K$ as well.
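In code, such a block is a one-line wrapper; the sketch below uses $\exp$, which lies in every $\mathcal{F}_K$, as a stand-in for $F$.
```python
import numpy as np

def residual(F, eps):
    # Res_{eps, F}(x) = x + eps * F(x); since the identity map lies in F_K and
    # F_K is a convex cone, the result is again in F_K whenever eps >= 0
    return lambda x: x + eps * F(x)

G = residual(np.exp, eps=0.1)   # exp is in F_K for every K
print(G(1.0))                   # 1.0 + 0.1 * e
```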
A common trick in this area is to take $\varepsilon \to 0$ in the above formulation and recover an ordinary differential equation. In our context, this would correspond to an ODE
\begin{align}
\dot{x} = F(x, t)
\end{align}
where for each $t \geqslant 0$, the mapping $x \mapsto F (x, t)$ is an element of $\mathcal{F}_K$. A byproduct of our earlier analysis is that for any $t \geqslant 0$, the ODE flow which maps $x(0)$ to $x(t)$ is itself an element of $\mathcal{F}_K$.
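One informal way to see this is via the explicit Euler discretisation of the ODE,
\begin{align}
x_{n+1} = x_n + h \, F(x_n, t_n), \qquad t_n = n h,
\end{align}
in which each step is a residual map of the form above, hence in $\mathcal{F}_K$; the approximate flow map $x_0 \mapsto x_N$ is then a composition of such maps and lies in $\mathcal{F}_K$ by the lemma, and passing to the limit $h \to 0$ (under suitable regularity, which I will not spell out here) gives the claim for the exact flow.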
## Towards High Dimensions
The above constructions are very clean, at least partially because they are essentially one-dimensional in nature. While convex mappings from $\mathbf{R}^d$ to $\mathbf{R}$ exist, they cannot be composed in such interesting ways, and one eventually returns to dealing with convex mappings from $\mathbf{R}$ to itself. This seems to be roughly what happens in the line of research on Input-Convex Neural Networks, though it is not always framed this way.
An exciting avenue is to develop multivariate notions of (higher-order) monotonicity which behave nicely under composition. I am not yet aware of such notions which are practically relevant.