# The Averaging Theory Behind Narrow Neural ODEs
In this note, we consider the averaging theory behind approximation properties of residual networks. It turns out that the approximation theory for neural networks of arbitrary width but limited depth is well understood. For this reason, we are particularly interested in the case where the network has very limited width but arbitrary depth.
To address this question, we consider a differential equation point of view. Consider a [neural ordinary differential equation](https://arxiv.org/abs/1806.07366) (NODE) of the form
\begin{eqnarray}
\dot{x}(t) = f(x(t),\theta(t)) \\
x(0) = x_0
\end{eqnarray}Here, $\theta(t)$ are weight parameters, the time variable $t$ plays the role of depth, and $f(x,\theta)$ is some function approximator. NODEs are continuous versions of residual neural networks, where the time parameter is associated with the depth of the network.
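To make the connection with residual networks explicit, discretize the NODE in time with an explicit Euler step of size $h$ (a standard observation):
\begin{equation}
x_{k+1} = x_k + h\, f(x_k, \theta_k),
\end{equation}which is precisely a residual block with layer-dependent weights $\theta_k$; the number of steps plays the role of depth.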
A standard question to ask in this setting is what kind of maps one can approximate using the flow map $\Phi \colon x_0 \mapsto x(T)$ of this ordinary differential equation (ODE). This is the continuous version of a central question in neural network theory: *when can compositions and superpositions of a parameterized class of functions be used to approximate a given map?*
The easiest situation is when $f(x,\theta)$ can approximate any vector field $V(x)$. In that case, one can conclude that the flow map of any differential equation
\begin{eqnarray}
\dot{x}(t) = V(x(t),t) \\
x(0) = x_0
\end{eqnarray}can be approximated using that of the neural ODE. In particular, approximate $V(x,t)$ by $f(x,\theta(t))$ for each $t$; standard perturbation results for ODEs then show that the flow of the reference differential equation is approximated by that of the NODE.
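The standard result alluded to here is a Grönwall-type perturbation estimate; one common form (stated for completeness, with $x(t)$ the NODE solution and $y(t)$ the reference solution from the same initial condition) is the following: if $V$ is Lipschitz in $x$ with constant $L$, uniformly in $t$, and $\sup_{x,t}|f(x,\theta(t)) - V(x,t)| \le \delta$, then
\begin{equation}
|x(t)-y(t)| \leq \frac{\delta}{L}\left(e^{Lt}-1\right),
\end{equation}so the two flows stay close on compact time intervals as $\delta \to 0$.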
An interesting situation arises when $f(x,\theta)$ is *skinny* and of the form (in the one-dimensional case)
\begin{eqnarray}
\dot{x}(t) = a(t)\sigma(w(t)x(t)+b(t)) \\
x(0) = x_0
\end{eqnarray} where $\sigma$ is an activation function, and $\theta(t):=\big (a(t),w(t),b(t) \big )$ are time-dependent scalars. In this case, it is no longer true that $f(x,\theta) = a\sigma(wx+b)$ can approximate any vector field $V(x)$. We call this neural ODE *narrow* or *skinny* because at any time $t$ only one neuron is used to manipulate the flow of the ODE.
Compare this to the *thicker version* of the neural ODE,
\begin{eqnarray}
\dot{x}(t) = \sum_{i}a_i(t)\sigma(w_i(t)x(t)+b_i(t))
\end{eqnarray}One clearly has more control here due to the presence of multiple neurons at any given time instant.
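As a concrete illustration, here is a minimal numerical sketch (my own, not from the note) of the narrow and wide right-hand sides in one dimension, using $\tanh$ as a placeholder activation and constant-in-time weights, integrated with `scipy`:

```python
# Minimal sketch (not from the note): the narrow and wide right-hand sides
# in one dimension, with tanh activation and constant-in-time weights.
import numpy as np
from scipy.integrate import solve_ivp

def narrow_rhs(t, x, a, w, b):
    # One neuron at every time instant: a * sigma(w x + b).
    return a * np.tanh(w * x + b)

def wide_rhs(t, x, a, w, b):
    # Several neurons acting at the same time instant.
    return sum(ai * np.tanh(wi * x + bi) for ai, wi, bi in zip(a, w, b))

x0 = [1.0]
sol_narrow = solve_ivp(narrow_rhs, (0.0, 1.0), x0, args=(0.5, 2.0, -1.0))
sol_wide = solve_ivp(wide_rhs, (0.0, 1.0), x0,
                     args=([0.5, -0.3], [2.0, 1.0], [-1.0, 0.5]))
print(sol_narrow.y[0, -1], sol_wide.y[0, -1])
```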
**From wide neural network to narrow neural network**
For a sufficiently rich class of activation functions $\sigma$, $V(x)$ can be approximated by sums of the form $\sum_{i}a_i\sigma(w_ix+b_i)$, which are single-layer neural networks. So one can ask the following intermediate question:
*Can the flow of the differential equation*
\begin{eqnarray}
\dot{x}(t) = a(t)\sigma(w(t)x(t)+b(t))
\end{eqnarray}*approximate the flow of the differential equation*
\begin{eqnarray}
\dot{x}(t) = \sum_{i}a_i\sigma(w_ix(t)+b_i)
\end{eqnarray}?
If the answer to this question is affirmative, one can then use the well-known fact from the theory of neural networks that $\sum_{i}a_i\sigma(w_ix+b_i)$ can approximate arbitrary vector fields $V(x)$.
To answer this question, we take a detour through the averaging theory of dynamical systems.
**Averaging Theory of Dynamical Systems**
Consider the behavior of systems with very high-frequency oscillations, like the flapping wings of a hummingbird.

Averaging theory considers the question of when a differential equation with a highly oscillatory vector field $V$,
\begin{equation}
\dot{x}(t) = V(x,\frac{t}{\epsilon})
\end{equation}can be approximated by the differential equation for the corresponding time-averaged vector field (here $V$ is assumed $T$-periodic in its time argument),
\begin{equation}
\dot{y}(t) = \frac{1}{T} \int^T_0 V(y,t)dt \equiv \bar{V}(y)
\end{equation}
Can a non-autonomous system like a flapping wing be understood through its time-averaged approximation? It turns out this is indeed possible in many situations; for the averaging theory of flapping wings, [see here](https://ieeexplore.ieee.org/document/1668260). A typical *averaging theorem* says that the solutions of the two systems are close under some mild conditions on $V$, for $\epsilon$ close to $0$; see the [method of averaging](https://en.wikipedia.org/wiki/Method_of_averaging).
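As a quick sanity check of the averaging statement, here is a small illustrative example (my own choice of $V$, not from the references): take $V(x,\tau) = -\sin^2(\tau)\, x$, which is $\pi$-periodic in $\tau$ and has time average $-x/2$; for small $\epsilon$ the oscillatory solution stays close to the averaged one.

```python
# Minimal numerical check (illustrative only): compare the fast-oscillating
# system dx/dt = V(x, t/eps) with the averaged system dy/dt = -y/2.
import numpy as np
from scipy.integrate import solve_ivp

eps = 0.01

def oscillatory(t, x):
    return -np.sin(t / eps) ** 2 * x          # V(x, t/eps)

def averaged(t, y):
    return -0.5 * y                           # time average of V over one period

x0 = [1.0]
sol_osc = solve_ivp(oscillatory, (0.0, 2.0), x0, max_step=eps / 10)
sol_avg = solve_ivp(averaged, (0.0, 2.0), x0, max_step=0.01)
print(abs(sol_osc.y[0, -1] - sol_avg.y[0, -1]))   # small, and shrinks with eps
```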
Another instance of such a result arises in control theory and in the numerical approximation of ODEs, under the guise of controllability and operator splitting, respectively, where one asks how the solution of a differential equation of the form
\begin{equation}
\dot{x}(t) = \sum_iV_i(x)
\end{equation}can be approximated by sequentially switching between the differential equations associated with the individual vector fields \begin{equation}
\dot{x}(t) = V_i(x)
\end{equation}For two vector fields $V_1, V_2$ one can say things like $e^{tV_2} e^{tV_1} \approx e^{t(V_1+V_2)}$ for small $t$, where the exponential notation $e^{tV}$ denotes the flow map of $V$ evaluated at time $t$. See, for example, the [Baker-Campbell-Hausdorff formula](https://en.wikipedia.org/wiki/Baker%E2%80%93Campbell%E2%80%93Hausdorff_formula).
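A small sketch of this splitting idea (an example of my own, not from the text): compose the flows of two simple non-commuting vector fields over a short time $t$ and compare with the flow of their sum.

```python
# Sketch of the splitting idea (assumed example): composing the individual
# flows of V1 and V2 for a short time t approximates the flow of V1 + V2.
import numpy as np
from scipy.integrate import solve_ivp

def V1(t, x):
    return np.array([x[1], 0.0])          # shear along the first coordinate

def V2(t, x):
    return np.array([0.0, -x[0]])         # shear along the second coordinate

def flow(V, x, t):
    # Flow map e^{tV} applied to x, computed numerically.
    return solve_ivp(V, (0.0, t), x, rtol=1e-10, atol=1e-12).y[:, -1]

x0 = np.array([1.0, 1.0])
t = 0.01
split = flow(V2, flow(V1, x0, t), t)                        # e^{tV2} e^{tV1} x0
combined = flow(lambda s, x: V1(s, x) + V2(s, x), x0, t)    # e^{t(V1+V2)} x0
print(np.linalg.norm(split - combined))                     # O(t^2) discrepancy
```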
**Averaging Theory for Narrow NODEs**
One can apply the same idea to study the approximation properties of skinny or narrow neural ODEs.
By using the averaging principle, one can say the following:
***Informal Theorem** The flow of the differential equation*
\begin{eqnarray}
\dot{x}(t) = a(t)\sigma(w(t)x(t)+b(t))
\end{eqnarray}*can approximate arbitrarily well flows of differential equations of the form*
\begin{eqnarray}
\dot{x}(t) = \sum_{i}a_i\sigma(w_ix(t)+b_i).
\end{eqnarray}
**Sketch of Proof**
We show how the result applies to a specific example; the generalization is straightforward. Define the time-periodic vector field
\begin{equation}
Q(x,t+mT) = \begin{cases}a_1\sigma(w_1x+b_1), ~~ t \in [0,T/2) \\ a_2\sigma(w_2x+b_2), ~~ t \in [T/2,T)
\end{cases}
\end{equation} for $t \in [0,T)$ and $m = 0, 1, 2, \ldots$, so that $Q$ is $T$-periodic in time.
Consider the non-autonomous ODE,
\begin{equation}
\dot{x}(t) = Q(x,nt)
\end{equation} Then from averaging theory it follows that, as $n \to \infty$, the solution of this ODE converges on compact time intervals to the solution of the time-averaged ODE
\begin{align}
\dot{y}(t) &= \frac{1}{T}\int_0^T Q(y(t),\tau)d\tau \\
&= \frac{1}{2}a_1\sigma(w_1 y(t) +b_1) + \frac{1}{2}a_2\sigma(w_2 y(t) +b_2)
\end{align} As visualized in the schematic below, $Q(x,nt)$ is at every time instant equal to $a_i\sigma(w_ix+b_i)$ for some $i \in \{1,2\}$; in other words, the switched system is itself a narrow NODE with piecewise-constant parameters $\theta(t)$. Hence, the result follows.
*Figure: Example sequence of oscillatory vector fields approximating the behavior of the time average, obtained by switching between two different sets of weights over shorter and shorter time intervals.*
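To make the switching construction concrete, here is a small numerical sketch (with illustrative weights of my own choosing): $Q$ alternates between two single-neuron fields over half-periods, and the fast-switched flow approaches the flow of the two-neuron average as the switching rate $n$ grows.

```python
# Numerical sketch of the switching construction (illustrative weights):
# Q alternates between two single-neuron fields on half-periods, and the
# fast-switched flow approaches the flow of the two-neuron average.
import numpy as np
from scipy.integrate import solve_ivp

a1, w1, b1 = 1.0, 2.0, 0.0
a2, w2, b2 = -0.5, 1.0, 1.0
T = 1.0                                   # switching period

def Q(x, tau):
    # One neuron on [0, T/2), the other on [T/2, T), extended T-periodically.
    if (tau % T) < T / 2:
        return a1 * np.tanh(w1 * x + b1)
    return a2 * np.tanh(w2 * x + b2)

def switched(t, x, n):
    return Q(x, n * t)                    # fast switching Q(x, n t)

def averaged(t, y):
    return 0.5 * a1 * np.tanh(w1 * y + b1) + 0.5 * a2 * np.tanh(w2 * y + b2)

x0 = [0.3]
y_avg = solve_ivp(averaged, (0.0, 2.0), x0).y[0, -1]
for n in (1, 10, 100):
    y_n = solve_ivp(switched, (0.0, 2.0), x0, args=(n,), max_step=0.25 / n).y[0, -1]
    print(n, abs(y_n - y_avg))            # error shrinks as n grows
```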
Therefore, if sums of the form $\sum_{i}a_i\sigma(w_ix+b_i)$ can approximate a function $V(x)$, one obtains the following approximation property.
*Figure: Two-dimensional vector field of neural ODEs for two different choices of weights (blue and red) and its time-averaged flow (green).*
***Informal Theorem** For a sufficiently rich class of activation functions $\sigma$, the flow of the differential equation*
\begin{eqnarray}
\dot{x}(t) = a(t)\sigma(w(t)x(t)+b(t))
\end{eqnarray}*can approximate arbitrarily well flows of differential equations of the form*
\begin{eqnarray}
\dot{x}(t) = V(x,t).
\end{eqnarray}
**Numerical Visualization**
This result is visualized in the following plots, generated by training a two-dimensional narrow neural ODE in Python. Here the reference flow of a two-dimensional system takes initial conditions across straight lines, and the neural ODE tries to approximate this behavior by simultaneously guiding the initial conditions along the respective lines. The jagged lines are the solutions of the neural ODE.
*Figure: Neural ODE guiding two solutions (red) along straight lines (green).*
*Figure: Neural ODE guiding five solutions (red) along straight lines (green). Since the narrow NODE is trying to guide five solutions simultaneously, it does a poorer job than in the case of two solutions.*
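For reference, here is a minimal sketch of what such a training script might look like; this is an assumption on my part (explicit Euler discretization, piecewise-constant weights $\theta_k = (a_k, w_k, b_k)$, plain PyTorch autograd), not the script used to produce the plots above.

```python
# Minimal sketch (assumed, not the original script): train a two-dimensional
# narrow neural ODE, discretized with explicit Euler, to steer several
# initial conditions to targets with a single shared parameter path.
import torch

K, h = 50, 0.02                                  # Euler steps, step size
a = (0.1 * torch.randn(K, 2)).requires_grad_()   # a_k in R^2
w = (0.1 * torch.randn(K, 2)).requires_grad_()   # w_k in R^2
b = (0.1 * torch.randn(K)).requires_grad_()      # b_k in R

def flow(x):
    # x: (batch, 2); narrow NODE step x <- x + h * a_k * tanh(<w_k, x> + b_k)
    for k in range(K):
        x = x + h * a[k] * torch.tanh(x @ w[k] + b[k]).unsqueeze(-1)
    return x

x0 = torch.tensor([[1.0, 0.0], [0.0, 1.0]])      # initial conditions (illustrative)
xT = torch.tensor([[2.0, 0.0], [0.0, 2.0]])      # target final conditions

opt = torch.optim.Adam([a, w, b], lr=1e-2)
for step in range(2000):
    opt.zero_grad()
    loss = ((flow(x0) - xT) ** 2).mean()         # one loss, one shared theta(t)
    loss.backward()
    opt.step()
print(flow(x0))
```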
**Some Control Theory Weirdness**
If one is familiar with [controllability](https://en.wikipedia.org/wiki/Controllability), note that one can think of this flow approximation problem as a controllability problem. However, one is trying to control *multiple initial conditions* to their goal final conditions using the same control input $\theta(t)$. In control theory parlance, this is an instance of simultaneous controllability, which is significantly more challenging than the classical setting, where one steers a single initial condition to a single final condition. The fact that NODEs allow one to control not just individual initial conditions, but the flows themselves, using open-loop controls is wild!
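In symbols, the simultaneous control problem asks: given distinct initial states $x_0^1,\dots,x_0^N$ and target states $x_T^1,\dots,x_T^N$, find a *single* parameter path $\theta(\cdot)$ such that
\begin{equation}
\Phi_{\theta}(x_0^j) \approx x_T^j \quad \text{for all } j = 1,\dots,N,
\end{equation}where $\Phi_{\theta}$ denotes the flow map of the NODE at time $T$ under the parameters $\theta(\cdot)$.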
**References**
Some of the first investigations of the approximation properties of narrow neural ODEs can be found in
*[Universal approximation power of deep residual neural networks through the lens of control - Paulo Tabuada, Bahman Gharesifard](https://ieeexplore.ieee.org/document/9827563)*
*[Neural ODE control for classification, approximation, and transport - Domènec Ruiz-Balet, Enrique Zuazua](https://epubs.siam.org/doi/10.1137/21M1411433)*
The averaging point of view that is presented here has been formalized through **weak convergence** arguments in,
*[Neural ODE control for trajectory approximation of continuity equation - Karthik Elamvazhuthi, Bahman Gharesifard, Andrea L. Bertozzi, Stanley Osher](https://ieeexplore.ieee.org/abstract/document/9794313?casa_token=ORFYNgRaHjEAAAAA:zIRiN2h16ekrUtLgjz3Isn1Rnfd3KB_B01eT8YlwcHMuD_A9VxJjM1OcHu7EKpZnhMCoWtRB-w)*
*[Learning on Manifolds: Universal Approximations Properties using Geometric Controllability Conditions for Neural ODEs](https://proceedings.mlr.press/v211/elamvazhuthi23a.html)*