# Notes on Deep Equilibrium Models

#### Author: [Sharath Chandra](https://sharathraparthy.github.io/)

## [Paper Link](https://arxiv.org/pdf/1909.01377.pdf)

Present-day deep learning architectures are built from explicit layers, which model the relationship between inputs and outputs through a well-defined computational graph. For example, in a feed-forward network with one hidden layer and a sigmoid non-linearity, the explicit hidden layer performs a specified computation to produce its output:

$$ z_1 = \sigma(W_1x + b_1) $$

In deep networks these explicit layers are stacked together and trained using backpropagation. However, backpropagation requires storing the intermediate activations, which places a heavy burden on memory given the capacity of modern deep learning architectures. This problem has driven a new research direction aimed at being memory efficient without compromising the representational power that current deep architectures offer. Instead of specifying the computation a layer should perform, one can consider another class of layers where we only specify the conditions that the layer's inputs and outputs should jointly satisfy.

**Implicit layers**, unlike explicit layers, define conditions that the inputs and outputs must satisfy. To understand this, consider a weight-tied neural network applied to an input $x$. A weight-tied network shares the same parameters across all layers, and such networks have been shown to work well in practice. The output of any hidden layer $L_i$ can be written as $z_i = \sigma(W z_{i - 1} + x)$, where $W$ is the weight matrix common to all layers, $z_{i-1}$ is the output of layer $L_{i-1}$ and $\sigma$ is the nonlinear activation function. One can view this as a discrete-time dynamical system in which the hidden state $z_i$ evolves (discretely) through time. Assuming such a dynamical system has a fixed point, one can define an implicit layer as the solution to the following nonlinear system:

$$ z^\star = \sigma(Wz^\star + x) $$

In fact, this can be extended to any nonlinear function $f$. Given this structure of an implicit layer ($z^\star = f(z^\star, x, \theta)$, where $\theta$ are the parameters of the implicit layer), our aim is to compute the fixed point $z^\star$ using any fixed-point solver. Since we are interested in memory efficiency, we need a mechanism to backpropagate through the layer without storing any intermediate values. This can be achieved with the implicit function theorem \cite{implicit}, which directly gives the total derivative of the loss function $l(\theta)$:

$$ \partial_\theta l(\theta) = \partial_{z^\star} l(z^\star, y)\left(I - \partial_{z^\star}f(z^\star, x, \theta)\right)^{-1}\partial_{\theta} f(z^\star, x, \theta) $$

Deep equilibrium models (DEQs) leverage this idea of implicit layers in more general settings, where $f$ can be a ResNet block, a transformer block, etc., and use the implicit function theorem to carry out backpropagation. In the forward pass, a DEQ computes the equilibrium point $z^\star$ using a fixed-point solver such as Anderson acceleration, and then computes the loss, which depends on this fixed point. Because the gradients are invariant to the choice of fixed-point solver, any solver can be used to find the fixed point. During backpropagation, the DEQ computes the gradients using the equation above.
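
To make this forward/backward recipe concrete, here is a minimal NumPy sketch (my own illustration, not the paper's code). The toy cell $f(z, x) = \tanh(Wz + Ux + b)$, the quadratic loss, and the naive fixed-point iteration are all assumptions made for this example; the paper uses richer cells and faster solvers such as Broyden's method or Anderson acceleration.

```python
# Minimal DEQ-style sketch: forward pass finds the fixed point z* of a toy
# weight-tied cell f(z, x) = tanh(W z + U x + b); backward pass gets dl/dW
# from the implicit function theorem, without storing any iterates.
import numpy as np

rng = np.random.default_rng(0)
d = 5
W = rng.standard_normal((d, d))
W = 0.9 * W / np.linalg.norm(W, 2)   # rescale so the map is a contraction
U = rng.standard_normal((d, d))
b = rng.standard_normal(d)
x = rng.standard_normal(d)
y = rng.standard_normal(d)           # regression target for the toy loss

def f(z):
    return np.tanh(W @ z + U @ x + b)

# Forward: plain fixed-point iteration z <- f(z) until convergence.
z = np.zeros(d)
for _ in range(500):
    z_next = f(z)
    if np.linalg.norm(z_next - z) < 1e-12:
        break
    z = z_next
z_star = z

# Toy loss l = 0.5 * ||z* - y||^2, so dl/dz* = z* - y.
dl_dz = z_star - y

# Implicit function theorem: dl/dW = dl/dz* (I - df/dz*)^{-1} df/dW.
s = 1.0 - np.tanh(W @ z_star + U @ x + b) ** 2   # tanh'(.) at the fixed point
J = s[:, None] * W                                # df/dz* = diag(s) W
v = np.linalg.solve((np.eye(d) - J).T, dl_dz)     # v = (I - J)^{-T} dl/dz*
dl_dW = (v * s)[:, None] * z_star[None, :]        # df_i/dW_ij = s_i z*_j

# Sanity check against a finite difference on one entry of W.
def loss_for(W_pert):
    z = np.zeros(d)
    for _ in range(500):
        z = np.tanh(W_pert @ z + U @ x + b)
    return 0.5 * np.sum((z - y) ** 2)

eps = 1e-6
W_pert = W.copy()
W_pert[0, 0] += eps
fd = (loss_for(W_pert) - loss_for(W)) / eps
print("IFT gradient:", dl_dW[0, 0], " finite difference:", fd)
```

Note that the backward pass only needs the converged $z^\star$ and one linear solve; none of the forward iterates are kept, which is exactly the memory saving the notes describe.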
One can view a DEQ model as a single-layer neural network that can represent any feed-forward network. Moreover, because the network is weight-tied, there is no exponential blow-up of parameters of the kind that appears in single-layer universal-approximation constructions. Another view is to regard each fixed-point iteration step as the instantiation of one more layer, yielding an arbitrarily deep (effectively infinite-depth) network that remains memory efficient; the short sketch below checks this view numerically on a toy weight-tied map.
<!-- The paper considers one instantiation of the model $f$ that treats a "cell", rather than a single true layer, as the layer. The experiments show good performance compared with baselines built from explicit layers. -->
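
The snippet below is a small self-contained illustration of this view (again my own toy setup, not the paper's experiments): it unrolls the weight-tied map $z_i = \sigma(W z_{i-1} + x)$ from the notes above, with $\tanh$ standing in for $\sigma$, and checks that the depth-$k$ output approaches the DEQ fixed point as $k$ grows.

```python
# Each fixed-point iteration step behaves like one more layer of an
# explicitly unrolled weight-tied network; depth-k outputs converge to z*.
import numpy as np

rng = np.random.default_rng(1)
d = 5
W = rng.standard_normal((d, d))
W = 0.9 * W / np.linalg.norm(W, 2)   # rescale so the iteration contracts
x = rng.standard_normal(d)

def layer(z):
    # One weight-tied "layer": z_i = tanh(W z_{i-1} + x)
    return np.tanh(W @ z + x)

# Reference "infinite-depth" output: iterate the map until it stops moving.
z_star = np.zeros(d)
for _ in range(1000):
    z_star = layer(z_star)

# Unroll an explicit weight-tied network of depth k and compare with z_star.
z = np.zeros(d)
for k in range(1, 31):
    z = layer(z)
    if k % 10 == 0:
        print(f"depth {k:2d}: ||z_k - z*|| = {np.linalg.norm(z - z_star):.2e}")
```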